Top-K Sampling

Top-K sampling is a decoding method used in language models to control output randomness while maintaining relevance during token generation. At each step, it restricts token selection to the K most probable tokens, then samples the next token from that fixed-size subset.

Purpose

Top-K ensures that only the K most likely tokens are considered at each generation step, which prevents the model from selecting very low-probability (nonsensical) tokens while still allowing variability.

How It Works

  1. Compute probabilities for every token in the vocabulary (a softmax over the model's logits).
  2. Select the K tokens with the highest probabilities.
  3. Renormalize those K probabilities so they sum to 1.
  4. Randomly sample the next token from this Top-K set, as in the sketch below.
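
The four steps map directly onto a short function. Below is a minimal NumPy sketch; the function name top_k_sample and the example logits are illustrative, not taken from any particular library.

    import numpy as np

    def top_k_sample(logits, k, rng=np.random.default_rng()):
        """Sample one token id from the k most probable tokens.

        logits: 1-D array of unnormalized scores, one per vocabulary token.
        k:      number of highest-probability tokens to keep.
        """
        # Step 1: convert logits to probabilities with a softmax.
        shifted = logits - np.max(logits)  # shift for numerical stability
        probs = np.exp(shifted) / np.sum(np.exp(shifted))

        # Step 2: indices of the k highest-probability tokens.
        top_ids = np.argpartition(probs, -k)[-k:]

        # Step 3: renormalize the kept probabilities so they sum to 1.
        top_probs = probs[top_ids] / probs[top_ids].sum()

        # Step 4: sample the next token from the restricted distribution.
        return rng.choice(top_ids, p=top_probs)

    # Usage with illustrative logits for a 5-token vocabulary:
    logits = np.array([2.0, 1.5, 0.3, -1.0, -2.5])
    next_token = top_k_sample(logits, k=3)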

Parameter

  • K (int): Number of highest-probability tokens kept at each step.
    • Lower values → more deterministic output
    • Higher values → more random, creative output
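
To make the effect of K concrete, here is a toy 5-token distribution (the numbers are purely illustrative):

    # Illustrative probabilities for a 5-token vocabulary.
    probs = [0.50, 0.25, 0.12, 0.08, 0.05]

    # K = 1 keeps only token 0: sampling reduces to greedy decoding.
    # K = 3 keeps tokens {0, 1, 2}, renormalized to ~{0.57, 0.29, 0.14}.
    # K = 5 keeps every token: equivalent to ordinary sampling.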

Comparison: Top-K vs Top-P

Strategy   Description                                   Behavior
Top-K      Fixed number of tokens                        Consistent; may ignore the probability tail
Top-P      Dynamic set based on cumulative probability   Adaptive; may include low-probability tokens
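
For contrast, here is a Top-P sketch in the same style as the Top-K function above. Instead of keeping a fixed K tokens, it keeps the smallest set whose cumulative probability reaches p; the name top_p_sample is again illustrative.

    import numpy as np

    def top_p_sample(logits, p, rng=np.random.default_rng()):
        """Sample one token id from the smallest set of tokens whose
        cumulative probability is at least p (nucleus sampling)."""
        shifted = logits - np.max(logits)
        probs = np.exp(shifted) / np.sum(np.exp(shifted))

        # Sort tokens by descending probability, then keep the shortest
        # prefix whose cumulative probability reaches p (always >= 1 token).
        order = np.argsort(probs)[::-1]
        cutoff = np.searchsorted(np.cumsum(probs[order]), p) + 1

        kept_ids = order[:cutoff]
        kept_probs = probs[kept_ids] / probs[kept_ids].sum()
        return rng.choice(kept_ids, p=kept_probs)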

Use Cases

  • Chatbots with controlled tone
  • Structured content generation
  • Code generation (K = 1–20 typically)

Tips

  • Often used together with Temperature to fine-tune variability (see the sketch after this list).
  • A value of K = 40 is common in creative writing tasks.
  • As a rough guide: low values (1–10) produce conservative, factual outputs; medium values (20–50) balance creativity and quality; high values (50+) enable diverse, creative outputs.
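
As a sketch of the first tip, Temperature is typically applied to the logits before the Top-K filter. The helper below builds on the hypothetical top_k_sample function from the earlier sketch.

    import numpy as np  # as in the earlier sketches

    def top_k_with_temperature(logits, k, temperature=1.0,
                               rng=np.random.default_rng()):
        # Temperature rescales logits before the Top-K filter:
        # values < 1.0 sharpen the distribution, values > 1.0 flatten it.
        return top_k_sample(np.asarray(logits) / temperature, k, rng)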