Top-K Sampling

Top-K sampling is a decoding method used in language models to control output randomness while maintaining relevance during token generation. At each step, it restricts token selection to the K most probable tokens, then samples the next token from that fixed-size subset.

Purpose

Top-K ensures that only the K most likely tokens are considered at each generation step, which prevents the model from selecting very low-probability (nonsensical) tokens while still allowing variability.

How It Works

  1. Compute probabilities for every token in the vocabulary (a softmax over the model's logits).
  2. Select the K tokens with the highest probabilities.
  3. Renormalize those K probabilities so they sum to 1.
  4. Randomly sample the next token from this Top-K set, as in the sketch below.
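
The four steps map directly onto a short function. Below is a minimal NumPy sketch; the function name top_k_sample and the example logits are illustrative, not taken from any particular library.

    import numpy as np

    def top_k_sample(logits, k, rng=np.random.default_rng()):
        """Sample one token id from the k most probable tokens.

        logits: 1-D array of unnormalized scores, one per vocabulary token.
        k:      number of highest-probability tokens to keep.
        """
        # Step 1: convert logits to probabilities with a softmax.
        shifted = logits - np.max(logits)  # shift for numerical stability
        probs = np.exp(shifted) / np.sum(np.exp(shifted))

        # Step 2: indices of the k highest-probability tokens.
        top_ids = np.argpartition(probs, -k)[-k:]

        # Step 3: renormalize the kept probabilities so they sum to 1.
        top_probs = probs[top_ids] / probs[top_ids].sum()

        # Step 4: sample the next token from the restricted distribution.
        return rng.choice(top_ids, p=top_probs)

    # Usage with illustrative logits for a 5-token vocabulary:
    logits = np.array([2.0, 1.5, 0.3, -1.0, -2.5])
    next_token = top_k_sample(logits, k=3)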

Parameter

  • K (int): Number of highest-probability tokens kept at each step.
    • Lower values → more deterministic output
    • Higher values → more random, creative output
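
To make the effect of K concrete, here is a toy 5-token distribution (the numbers are purely illustrative):

    # Illustrative probabilities for a 5-token vocabulary.
    probs = [0.50, 0.25, 0.12, 0.08, 0.05]

    # K = 1 keeps only token 0: sampling reduces to greedy decoding.
    # K = 3 keeps tokens {0, 1, 2}, renormalized to ~{0.57, 0.29, 0.14}.
    # K = 5 keeps every token: equivalent to ordinary sampling.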

Comparison: Top-K vs Top-P

Strategy   Description                                   Behavior
Top-K      Fixed number of tokens                        Consistent; may ignore the probability tail
Top-P      Dynamic set based on cumulative probability   Adaptive; may include low-probability tokens
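
For contrast, here is a Top-P sketch in the same style as the Top-K function above. Instead of keeping a fixed K tokens, it keeps the smallest set whose cumulative probability reaches p; the name top_p_sample is again illustrative.

    import numpy as np

    def top_p_sample(logits, p, rng=np.random.default_rng()):
        """Sample one token id from the smallest set of tokens whose
        cumulative probability is at least p (nucleus sampling)."""
        shifted = logits - np.max(logits)
        probs = np.exp(shifted) / np.sum(np.exp(shifted))

        # Sort tokens by descending probability, then keep the shortest
        # prefix whose cumulative probability reaches p (always >= 1 token).
        order = np.argsort(probs)[::-1]
        cutoff = np.searchsorted(np.cumsum(probs[order]), p) + 1

        kept_ids = order[:cutoff]
        kept_probs = probs[kept_ids] / probs[kept_ids].sum()
        return rng.choice(kept_ids, p=kept_probs)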

Use Cases

  • Chatbots with controlled tone
  • Structured content generation
  • Code generation (K = 1–20 typically)

Tips

  • Often used together with Temperature to fine-tune variability (see the sketch after this list).
  • A value of K = 40 is common in creative writing tasks.
  • As a rough guide: low values (1–10) produce conservative, factual outputs; medium values (20–50) balance creativity and quality; high values (50+) enable diverse, creative outputs.
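
As a sketch of the first tip, Temperature is typically applied to the logits before the Top-K filter. The helper below builds on the hypothetical top_k_sample function from the earlier sketch.

    import numpy as np  # as in the earlier sketches

    def top_k_with_temperature(logits, k, temperature=1.0,
                               rng=np.random.default_rng()):
        # Temperature rescales logits before the Top-K filter:
        # values < 1.0 sharpen the distribution, values > 1.0 flatten it.
        return top_k_sample(np.asarray(logits) / temperature, k, rng)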