Sampling Parameters
Sampling parameters control how language models choose the next token during generation. They directly affect the randomness, coherence, and creativity of outputs, making them critical tools for shaping model behavior.
Core Parameters
- Temperature: Controls randomness by rescaling the probability distribution before sampling. Lower values sharpen it toward the most likely tokens (more deterministic); higher values flatten it (more diverse). See: Temperature (Sampling Parameter)
- Top-K Sampling: Limits token selection to the K most probable tokens. Reduces the risk of sampling low-probability junk while still allowing variation. See: Top-K Sampling
- Top-P Sampling (Nucleus): Samples from the smallest set of tokens whose cumulative probability exceeds P. More adaptive than Top-K because the candidate set grows or shrinks with the shape of the distribution. See: Top-P (Nucleus Sampling)
- Frequency Penalty: Decreases the likelihood of a token being selected again in proportion to how often it has already appeared in the generated text. Helps reduce repetition.
- Presence Penalty: Applies a flat penalty to any token that has already appeared, regardless of how often. Encourages topic diversity.
- Max Tokens: Caps the total number of tokens generated in a response.
- Stop Sequences: One or more sequences that, when generated, cause the model to stop producing further output.
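To make the parameters above concrete, here is a minimal sketch of one sampling step over raw logits, covering penalties, temperature, top-k, and top-p. It is an illustration under assumptions: the function names, the toy six-token vocabulary, and the numpy implementation are invented for this note rather than taken from any particular library.

```python
import numpy as np

def apply_penalties(logits, generated_ids, freq_penalty=0.0, pres_penalty=0.0):
    """Penalize tokens that already appeared in the generated text.

    Frequency penalty grows with how often a token has appeared;
    presence penalty is a flat, one-time cost per distinct token.
    """
    logits = logits.copy()
    ids, counts = np.unique(generated_ids, return_counts=True)
    logits[ids] -= freq_penalty * counts  # scales with repetition count
    logits[ids] -= pres_penalty           # flat cost once a token has appeared
    return logits

def sample_next_token(logits, temperature=1.0, top_k=0, top_p=1.0, rng=None):
    """One sampling step: temperature -> top-k -> top-p -> random draw."""
    rng = rng or np.random.default_rng()

    # Temperature rescales the logits; near 0 this approaches greedy decoding.
    logits = logits / max(temperature, 1e-8)

    # Top-k: mask out everything below the k-th largest logit.
    if top_k > 0:
        cutoff = np.sort(logits)[-min(top_k, logits.size)]
        logits = np.where(logits < cutoff, -np.inf, logits)

    # Softmax over the surviving tokens.
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()

    # Top-p (nucleus): keep the smallest set whose cumulative probability exceeds P.
    if top_p < 1.0:
        order = np.argsort(probs)[::-1]
        cumulative = np.cumsum(probs[order])
        keep = cumulative - probs[order] < top_p  # always keeps the top token
        mask = np.zeros_like(probs, dtype=bool)
        mask[order[keep]] = True
        probs = np.where(mask, probs, 0.0)
        probs /= probs.sum()

    return int(rng.choice(probs.size, p=probs))

# Toy example: six-token vocabulary, token 3 already generated twice.
logits = np.array([2.0, 1.5, 1.0, 0.8, 0.2, -1.0])
logits = apply_penalties(logits, generated_ids=[3, 3], freq_penalty=0.5, pres_penalty=0.3)
print(sample_next_token(logits, temperature=0.7, top_k=4, top_p=0.9))
```

Note the ordering: penalties are applied to the logits before temperature, and top-k runs before top-p here. Real inference stacks differ on this ordering, which is itself an implementation choice.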
Tuning Strategy
- Use low temperature + top-k for accuracy and structure.
- Use higher temperature + top-p for creativity or ideation tasks.
- Combine with penalties to manage repetition and focus.
- A balanced starting point is temperature 0.2, top-p 0.95, top-k 30 for coherent but creative results.
- Use temperature 0 for factual tasks; this collapses sampling into greedy decoding. (These regimes are illustrated in the preset sketch below.)
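The regimes above can be captured as simple presets. Apart from the balanced values quoted in the list, the numbers below are illustrative assumptions, not canonical settings:

```python
# Illustrative presets for the tuning strategies above. Apart from the
# "balanced" values quoted in the list, the numbers are assumptions.
PRESETS = {
    "structured": {"temperature": 0.1, "top_k": 20, "top_p": 1.0},   # accuracy and structure
    "creative":   {"temperature": 0.9, "top_k": 0,  "top_p": 0.99},  # ideation; top-k disabled
    "balanced":   {"temperature": 0.2, "top_k": 30, "top_p": 0.95},  # starting point from above
    "factual":    {"temperature": 0.0, "top_k": 1,  "top_p": 1.0},   # greedy decoding
}
```

Each preset plugs directly into the sampling sketch above, e.g. `sample_next_token(logits, **PRESETS["balanced"])`.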
Practical Use Cases
- Chatbots: Tune for consistency and tone.
- Content Generation: Balance creativity and relevance.
- Code Completion: Favor deterministic generation for reliability.
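For code completion, a deterministic request might look like the sketch below. It assumes the OpenAI Python SDK purely as one familiar example; the model name and prompt are placeholders, and other providers expose equivalent knobs.

```python
# Hypothetical code-completion request favoring determinism.
# Assumes the OpenAI Python SDK; model name and prompt are placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model
    messages=[{"role": "user", "content": "Complete: def fib(n):"}],
    temperature=0,        # deterministic, effectively greedy decoding
    max_tokens=256,       # cap on generated output length
    stop=["\n\n"],        # stop sequence: halt at the first blank line
)
print(response.choices[0].message.content)
```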
Related Notes
- Temperature (Sampling Parameter)
- Top-K Sampling
- Top-P (Nucleus Sampling)
- Frequency Penalty
- Presence Penalty
- Prompt Engineering Roadmap
- Greedy Decoding