Context Window

The context window refers to the maximum number of tokens a language model can process in a single request. It includes both the input prompt and the generated output.

Why It Matters

LLMs have a fixed memory window. If the total number of tokens exceeds this limit, the request is typically rejected or, in many chat applications, the oldest tokens are silently dropped, which can lead to loss of important context and degraded output quality.
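
When an application manages a running conversation, it usually has to enforce this limit itself. A minimal sketch of one common strategy, dropping the oldest messages first, is shown below; count_tokens stands in for whatever tokenizer matches the target model and is an assumption, not part of any particular API.

    def trim_to_window(messages, max_tokens, count_tokens):
        """Drop the oldest messages until the conversation fits the window.

        messages:     list of strings, oldest first
        count_tokens: callable returning the token count of one string (assumed helper)
        """
        trimmed = list(messages)
        while trimmed and sum(count_tokens(m) for m in trimmed) > max_tokens:
            trimmed.pop(0)  # discard the oldest message first
        return trimmed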

Example

  • Model: GPT-4-1106-preview
  • Max context window: 128,000 tokens
  • If your input prompt uses 100,000 tokens, at most 28,000 tokens remain for the generated response (in practice the model's separate maximum-output limit, 4,096 tokens for this model, caps it sooner), as the sketch below shows.
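
Expressed as a quick budgeting check (the numbers mirror the example above):

    CONTEXT_WINDOW = 128_000   # total budget shared by input and output
    prompt_tokens = 100_000    # tokens consumed by the input prompt

    # Whatever remains is the most the model could generate in this call.
    max_output = CONTEXT_WINDOW - prompt_tokens
    print(max_output)  # 28000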

Implications

  • Prompt Engineering: You must strategically fit your instructions, examples, and context within the limit.
  • RAG Pipelines: Retrieved content must be chunked to fit within the available space (see the chunking sketch after this list).
  • Long Conversations: Older messages may be forgotten unless explicitly re-injected.
  • Streaming / Iterative Outputs: Large tasks may need to be broken up across multiple calls.
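
For the RAG case above, retrieved documents are typically split into token-bounded chunks before they are placed in the prompt. A rough sketch, assuming encode and decode functions taken from whatever tokenizer matches the target model:

    def chunk_by_tokens(text, encode, decode, chunk_size=500):
        """Split text into chunks of at most chunk_size tokens each."""
        token_ids = encode(text)
        return [
            decode(token_ids[i:i + chunk_size])
            for i in range(0, len(token_ids), chunk_size)
        ]

With tiktoken, encode and decode would come from the same encoding object; the chunk_size of 500 is an arbitrary illustration, not a recommended value.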

Token Budgeting Tips

  • Pre-tokenize your input using tools like tiktoken or Hugging Face tokenizers so you can measure prompts before sending them (see the example at the end of this section).
  • Keep prompts concise and remove redundant content.
  • Consider truncating or summarizing less-relevant context.
  • Structure your prompt to prioritize high-value content near the end (least likely to be truncated).
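
For example, a pre-send check with tiktoken might look like the sketch below; the model name, window size, and 1,000-token headroom are illustrative choices, not fixed values:

    import tiktoken

    enc = tiktoken.encoding_for_model("gpt-4")
    prompt = "Summarize the following document: ..."

    n_tokens = len(enc.encode(prompt))
    print(f"Prompt uses {n_tokens} tokens")

    # Leave some headroom for the response before sending the request.
    CONTEXT_WINDOW = 128_000
    if n_tokens > CONTEXT_WINDOW - 1_000:
        raise ValueError("Prompt leaves too little room for the response")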