Context Window

The context window refers to the maximum number of tokens a language model can process in a single request. It includes both the input prompt and the generated output.

Why It Matters

LLMs have a fixed memory window. If the total number of tokens exceeds this limit, the request is typically rejected or, in many chat applications, the oldest tokens are silently dropped, which can lead to loss of important context and degraded output quality.
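
When an application manages a running conversation, it usually has to enforce this limit itself. A minimal sketch of one common strategy, dropping the oldest messages first, is shown below; count_tokens stands in for whatever tokenizer matches the target model and is an assumption, not part of any particular API.

    def trim_to_window(messages, max_tokens, count_tokens):
        """Drop the oldest messages until the conversation fits the window.

        messages:     list of strings, oldest first
        count_tokens: callable returning the token count of one string (assumed helper)
        """
        trimmed = list(messages)
        while trimmed and sum(count_tokens(m) for m in trimmed) > max_tokens:
            trimmed.pop(0)  # discard the oldest message first
        return trimmed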

Example

  • Model: GPT-4-1106-preview
  • Max context window: 128,000 tokens
  • If your input prompt uses 100,000 tokens, at most 28,000 tokens remain for the generated response (in practice the model's separate maximum-output limit, 4,096 tokens for this model, caps it sooner), as the sketch below shows.
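
Expressed as a quick budgeting check (the numbers mirror the example above):

    CONTEXT_WINDOW = 128_000   # total budget shared by input and output
    prompt_tokens = 100_000    # tokens consumed by the input prompt

    # Whatever remains is the most the model could generate in this call.
    max_output = CONTEXT_WINDOW - prompt_tokens
    print(max_output)  # 28000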

Implications

  • Prompt Engineering: You must strategically fit your instructions, examples, and context within the limit.
  • RAG Pipelines: Retrieved content must be chunked to fit within the available space (see the chunking sketch after this list).
  • Long Conversations: Older messages may be forgotten unless explicitly re-injected.
  • Streaming / Iterative Outputs: Large tasks may need to be broken up across multiple calls.
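
For the RAG case above, retrieved documents are typically split into token-bounded chunks before they are placed in the prompt. A rough sketch, assuming encode and decode functions taken from whatever tokenizer matches the target model:

    def chunk_by_tokens(text, encode, decode, chunk_size=500):
        """Split text into chunks of at most chunk_size tokens each."""
        token_ids = encode(text)
        return [
            decode(token_ids[i:i + chunk_size])
            for i in range(0, len(token_ids), chunk_size)
        ]

With tiktoken, encode and decode would come from the same encoding object; the chunk_size of 500 is an arbitrary illustration, not a recommended value.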

Token Budgeting Tips

  • Pre-tokenize your input using tools like tiktoken or Hugging Face tokenizers so you can measure prompts before sending them (see the example at the end of this section).
  • Keep prompts concise and remove redundant content.
  • Consider truncating or summarizing less-relevant context.
  • Structure your prompt to prioritize high-value content near the end (least likely to be truncated).
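
For example, a pre-send check with tiktoken might look like the sketch below; the model name, window size, and 1,000-token headroom are illustrative choices, not fixed values:

    import tiktoken

    enc = tiktoken.encoding_for_model("gpt-4")
    prompt = "Summarize the following document: ..."

    n_tokens = len(enc.encode(prompt))
    print(f"Prompt uses {n_tokens} tokens")

    # Leave some headroom for the response before sending the request.
    CONTEXT_WINDOW = 128_000
    if n_tokens > CONTEXT_WINDOW - 1_000:
        raise ValueError("Prompt leaves too little room for the response")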