Tokens
Tokens are the basic units of text that Large Language Models (LLMs) use for processing and generation. A token can represent a full word, part of a word (subword), punctuation, or even whitespace—depending on the model’s tokenizer.
Purpose
LLMs don’t operate on raw text—they operate on sequences of tokens. All generation, memory limits, and costs are measured in terms of tokens.
How Tokenization Works
- Input text is broken down using a tokenizer.
- Common schemes: Byte Pair Encoding (BPE), WordPiece, SentencePiece.
- For example, “ChatGPT is great!” → ["Chat", "G", "PT", " is", " great", "!"] (see the snippet below).
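A minimal sketch of this split using the open-source tiktoken library; the cl100k_base encoding is an assumption here, and other tokenizers (BPE variants, WordPiece, SentencePiece) will produce different splits.

```python
import tiktoken

# Load an encoding and tokenize the example sentence.
enc = tiktoken.get_encoding("cl100k_base")  # assumed encoding; match your model
token_ids = enc.encode("ChatGPT is great!")

# Decode each id individually to see the text piece it represents.
pieces = [enc.decode([t]) for t in token_ids]
print(token_ids)  # integer ids the model actually consumes
print(pieces)     # text pieces, roughly like ['Chat', 'G', 'PT', ' is', ' great', '!']
```

Decoding the full id list with enc.decode(token_ids) reconstructs the original string.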
Why Tokens Matter
- Prediction Unit: LLMs predict one token at a time.
- Cost Unit: OpenAI, Anthropic, etc., bill usage per token.
- Limit Unit: Each model has a max token window (e.g., 4k, 8k, 128k).
- Performance Impact: More tokens mean more compute and higher latency.
Estimating Token Counts
- English: ~1 token ≈ ¾ of a word.
- 100 tokens ≈ 75 words ≈ 5–7 sentences (rough estimate; checked against an exact count in the snippet after this list).
- Tools: OpenAI’s tokenizer, tiktoken (Python lib), Hugging Face tokenizers.
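As a quick check of the ¾-of-a-word heuristic, the snippet below compares the estimate against an exact count from tiktoken; the encoding choice is an assumption, so use the one matching your model.

```python
import tiktoken

text = "Tokens are the basic units of text that language models process."

# Exact count from the tokenizer.
enc = tiktoken.get_encoding("cl100k_base")  # assumed encoding
exact = len(enc.encode(text))

# Heuristic: ~1 token per 3/4 of a word, i.e. tokens ≈ words / 0.75.
estimate = round(len(text.split()) / 0.75)

print(f"heuristic estimate: {estimate}, exact count: {exact}")
```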
Use Cases
- Token counting for cost estimation.
- Truncating or chunking input to fit within model limits (a chunking sketch follows this list).
- Token-level control in generation tasks (e.g., summaries, classification).
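A small illustrative helper for the chunking use case; chunk_by_tokens is a hypothetical function written for this note, not part of any library, and it splits purely on token count without respecting sentence boundaries.

```python
import tiktoken

def chunk_by_tokens(text: str, max_tokens: int, encoding: str = "cl100k_base") -> list[str]:
    """Split text into pieces of at most max_tokens tokens each."""
    enc = tiktoken.get_encoding(encoding)
    ids = enc.encode(text)
    # Decode fixed-size slices of the id list back into text chunks.
    return [enc.decode(ids[i:i + max_tokens]) for i in range(0, len(ids), max_tokens)]

# Usage: break a long document into ~50-token chunks.
chunks = chunk_by_tokens("A long document to be chunked. " * 200, max_tokens=50)
print(len(chunks), "chunks; first chunk starts with:", chunks[0][:60])
```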
Practical Considerations
- Always account for both input and output tokens in API usage (see the budgeting sketch after this list).
- Use tokenizer libraries to test and preview token behavior.
- Prompt structure can greatly affect tokenization and model efficiency.
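A budgeting sketch for the input/output point above: the prompt and the completion share one context window, so the room left for output shrinks as the prompt grows. The window size and reserve below are illustrative numbers, not tied to a specific model.

```python
import tiktoken

CONTEXT_WINDOW = 8_000   # assumed model limit, in tokens
RESERVED_OUTPUT = 1_000  # tokens to keep free for the model's reply

enc = tiktoken.get_encoding("cl100k_base")  # assumed encoding
prompt = "Summarize the following report:\n..."  # stand-in for a real prompt
prompt_tokens = len(enc.encode(prompt))

if prompt_tokens > CONTEXT_WINDOW - RESERVED_OUTPUT:
    print("Prompt too long: truncate or chunk it before sending.")
else:
    remaining = CONTEXT_WINDOW - prompt_tokens
    print(f"{prompt_tokens} prompt tokens; up to {remaining} tokens available for output.")
```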
Related Notes
Links to this note
- Context Window: the maximum number of tokens a language model can process in a single request, including both the input prompt and the generated output.
- Prompt Engineering