Tokens
Tokens are the basic units of text that Large Language Models (LLMs) use for processing and generation. A token can represent a full word, part of a word (subword), punctuation, or even whitespace—depending on the model’s tokenizer.
Purpose
LLMs don’t operate on raw text—they operate on sequences of tokens. All generation, memory limits, and costs are measured in terms of tokens.
How Tokenization Works
- Input text is broken down using a tokenizer.
- Common schemes: Byte Pair Encoding (BPE), WordPiece, SentencePiece.
- For example, “ChatGPT is great!” → ["Chat", "G", "PT", " is", " great", "!"] (see the snippet below).
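A minimal sketch of this split using the open-source tiktoken library; the cl100k_base encoding is an assumption here, and other tokenizers (BPE variants, WordPiece, SentencePiece) will produce different splits.

```python
import tiktoken

# Load an encoding and tokenize the example sentence.
enc = tiktoken.get_encoding("cl100k_base")  # assumed encoding; match your model
token_ids = enc.encode("ChatGPT is great!")

# Decode each id individually to see the text piece it represents.
pieces = [enc.decode([t]) for t in token_ids]
print(token_ids)  # integer ids the model actually consumes
print(pieces)     # text pieces, roughly like ['Chat', 'G', 'PT', ' is', ' great', '!']
```

Decoding the full id list with enc.decode(token_ids) reconstructs the original string.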
Why Tokens Matter
- Prediction Unit: LLMs predict one token at a time.
- Cost Unit: OpenAI, Anthropic, etc., bill usage per token.
- Limit Unit: Each model has a max token window (e.g., 4k, 8k, 128k).
- Performance Impact: More tokens mean more compute and higher latency.
Estimating Token Counts
- English: ~1 token ≈ ¾ of a word.
- 100 tokens ≈ 75 words ≈ 5–7 sentences (rough estimate; checked against an exact count in the snippet after this list).
- Tools: OpenAI’s tokenizer, tiktoken (Python lib), Hugging Face tokenizers.
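As a quick check of the ¾-of-a-word heuristic, the snippet below compares the estimate against an exact count from tiktoken; the encoding choice is an assumption, so use the one matching your model.

```python
import tiktoken

text = "Tokens are the basic units of text that language models process."

# Exact count from the tokenizer.
enc = tiktoken.get_encoding("cl100k_base")  # assumed encoding
exact = len(enc.encode(text))

# Heuristic: ~1 token per 3/4 of a word, i.e. tokens ≈ words / 0.75.
estimate = round(len(text.split()) / 0.75)

print(f"heuristic estimate: {estimate}, exact count: {exact}")
```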
Use Cases
- Token counting for cost estimation.
- Truncating or chunking input to fit within model limits (a chunking sketch follows this list).
- Token-level control in generation tasks (e.g., summaries, classification).
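A small illustrative helper for the chunking use case; chunk_by_tokens is a hypothetical function written for this note, not part of any library, and it splits purely on token count without respecting sentence boundaries.

```python
import tiktoken

def chunk_by_tokens(text: str, max_tokens: int, encoding: str = "cl100k_base") -> list[str]:
    """Split text into pieces of at most max_tokens tokens each."""
    enc = tiktoken.get_encoding(encoding)
    ids = enc.encode(text)
    # Decode fixed-size slices of the id list back into text chunks.
    return [enc.decode(ids[i:i + max_tokens]) for i in range(0, len(ids), max_tokens)]

# Usage: break a long document into ~50-token chunks.
chunks = chunk_by_tokens("A long document to be chunked. " * 200, max_tokens=50)
print(len(chunks), "chunks; first chunk starts with:", chunks[0][:60])
```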
Practical Considerations
- Always account for both input and output tokens in API usage (see the budgeting sketch after this list).
- Use tokenizer libraries to test and preview token behavior.
- Prompt structure can greatly affect tokenization and model efficiency.
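A budgeting sketch for the input/output point above: the prompt and the completion share one context window, so the room left for output shrinks as the prompt grows. The window size and reserve below are illustrative numbers, not tied to a specific model.

```python
import tiktoken

CONTEXT_WINDOW = 8_000   # assumed model limit, in tokens
RESERVED_OUTPUT = 1_000  # tokens to keep free for the model's reply

enc = tiktoken.get_encoding("cl100k_base")  # assumed encoding
prompt = "Summarize the following report:\n..."  # stand-in for a real prompt
prompt_tokens = len(enc.encode(prompt))

if prompt_tokens > CONTEXT_WINDOW - RESERVED_OUTPUT:
    print("Prompt too long: truncate or chunk it before sending.")
else:
    remaining = CONTEXT_WINDOW - prompt_tokens
    print(f"{prompt_tokens} prompt tokens; up to {remaining} tokens available for output.")
```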
Related Notes
Links to this note
- Context Window: the maximum number of tokens a language model can process in a single request, including both the input prompt and the generated output.
- Prompt Engineering