Tokenization
Text becomes numbers. Numbers become predictions.
A language model never sees text. It sees numbers — token IDs. Tokenization is the translation step: raw text in, integer sequence out.
Analogy
Think of a delicatessen where every item on the counter has a numbered sticker. When you order a sandwich, the cashier does not write "turkey on rye, no mayo" — they punch in 47, 12, 3. The kitchen only ever sees numbers. The menu (vocabulary) decides what chunks exist: "turkey" is one sticker, but an unusual request like "capicola" might get split into "capi" and "cola". Two delis a block apart will number the same sandwich completely differently.
What a token is
A token is the smallest unit the model processes. It is neither a character nor a word but a learned chunk. Common subword tokenizers (BPE, WordPiece, Unigram) build a vocabulary of roughly 30,000–200,000 chunks by merging character or byte sequences that co-occur frequently in the training data.
| Input | Tokens |
|---|---|
| tokenization | token + ization |
| GPT | G + PT |
| unhappiness | un + happiness |
Single characters are always representable (a byte-level vocabulary includes all 256 single-byte tokens as a fallback). Common words are often a single token. Rare words or code fragments split into many tokens.
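You can see these splits for yourself with OpenAI's tiktoken library, which is not mentioned above but implements the BPE encodings used by GPT models; the exact IDs and splits depend on which encoding you load and may differ from the table.

```python
# pip install tiktoken
import tiktoken

# cl100k_base is the encoding used by GPT-4 / GPT-3.5-turbo.
# Other models use other encodings and will split the same text differently.
enc = tiktoken.get_encoding("cl100k_base")

for text in ["tokenization", "GPT", "unhappiness", "hello world"]:
    ids = enc.encode(text)
    pieces = [enc.decode([i]) for i in ids]   # decode each ID back to its chunk
    print(text, "->", ids, "->", pieces)
```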
Why it matters
Context window is measured in tokens, not words. gpt-4o has a 128,000-token window. A token is roughly 0.75 English words. A 100-page document is roughly 25,000 tokens.
Cost and speed scale with token count. A prompt that says the same thing in fewer tokens is cheaper to send and faster to process.
Model behaviour changes at token boundaries. 2023 is one token. 2 0 2 3 is four. Arithmetic works differently on each.
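Both points are easy to check with a tokenizer library. A short sketch using tiktoken (an assumption, not part of the text above; the exact counts depend on the encoding you pick):

```python
# pip install tiktoken
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

prompt = "Summarize the attached report in three bullet points."
ids = enc.encode(prompt)
print(len(prompt.split()), "words ->", len(ids), "tokens")

# Token boundaries change behaviour: the same digits, tokenized differently.
print(len(enc.encode("2023")), "token(s) for '2023'")
print(len(enc.encode("2 0 2 3")), "token(s) for '2 0 2 3'")
```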
The context window as a tape
Think of the context window as a fixed-length tape. Each new token appends to the right. When the tape is full, the oldest tokens fall off the left. Nothing outside the window exists to the model.
| Position | Meaning |
|---|---|
| Earlier tokens | Older context, may be forgotten |
| Later tokens | More recent, always present |
| System prompt | Usually at position 0, always present if it fits |
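A minimal sketch of that left-truncation, assuming a tiktoken encoding and a plain list of chat turns; the fit_to_window helper is hypothetical, and real chat APIs also count per-message formatting overhead, which this ignores:

```python
# pip install tiktoken
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def fit_to_window(system_prompt: str, turns: list[str], max_tokens: int) -> list[str]:
    """Keep the system prompt, then as many of the most recent turns as fit.
    Older turns fall off the left, exactly as the tape analogy describes."""
    budget = max_tokens - len(enc.encode(system_prompt))
    kept: list[str] = []
    for turn in reversed(turns):              # walk newest to oldest
        cost = len(enc.encode(turn))
        if cost > budget:
            break                             # everything older is forgotten
        kept.append(turn)
        budget -= cost
    return [system_prompt] + list(reversed(kept))

history = ["turn 1: hello", "turn 2: a long question about tokenizers", "turn 3: follow-up"]
print(fit_to_window("You are a helpful assistant.", history, max_tokens=40))
```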
Three tokenizer families
Byte-Pair Encoding (BPE) — used by GPT models. Starts with individual bytes, iteratively merges the most frequent adjacent pair until the target vocabulary size is reached.
WordPiece — used by BERT. Similar to BPE but merges pairs that maximize the likelihood of the training data rather than raw frequency.
SentencePiece / Unigram — language-agnostic; trains directly on raw text without a whitespace pre-tokenization step, which suits languages written without spaces. Common in multilingual models.
All three produce different tokenizations of the same text. A token ID of 1234 in GPT-4 means something different from token 1234 in Llama-3.
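To make the BPE merge loop concrete, here is a toy sketch: start from single characters, count adjacent pairs, merge the most frequent, repeat. Production tokenizers run this over bytes on a huge corpus with pre-tokenization and special tokens, but the core loop is the same. The train_bpe name and the tiny corpus are illustrative only.

```python
from collections import Counter

def train_bpe(text: str, num_merges: int) -> list[tuple[str, str]]:
    """Toy BPE training: repeatedly merge the most frequent adjacent pair."""
    tokens = list(text)                      # start from single characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]
        merges.append((a, b))
        # Apply the new merge everywhere in the sequence.
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
                merged.append(a + b)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return merges

print(train_bpe("low lower lowest", num_merges=5))
```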
What the model sees
After tokenization, each token ID is looked up in an embedding table — a learned matrix with one row per token. The sequence of embedding vectors is what enters the transformer. The text itself is gone.
"hello world"
→ [15339, 1917] (token IDs via BPE)
→ [[0.12, -0.34, …], (embedding row 15339)
[0.88, 0.01, …]] (embedding row 1917)
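A sketch of that lookup in NumPy: the vocabulary size and embedding dimension below are made up, and the rows are random rather than learned, but the indexing is the same operation a real model performs.

```python
import numpy as np

vocab_size, d_model = 100_000, 8                # toy sizes; real models use e.g. 100k x 4096
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(vocab_size, d_model))   # one learned row per token ID

token_ids = [15339, 1917]                       # "hello world" after tokenization
x = embedding_table[token_ids]                  # shape (2, d_model): what enters the transformer
print(x.shape)
```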
The model operates entirely in this numeric space. Tokenization is the bridge.