Tokenization
Text becomes numbers. Numbers become predictions.
A language model never sees text. It sees numbers — token IDs. Tokenization is the translation step: raw text in, integer sequence out.
Analogy
Think of a delicatessen where every item on the counter has a numbered sticker. When you order a sandwich, the cashier does not write "turkey on rye, no mayo" — they punch in 47, 12, 3. The kitchen only ever sees numbers. The menu (vocabulary) decides what chunks exist: "turkey" is one sticker, but an unusual request like "capicola" might get split into "capi" and "cola". Two delis a block apart will number the same sandwich completely differently.
What a token is
A token is the smallest unit the model processes. It is neither a character nor a word but a learned chunk. Common subword tokenizers (BPE, WordPiece, Unigram) build a vocabulary of roughly 30,000–200,000 chunks by merging character or byte sequences that co-occur frequently in the training data.
| Input | Tokens |
|---|---|
| tokenization | token + ization |
| GPT | G + PT |
| unhappiness | un + happiness |
Single characters are always representable (a byte-level vocabulary includes all 256 single-byte tokens as a fallback). Common words are often a single token. Rare words or code fragments split into many tokens.
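You can see these splits for yourself with OpenAI's tiktoken library, which is not mentioned above but implements the BPE encodings used by GPT models; the exact IDs and splits depend on which encoding you load and may differ from the table.

```python
# pip install tiktoken
import tiktoken

# cl100k_base is the encoding used by GPT-4 / GPT-3.5-turbo.
# Other models use other encodings and will split the same text differently.
enc = tiktoken.get_encoding("cl100k_base")

for text in ["tokenization", "GPT", "unhappiness", "hello world"]:
    ids = enc.encode(text)
    pieces = [enc.decode([i]) for i in ids]   # decode each ID back to its chunk
    print(text, "->", ids, "->", pieces)
```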
Why it matters
Context window is measured in tokens, not words. gpt-4o has a 128,000-token window. A token is roughly 0.75 English words. A 100-page document is roughly 25,000 tokens.
Cost and speed scale with token count. A prompt that says the same thing in fewer tokens is cheaper to send and faster to process.
Model behaviour changes at token boundaries. 2023 is one token. 2 0 2 3 is four. Arithmetic works differently on each.
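Both points are easy to check with a tokenizer library. A short sketch using tiktoken (an assumption, not part of the text above; the exact counts depend on the encoding you pick):

```python
# pip install tiktoken
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

prompt = "Summarize the attached report in three bullet points."
ids = enc.encode(prompt)
print(len(prompt.split()), "words ->", len(ids), "tokens")

# Token boundaries change behaviour: the same digits, tokenized differently.
print(len(enc.encode("2023")), "token(s) for '2023'")
print(len(enc.encode("2 0 2 3")), "token(s) for '2 0 2 3'")
```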
The context window as a tape
Think of the context window as a fixed-length tape. Each new token appends to the right. When the tape is full, the oldest tokens fall off the left. Nothing outside the window exists to the model.
| Position | Meaning |
|---|---|
| Earlier tokens | Older context, may be forgotten |
| Later tokens | More recent, always present |
| System prompt | Usually at position 0, always present if it fits |
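A minimal sketch of that left-truncation, assuming a tiktoken encoding and a plain list of chat turns; the fit_to_window helper is hypothetical, and real chat APIs also count per-message formatting overhead, which this ignores:

```python
# pip install tiktoken
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def fit_to_window(system_prompt: str, turns: list[str], max_tokens: int) -> list[str]:
    """Keep the system prompt, then as many of the most recent turns as fit.
    Older turns fall off the left, exactly as the tape analogy describes."""
    budget = max_tokens - len(enc.encode(system_prompt))
    kept: list[str] = []
    for turn in reversed(turns):              # walk newest to oldest
        cost = len(enc.encode(turn))
        if cost > budget:
            break                             # everything older is forgotten
        kept.append(turn)
        budget -= cost
    return [system_prompt] + list(reversed(kept))

history = ["turn 1: hello", "turn 2: a long question about tokenizers", "turn 3: follow-up"]
print(fit_to_window("You are a helpful assistant.", history, max_tokens=40))
```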
Three tokenizer families
Byte-Pair Encoding (BPE) — used by GPT models. Starts with individual bytes, iteratively merges the most frequent adjacent pair until the target vocabulary size is reached.
WordPiece — used by BERT. Similar to BPE but merges pairs that maximize the likelihood of the training data rather than raw frequency.
SentencePiece / Unigram — language-agnostic; trains directly on raw text without a whitespace pre-tokenization step, which suits languages written without spaces. Common in multilingual models.
All three produce different tokenizations of the same text. A token ID of 1234 in GPT-4 means something different from token 1234 in Llama-3.
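To make the BPE merge loop concrete, here is a toy sketch: start from single characters, count adjacent pairs, merge the most frequent, repeat. Production tokenizers run this over bytes on a huge corpus with pre-tokenization and special tokens, but the core loop is the same. The train_bpe name and the tiny corpus are illustrative only.

```python
from collections import Counter

def train_bpe(text: str, num_merges: int) -> list[tuple[str, str]]:
    """Toy BPE training: repeatedly merge the most frequent adjacent pair."""
    tokens = list(text)                      # start from single characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]
        merges.append((a, b))
        # Apply the new merge everywhere in the sequence.
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
                merged.append(a + b)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return merges

print(train_bpe("low lower lowest", num_merges=5))
```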
What the model sees
After tokenization, each token ID is looked up in an embedding table — a learned matrix with one row per token. The sequence of embedding vectors is what enters the transformer. The text itself is gone.
"hello world"
→ [15339, 1917] (token IDs via BPE)
→ [[0.12, -0.34, …], (embedding row 15339)
[0.88, 0.01, …]] (embedding row 1917)
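A sketch of that lookup in NumPy: the vocabulary size and embedding dimension below are made up, and the rows are random rather than learned, but the indexing is the same operation a real model performs.

```python
import numpy as np

vocab_size, d_model = 100_000, 8                # toy sizes; real models use e.g. 100k x 4096
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(vocab_size, d_model))   # one learned row per token ID

token_ids = [15339, 1917]                       # "hello world" after tokenization
x = embedding_table[token_ids]                  # shape (2, d_model): what enters the transformer
print(x.shape)
```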
The model operates entirely in this numeric space. Tokenization is the bridge.