Embeddings
Vectors encode meaning. Proximity implies similarity.
After tokenization, each token ID is mapped to a dense vector — its embedding. These vectors are not hand-crafted. They are learned. The geometry that emerges encodes meaning.
Analogy
Picture a giant library where librarians have quietly been rearranging the shelves for years based on which books get checked out together. Cookbooks drift near gardening books, which drift near beekeeping. Nobody labeled the zones; reader behavior did. A new visitor asking for "sourdough" can wander one aisle and find the right neighborhood without ever knowing the Dewey system. The aisle position is the meaning.
What an embedding is
An embedding is a point in high-dimensional space. Modern models use 768–12,288 dimensions. Two tokens that appear in similar contexts end up geometrically close. Distance in embedding space approximates semantic similarity.
embed("king") → [0.24, -0.81, 0.37, …] (12,288 floats)
embed("queen") → [0.22, -0.83, 0.39, …] (12,288 floats)
distance = 0.08 (very close)
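A minimal sketch of that comparison in NumPy. The vectors are made-up 4-dimensional stand-ins for real 12,288-dimensional embeddings, and cosine similarity is chosen here as the closeness measure:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors; 1.0 means identical direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Illustrative 4-d stand-ins for real 12,288-d embeddings.
king  = np.array([0.24, -0.81, 0.37, 0.10])
queen = np.array([0.22, -0.83, 0.39, 0.12])
apple = np.array([-0.60, 0.20, 0.05, -0.90])

print(cosine_similarity(king, queen))  # ≈ 0.999: similar contexts, nearby vectors
print(cosine_similarity(king, apple))  # ≈ -0.37: unrelated meaning, far apart
```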
Why geometry encodes meaning
The model learns embeddings by predicting context. If "cat" and "kitten" appear in similar sentences, gradient descent pushes their vectors together. No human labels the relationship — the structure of language does it automatically.
This produces useful algebraic structure:
embed("king") - embed("man") + embed("woman") ≈ embed("queen")
The famous word analogy works because gender and royalty are separable directions in embedding space.
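The arithmetic can be checked against real pretrained vectors. A sketch using the gensim library and its downloadable 50-dimensional GloVe word vectors (the first call downloads them; exact results depend on which vectors you load):

```python
import gensim.downloader as api

# Small pretrained GloVe word vectors (downloaded on first use).
vectors = api.load("glove-wiki-gigaword-50")

# king - man + woman: positive terms are added, negative terms subtracted.
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
# 'queen' typically appears at or near the top of the returned list.
```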
Clusters in the space
Words cluster by type without supervision:
| Cluster | Examples |
|---|---|
| Countries | France, Germany, Japan |
| Animals | cat, dog, lion |
| Actions | run, jump, walk |
| Emotions | happy, sad, angry |
The model never saw these categories labeled. They self-organized from co-occurrence patterns in training text.
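A rough way to probe these clusters, using the same pretrained GloVe vectors as above (a quick sanity check, not a rigorous clustering experiment): within-cluster pairs should score noticeably higher than across-cluster pairs.

```python
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-50")

print(vectors.similarity("cat", "dog"))         # animal vs. animal: high
print(vectors.similarity("france", "germany"))  # country vs. country: high
print(vectors.similarity("cat", "france"))      # animal vs. country: much lower
```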
Token embeddings vs. positional embeddings
The embedding table maps token ID → vector. But the model also needs to know where in the sequence a token appears.
A separate positional embedding encodes sequence position (0, 1, 2, …) as another vector. The two are added:
```
input_vector[i] = token_embedding[id_i] + position_embedding[i]
```
The transformer then operates on these combined vectors. Without positional embeddings, the model would treat "dog bites man" and "man bites dog" identically.
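A minimal PyTorch sketch of that addition, with illustrative sizes and made-up token IDs (this is the learned-positional-embedding scheme described above; some models use sinusoidal or rotary encodings instead):

```python
import torch
import torch.nn as nn

vocab_size, max_len, d_model = 50_000, 2_048, 768    # illustrative sizes

token_embedding = nn.Embedding(vocab_size, d_model)   # token ID -> vector
position_embedding = nn.Embedding(max_len, d_model)   # position -> vector

token_ids = torch.tensor([[464, 3290, 22000, 582]])   # made-up IDs, (batch=1, seq=4)
positions = torch.arange(token_ids.size(1)).unsqueeze(0)  # [[0, 1, 2, 3]]

# The same token at a different position yields a different combined vector,
# which is how "dog bites man" and "man bites dog" become distinguishable.
input_vectors = token_embedding(token_ids) + position_embedding(positions)
print(input_vectors.shape)  # torch.Size([1, 4, 768])
```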
What changes during training
The embedding table starts randomly initialized. Each gradient update nudges the rows for the tokens in the current batch (and, when input and output embeddings are tied, the output softmax touches every row). By the end of pre-training, the table has learned a compressed world model: word sense, syntactic role, domain, register, and more, all encoded in a fixed number of floats per token.
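A tiny PyTorch sketch of that learning signal, with a dummy loss standing in for a real language-modeling loss: back-propagation delivers gradient to the embedding rows used in the batch.

```python
import torch
import torch.nn as nn

embedding = nn.Embedding(10, 4)            # tiny table: 10 tokens, 4 dims each

token_ids = torch.tensor([2, 7])           # a "batch" containing tokens 2 and 7
loss = embedding(token_ids).pow(2).sum()   # dummy loss in place of an LM loss
loss.backward()

# Rows 2 and 7 received gradient; the other rows of this input table did not.
print(embedding.weight.grad.abs().sum(dim=1))
```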
Retrieval-augmented generation (RAG)
Embeddings extend beyond individual tokens. Whole sentences or documents can be embedded by aggregating token vectors. These document embeddings power semantic search: given a query, find the stored documents whose embeddings are nearest.
That nearest-neighbor lookup is the core of RAG. The retrieved chunks are added to the prompt before the model answers, giving it access to information not in its weights.
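A minimal sketch of that lookup in NumPy, assuming a hypothetical embed() function that returns unit-length vectors. The stand-in below is random, so it will not retrieve semantically meaningful neighbors; a real system would call an embedding model and usually an approximate nearest-neighbor index rather than brute force:

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Hypothetical stand-in: a real model maps similar texts to nearby vectors."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=384)
    return v / np.linalg.norm(v)

documents = [
    "Sourdough starters need regular feeding.",
    "Bee colonies are strongest in late spring.",
    "Transformers process tokens in parallel.",
]
doc_matrix = np.stack([embed(d) for d in documents])   # (n_docs, 384)

query = "How do I keep a sourdough starter alive?"
scores = doc_matrix @ embed(query)     # dot product = cosine for unit vectors
top = np.argsort(scores)[::-1][:2]     # indices of the 2 nearest documents

# The retrieved chunks are prepended to the prompt before the model answers.
context = "\n".join(documents[i] for i in top)
prompt = f"Context:\n{context}\n\nQuestion: {query}"
print(prompt)
```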