ai · level 2

Embeddings

Vectors encode meaning. Proximity implies similarity.

150 XP

Embeddings

After tokenization, each token ID is mapped to a dense vector — its embedding. These vectors are not hand-crafted. They are learned. The geometry that emerges encodes meaning.

Analogy

Picture a giant library where librarians have quietly been rearranging the shelves for years based on which books get checked out together. Cookbooks drift near gardening books, which drift near beekeeping. Nobody labelled the zones — reader behaviour did. A new visitor asking for "sourdough" can wander one aisle and find the right neighbourhood without ever knowing the Dewey system. The aisle position is the meaning.

What an embedding is

An embedding is a point in high-dimensional space. Modern models typically use between 768 and 12,288 dimensions. Two tokens that appear in similar contexts end up geometrically close. Distance in embedding space tracks semantic similarity: the closer two vectors, the more related their meanings.

embed("king")   → [0.24, -0.81, 0.37, …]   (12,288 floats)
embed("queen")  → [0.22, -0.83, 0.39, …]   (12,288 floats)
distance = 0.08                              (very close)
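The closeness above can be measured with cosine distance. Here is a minimal sketch using toy 3-dimensional vectors (real embeddings have thousands of dimensions; the numbers below are invented for illustration):

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity: near 0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1 - dot / (norm_a * norm_b)

# Toy embeddings (made-up values, 3 dims instead of 12,288)
king  = [0.24, -0.81, 0.37]
queen = [0.22, -0.83, 0.39]
fish  = [-0.90, 0.10, -0.40]

print(cosine_distance(king, queen))  # small: related meanings
print(cosine_distance(king, fish))   # large: unrelated meanings
```

Cosine distance is preferred over raw Euclidean distance in practice because it compares direction, not magnitude, and embedding magnitudes vary with token frequency.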

Why geometry encodes meaning

The model learns embeddings by predicting context. If "cat" and "kitten" appear in similar sentences, gradient descent pushes their vectors together. No human labels the relationship — the structure of language does it automatically.
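The pulling-together effect can be illustrated with a deliberately simplified loss. Real models minimize a context-prediction loss, not the squared distance used below; this sketch only shows the geometric consequence of repeated gradient steps on two co-occurring words:

```python
# Toy illustration: gradient steps shrink the gap between
# vectors that the loss wants to agree. Values are invented.
cat    = [1.0, 0.0]
kitten = [0.0, 1.0]
lr = 0.1  # learning rate

for _ in range(50):
    # Gradient of squared distance pushes each vector toward the other
    grad = [c - k for c, k in zip(cat, kitten)]
    cat    = [c - lr * g for c, g in zip(cat, grad)]
    kitten = [k + lr * g for k, g in zip(kitten, grad)]

gap = sum((c - k) ** 2 for c, k in zip(cat, kitten))
print(gap)  # near zero: the vectors have converged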

This produces useful algebraic structure:

embed("king") - embed("man") + embed("woman") ≈ embed("queen")

The famous word analogy works because gender and royalty are separable directions in embedding space.
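With toy vectors where the two dimensions are chosen by hand to be a "royalty" axis and a "gender" axis (real models learn such directions rather than having them assigned), the analogy reduces to plain arithmetic:

```python
# Hand-built 2-d vectors: dim 0 = royalty, dim 1 = gender.
# Learned embeddings encode these as directions, not single dims.
man   = [0.0,  1.0]
woman = [0.0, -1.0]
king  = [1.0,  1.0]
queen = [1.0, -1.0]

result = [k - m + w for k, m, w in zip(king, man, woman)]
print(result)  # equals queen: royalty kept, gender flipped
```

In real embedding spaces the equation holds only approximately, which is why the relation is written with ≈ rather than =.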

Clusters in the space

Words cluster by type without supervision:

Cluster     Examples
Countries   France, Germany, Japan
Animals     cat, dog, lion
Actions     run, jump, walk
Emotions    happy, sad, angry

The model never saw these categories labeled. They self-organized from co-occurrence patterns in training text.

Token embeddings vs. positional embeddings

The embedding table maps token ID → vector. But the model also needs to know where in the sequence a token appears.

A separate positional embedding encodes sequence position (0, 1, 2, …) as another vector. The two are added:

input_vector[i] = token_embedding[id_i] + position_embedding[i]

The transformer then operates on these combined vectors. Without positional embeddings, the model would treat "dog bites man" and "man bites dog" identically.
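The addition above can be sketched directly. This is a minimal toy with learned-style (random) tables; real models use far larger tables and often more sophisticated positional schemes:

```python
import random

vocab_size, max_len, d_model = 100, 16, 8
random.seed(0)

# Both tables start random; training would adjust every value.
token_embedding    = [[random.gauss(0, 1) for _ in range(d_model)]
                      for _ in range(vocab_size)]
position_embedding = [[random.gauss(0, 1) for _ in range(d_model)]
                      for _ in range(max_len)]

def embed_sequence(token_ids):
    """input_vector[i] = token_embedding[id_i] + position_embedding[i]"""
    return [
        [t + p for t, p in zip(token_embedding[tid], position_embedding[i])]
        for i, tid in enumerate(token_ids)
    ]

# Same token IDs, different order -> different input vectors,
# so the transformer can tell "dog bites man" from "man bites dog".
a = embed_sequence([5, 7, 9])
b = embed_sequence([9, 7, 5])
print(a != b)  # True
```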

What changes during training

The embedding table starts randomly initialized. Gradient updates steadily reshape its entries over billions of training examples. By the end of pre-training, the table has learned a compressed world model — word sense, syntactic role, domain, register, and more — all encoded in a fixed number of floats per token.

Retrieval-augmented generation (RAG)

Embeddings extend beyond individual tokens. Whole sentences or documents can be embedded by pooling their token vectors (averaging is a common choice). These document embeddings power semantic search: given a query, find the stored documents whose embeddings are nearest.

That nearest-neighbor lookup is the core of RAG. The retrieved chunks are added to the prompt before the model answers, giving it access to information not in its weights.
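The retrieval step can be sketched as a brute-force nearest-neighbor search. The documents, their vectors, and the query vector below are all invented for illustration; production systems embed text with a trained model and use an approximate-nearest-neighbor index instead of a full scan:

```python
import math

def cosine(a, b):
    """Cosine similarity: higher means more related."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

# Hypothetical pre-computed document embeddings (toy 3-d vectors)
documents = {
    "Sourdough needs a mature starter.": [0.9, 0.1, 0.0],
    "Knead the dough until elastic.":    [0.8, 0.2, 0.1],
    "Bees pollinate garden flowers.":    [0.1, 0.9, 0.2],
}

def retrieve(query_vec, k=2):
    """Return the k documents whose embeddings are nearest the query."""
    ranked = sorted(documents,
                    key=lambda d: cosine(query_vec, documents[d]),
                    reverse=True)
    return ranked[:k]

query = [0.85, 0.15, 0.05]  # stands in for embed("how do I bake bread?")
chunks = retrieve(query)
# chunks would be prepended to the prompt before the model answers
print(chunks)
```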