ai · level 3

Vectors and Similarity

Direction encodes relationship. Cosine measures it.

150 XP

Embeddings are vectors. Comparing two embeddings means computing how similar their directions are. The standard measure is cosine similarity.

Analogy

Imagine two hikers standing back-to-back on a hilltop, each pointing at a distant peak. What matters is whether their arms point the same way — not how long their arms are. One hiker might be a kid, one might be a giant, but if both are aimed at the same peak, they "agree". Spin one of them ninety degrees and the agreement collapses to nothing; turn them fully opposite and they disagree as strongly as possible.

Cosine similarity

Cosine similarity ignores vector magnitude and measures only angle. Two vectors pointing in the same direction score 1.0. Perpendicular vectors score 0. Opposite vectors score -1.0.

cosine(A, B) = (A · B) / (|A| × |B|)

For normalized vectors (length = 1), the denominator is 1, so the dot product alone gives the cosine similarity. Most embedding libraries normalize embeddings before storage so that similarity lookup reduces to a fast dot product.
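A minimal sketch of the formula in numpy (the input vectors are invented toy values, not real embeddings):

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: dot product divided by the product of the magnitudes."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])               # same direction, twice the length

print(round(cosine(a, b), 6))               # → 1.0: magnitude is ignored
print(round(cosine(a, -a), 6))              # → -1.0: opposite directions
print(round(cosine(a, np.array([0.0, -3.0, 2.0])), 6))  # → 0.0: perpendicular

# For unit-length (normalized) vectors, the dot product alone is the cosine:
a_hat = a / np.linalg.norm(a)
b_hat = b / np.linalg.norm(b)
print(round(float(np.dot(a_hat, b_hat)), 6))  # → 1.0
```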

Score         Meaning
0.95 – 1.0    Near-identical meaning
0.80 – 0.95   Strongly related
0.60 – 0.80   Related, not synonymous
< 0.60        Weak or no relation

Euclidean distance vs. cosine

Euclidean distance is sensitive to magnitude. A short vector for "cat" and a long vector for "feline" might show large Euclidean distance even if they point in the same direction. Cosine similarity avoids this — it normalizes out the magnitude. Use cosine when you care about semantic direction, Euclidean when magnitude carries information.
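The "cat"/"feline" scenario can be checked numerically. These two toy vectors (invented for illustration) point in exactly the same direction, but one is five times longer:

```python
import numpy as np

cat    = np.array([0.2, 0.4, 0.4])   # short vector (toy values)
feline = np.array([1.0, 2.0, 2.0])   # same direction, 5x the magnitude

euclid = np.linalg.norm(cat - feline)
cosim  = float(np.dot(cat, feline) / (np.linalg.norm(cat) * np.linalg.norm(feline)))

print(round(euclid, 6))   # → 2.4: large distance, despite identical direction
print(round(cosim, 6))    # → 1.0: cosine normalizes out the magnitude
```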

Vector operations on meaning

The vector space is linear. You can do arithmetic on meaning:

v("Paris") - v("France") + v("Germany") ≈ v("Berlin")

This works because the directions encoding "capital city of" and "country" are roughly consistent across the learned space.
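A sketch of the analogy with tiny made-up vectors — the absolute numbers are invented, but they are constructed so the "capital of" offset is consistent across both country/capital pairs, which is the property real embedding spaces approximately exhibit:

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy embeddings: dimension 2 loosely plays the role of "is a capital city"
v = {
    "Paris":   np.array([1.0, 0.2, 0.9]),
    "France":  np.array([1.0, 0.2, 0.1]),
    "Germany": np.array([0.1, 0.9, 0.1]),
    "Berlin":  np.array([0.1, 0.9, 0.9]),
}

# Subtract the country, add another country, look up the nearest word
query = v["Paris"] - v["France"] + v["Germany"]
best = max(v, key=lambda w: cosine(v[w], query))
print(best)   # → Berlin
```

(Real analogy search also excludes the three input words from the candidates; the toy example happens not to need that.)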

Nearest-neighbor search

Given a query vector, find the k stored vectors with the highest cosine similarity. Naively this is O(n × d) — every vector, every dimension. At scale (millions of documents, thousands of dimensions), approximate nearest-neighbor (ANN) algorithms are used:

  • HNSW (Hierarchical Navigable Small World) — graph-based, very fast, slight recall trade-off
  • IVF (Inverted File Index) — clusters vectors, searches only relevant clusters
  • PQ (Product Quantization) — compresses vectors, trades precision for memory

Vector databases (Pinecone, Weaviate, pgvector) implement these algorithms. The query flow is: embed the query, search the index, retrieve the top-k chunk IDs.
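Before reaching for ANN, the exact O(n × d) baseline is a one-liner when the stored vectors are pre-normalized: a single matrix-vector product computes all n cosine similarities at once. A sketch with random stand-in vectors:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "index": 1,000 stored embeddings, 64 dimensions, pre-normalized to unit length
vectors = rng.normal(size=(1000, 64))
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)

def top_k(query: np.ndarray, k: int = 5) -> np.ndarray:
    """Exact nearest-neighbor search by cosine similarity.
    Every stored row is unit length, so one matrix-vector product
    yields all n cosines; sorting picks the k highest."""
    q = query / np.linalg.norm(query)
    sims = vectors @ q                  # n dot products = n cosine similarities
    return np.argsort(sims)[::-1][:k]   # indices of the k best matches

ids = top_k(vectors[42])
print(ids[0])   # → 42: the query vector matches itself with similarity 1.0
```

ANN indexes like HNSW or IVF replace the full scan with a structure that inspects only a small fraction of the vectors, trading a little recall for large speedups.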

Dimensions and meaning

High-dimensional spaces have counter-intuitive geometry. Almost all pairs of random vectors are nearly orthogonal. The model exploits this: with 12,288 dimensions, there is room to encode thousands of independent semantic properties without interference.
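The near-orthogonality claim is easy to verify empirically: the typical |cosine| between two random Gaussian vectors shrinks roughly like 1/√d as the dimension grows. A quick numerical check:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_cosine(dim: int) -> float:
    """Cosine similarity between two independent random Gaussian vectors."""
    a, b = rng.normal(size=(2, dim))
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

for dim in (3, 100, 10_000):
    mean_abs = np.mean([abs(random_cosine(dim)) for _ in range(500)])
    print(dim, round(float(mean_abs), 3))   # typical |cosine| shrinks ~ 1/sqrt(dim)
```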

A toy example with 3 dimensions:

dimension 0: royalty
dimension 1: gender (positive=female)
dimension 2: animacy

king  ≈ [0.9,  -0.8,  0.7]
queen ≈ [0.9,   0.8,  0.7]
dog   ≈ [0.0,   0.0,  0.8]
table ≈ [0.0,   0.0, -0.9]

Real embeddings don't have clean human-interpretable dimensions, but the same logic holds at scale.
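One way to sanity-check the toy numbers is to compute the pairwise cosines directly:

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

king  = np.array([0.9, -0.8,  0.7])
queen = np.array([0.9,  0.8,  0.7])
dog   = np.array([0.0,  0.0,  0.8])
table = np.array([0.0,  0.0, -0.9])

print(round(cosine(king, queen), 2))  # 0.34: royalty and animacy agree, gender opposes
print(round(cosine(king, dog), 2))    # 0.5:  only animacy is shared
print(round(cosine(dog, table), 2))   # -1.0: exactly opposite directions
```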

From vectors to retrieval

The pattern for retrieval-augmented generation:

  1. Embed all documents at index time. Store (vector, chunk_id) pairs.
  2. At query time, embed the user's question.
  3. Run ANN search. Return top-k chunk IDs.
  4. Fetch the chunks, prepend them to the prompt.
  5. The model answers with access to the retrieved evidence.
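The steps above can be sketched end to end. The embedder here is a toy bag-of-words counter over a fixed vocabulary — a hypothetical stand-in for a real embedding model, which would also place synonyms like "lawsuit" and "litigation" near each other rather than treating them as distinct dimensions:

```python
import numpy as np

VOCAB = ["lawsuit", "litigation", "court", "cat", "feline", "pet"]

def embed(text: str) -> np.ndarray:
    """Toy embedder: unit-normalized word counts over a fixed vocabulary.
    A real system would call an embedding model here instead."""
    counts = np.array([text.lower().split().count(w) for w in VOCAB], float)
    n = np.linalg.norm(counts)
    return counts / n if n else counts

chunks = {
    "c1": "the lawsuit went to court",
    "c2": "my pet cat is a small feline",
    "c3": "litigation costs in court are rising",
}

# Steps 1–2: index time — embed every chunk, store (vector, chunk_id) pairs
index = [(embed(text), cid) for cid, text in chunks.items()]

def retrieve(query: str, k: int = 2) -> list[str]:
    # Steps 2–3: embed the query, score by dot product (vectors are unit length)
    q = embed(query)
    scored = sorted(index, key=lambda vc: -float(np.dot(vc[0], q)))
    return [cid for _, cid in scored[:k]]

# Steps 4–5: fetch the top chunks and prepend them to the prompt
top = retrieve("court lawsuit")
context = "\n".join(chunks[cid] for cid in top)
print(top)   # → ['c1', 'c3']
```

A production pipeline differs mainly in scale: a learned embedding model, an ANN index instead of the sorted full scan, and a chunk store keyed by ID.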

The quality of retrieval depends entirely on embedding quality. A model that embeds "lawsuit" and "litigation" close together will retrieve relevant legal documents for either query term.