Vector Databases

Embeddings, HNSW, and when to add a vector store vs cram into Postgres.

A vector database answers one question at scale: "which records are most semantically similar to this input?" Embeddings plus approximate nearest-neighbour (ANN) indexes are the technical answer; the ecosystem question is "do I really need a separate database for this?", and for most teams the answer should be no.

Analogy

Imagine you have a library where books aren't organised by Dewey Decimal but by their actual ideas. To shelve a book you read it, distil its themes into a 1500-dimensional fingerprint, and place it near books with matching fingerprints. To find books "about hopeful sci-fi", you compute the fingerprint of that phrase and look at the nearest neighbours on the shelf. The fingerprint is the embedding. The shelving algorithm is the ANN index. The library is the vector database. You don't strictly need a special library — you can keep the books in your existing one and just add the fingerprint stamp — but at some scale, a dedicated library starts to make sense.

Embeddings — what they are

An embedding model takes input (text, image, audio) and outputs a fixed-length vector of floats. Similar inputs produce vectors close to each other.

Common dimensions:

OpenAI text-embedding-3-small: 1536 (or 256/512/1024 with truncation).
OpenAI text-embedding-3-large: 3072.
Cohere v3: 1024.
BGE / e5 (open-source): 384–1024.
CLIP for images: 512–768.

The actual numbers in the vector are meaningless to you; they're learned features. What matters is distance — vectors close together represent semantically similar inputs.

Distance metrics

Three common ways to measure "how close":

Metric	Formula	Best for
Cosine	dot(a, b) / (‖a‖ × ‖b‖)	Direction-only similarity (normalised text embeddings — the default)
Euclidean (L2)	√Σ(a_i − b_i)²	When magnitude matters (some image features)
Dot product	Σ a_i × b_i	When vectors are pre-normalised — same as cosine but cheaper

For text embeddings, cosine is the default. Most embedding models output vectors that are already (approximately) unit-normalised, so dot product gives the same ordering as cosine.
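As a small illustration of the three metrics, here is a sketch with NumPy. The vectors are made-up 2-D values chosen so the arithmetic is easy to follow; real embeddings have hundreds or thousands of dimensions.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """dot(a, b) / (|a| * |b|): compares direction only."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def l2_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Euclidean distance: sensitive to magnitude as well as direction."""
    return float(np.linalg.norm(a - b))

def normalise(v: np.ndarray) -> np.ndarray:
    """Scale a vector to unit length."""
    return v / np.linalg.norm(v)

a = np.array([3.0, 4.0])
b = np.array([6.0, 8.0])   # same direction as a, twice the length
c = np.array([4.0, 3.0])   # different direction, same length as a

cos_ab = cosine_similarity(a, b)   # identical direction
l2_ab = l2_distance(a, b)          # but the magnitudes differ

# Once both vectors are unit length, a plain dot product equals cosine.
cos_ac = cosine_similarity(a, c)
dot_ac_unit = float(np.dot(normalise(a), normalise(c)))
```

Here `a` and `b` are "identical" under cosine but 5 units apart under L2, which is the magnitude effect the table describes.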

ANN — approximate nearest neighbour

Exhaustive search is O(n × d): for every query, compute the distance to every stored vector. Fine at thousands; painful at millions. The trick is an ANN index that returns ~95–99% of the true top-k while visiting only a small fraction of the vectors, which can make queries orders of magnitude faster.
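To make the O(n × d) cost concrete, here is a back-of-the-envelope calculation. It assumes float32 storage (4 bytes per component) and counts one multiply-add per component; the function name is only illustrative.

```python
def exhaustive_cost(n_vectors: int, dim: int, bytes_per_float: int = 4):
    """Return (multiply-adds per query, bytes scanned per query)."""
    multiply_adds = n_vectors * dim
    bytes_scanned = multiply_adds * bytes_per_float
    return multiply_adds, bytes_scanned

small = exhaustive_cost(10_000, 1536)      # thousands of vectors
large = exhaustive_cost(1_000_000, 1536)   # a million vectors
```

At 10k vectors a query touches about 61 MB; at a million it touches about 6 GB and 1.5 billion multiply-adds, for every single query.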

The dominant algorithm: HNSW (Hierarchical Navigable Small World) graphs.

The intuition:

Build a multi-layer graph. Top layer = sparse set of "highway" nodes; bottom layer = every vector.
To search: start at the top layer, greedily walk toward the query vector, drop down a layer when no closer neighbour, repeat.
The "small world" property (a few long-range edges alongside many short ones) keeps the number of hops small, so the walk converges quickly.
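The layered greedy walk can be sketched in a few lines. This is a deliberately simplified toy, not real HNSW: the graphs are built by brute force, the top layer is just every 20th vector, and the search keeps a single candidate instead of a candidate list. All names here are illustrative.

```python
import numpy as np

def knn_graph(vectors, ids, n_neighbours):
    """Brute-force neighbour lists among the given node ids (toy construction)."""
    graph = {}
    for i in ids:
        dists = np.linalg.norm(vectors[ids] - vectors[i], axis=1)
        order = np.argsort(dists)
        graph[int(i)] = [int(ids[j]) for j in order if ids[j] != i][:n_neighbours]
    return graph

def greedy_walk(graph, vectors, query, start):
    """Hop to the closest neighbour until no neighbour is closer to the query."""
    current = start
    current_dist = np.linalg.norm(vectors[current] - query)
    while True:
        best, best_dist = current, current_dist
        for nb in graph[current]:
            d = np.linalg.norm(vectors[nb] - query)
            if d < best_dist:
                best, best_dist = nb, d
        if best == current:          # local minimum on this layer
            return current
        current, current_dist = best, best_dist

rng = np.random.default_rng(0)
vectors = rng.normal(size=(200, 8))
all_ids = np.arange(200)
top_ids = all_ids[::20]                      # sparse "highway" layer: 10 nodes
layers = [knn_graph(vectors, top_ids, 3),    # top layer
          knn_graph(vectors, all_ids, 8)]    # bottom layer: every vector

query = rng.normal(size=8)
entry = int(top_ids[0])
node = entry
for graph in layers:                         # drop down a layer after each walk
    node = greedy_walk(graph, vectors, query, node)
```

The result is a local minimum on the bottom layer, which is usually but not always the true nearest neighbour. That gap is exactly why ANN recall is below 100%.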

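HNSW typically answers queries in around a millisecond to a few milliseconds on multi-million-vector indexes at 95–99% recall, depending on dimension and hardware. It's the de-facto standard: pgvector, Qdrant, Weaviate, Milvus, and FAISS all ship an HNSW index, and several of them use it as the default.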

Older alternatives:

IVF (inverted file): cluster the vectors into k centroids; search the nearest m centroids' lists. Lower memory, lower recall.
PQ (product quantization): compress vectors via per-subvector codebooks. Trades recall for memory.
IVF+PQ: combine both. The classic FAISS combo for billion-scale.
DiskANN: SSD-resident graph index (built on the Vamana graph rather than HNSW) for billion-scale.
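The IVF idea from the list above fits in a short NumPy sketch. This is a toy under stated assumptions: a few rounds of plain k-means, random data, and illustrative parameter names (`n_lists`, `n_probe`). It is not how FAISS implements it.

```python
import numpy as np

def nearest_centroid(vectors, centroids):
    """Index of the closest centroid for every vector."""
    dists = np.linalg.norm(vectors[:, None, :] - centroids[None, :, :], axis=2)
    return np.argmin(dists, axis=1)

def build_ivf(vectors, n_lists, n_iters=10, seed=0):
    """Cluster the vectors with a few k-means rounds; return centroids and lists."""
    rng = np.random.default_rng(seed)
    centroids = vectors[rng.choice(len(vectors), n_lists, replace=False)].copy()
    for _ in range(n_iters):
        assign = nearest_centroid(vectors, centroids)
        for c in range(n_lists):
            members = vectors[assign == c]
            if len(members):
                centroids[c] = members.mean(axis=0)
    assign = nearest_centroid(vectors, centroids)
    lists = [np.flatnonzero(assign == c) for c in range(n_lists)]
    return centroids, lists

def ivf_search(query, vectors, centroids, lists, k, n_probe):
    """Scan only the n_probe lists whose centroids are closest to the query."""
    probe = np.argsort(np.linalg.norm(centroids - query, axis=1))[:n_probe]
    candidates = np.concatenate([lists[c] for c in probe])
    dists = np.linalg.norm(vectors[candidates] - query, axis=1)
    return candidates[np.argsort(dists)[:k]]

rng = np.random.default_rng(1)
vectors = rng.normal(size=(500, 16))
query = rng.normal(size=16)
centroids, lists = build_ivf(vectors, n_lists=8)

fast = ivf_search(query, vectors, centroids, lists, k=5, n_probe=2)  # partial scan
full = ivf_search(query, vectors, centroids, lists, k=5, n_probe=8)  # scans everything
```

Probing every list degenerates to exhaustive search; probing fewer lists is faster but can miss neighbours that landed in an unprobed cluster.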

For most teams, HNSW is the right default until you have a reason to look elsewhere.

Recall

ANN trades exactness for speed. The metric:

Recall@k = |actual top-k ∩ ANN top-k| / k

If you ask for the top 10 and ANN returns 9 of the true top 10, recall@10 = 90%. Production HNSW configurations target 95–99%.
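The formula translates directly into code. In this sketch the ID lists are invented for illustration; in practice `exact_ids` would come from an exhaustive scan over a sample of queries and `ann_ids` from the index under test.

```python
def recall_at_k(exact_ids, ann_ids, k):
    """|exact top-k ∩ ANN top-k| / k."""
    exact_top = set(exact_ids[:k])
    ann_top = set(ann_ids[:k])
    return len(exact_top & ann_top) / k

exact = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ann = [1, 2, 3, 4, 5, 6, 7, 8, 9, 42]    # one true neighbour missed
recall = recall_at_k(exact, ann, k=10)
```

Averaging this over a few hundred representative queries gives a usable recall estimate for a given index configuration.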

The knobs (in pgvector's HNSW):

m: graph fanout (default 16). Higher = better recall, more memory.
ef_construction: index build effort (default 64). Higher = slower build, better recall.
ef_search (hnsw.ef_search): query effort (default 40). Higher = slower query, better recall.

Tune ef_search first: it can be changed per session or per query, whereas changing m or ef_construction means rebuilding the index.

pgvector — the right starting point for most teams

pgvector is a Postgres extension. You add a vector(N) column type, build an HNSW index, and query with operators:

CREATE EXTENSION vector;

CREATE TABLE docs (
  id        SERIAL PRIMARY KEY,
  body      text,
  embedding vector(1536)
);

CREATE INDEX ON docs USING hnsw (embedding vector_cosine_ops);

-- top-5 most similar
SELECT id, body, 1 - (embedding <=> $1) AS similarity
FROM docs
ORDER BY embedding <=> $1
LIMIT 5;

Operators:

<=> — cosine distance.
<-> — L2 distance.
<#> — negative inner product (sortable).
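From application code, the same top-5 query might look like the sketch below. It assumes the `docs` table above and an already-open psycopg 3 connection passed in by the caller; the helper names are illustrative. The vector is sent in pgvector's text form (`[x,y,z]`) and cast with `::vector`.

```python
from typing import Sequence

def to_vector_literal(values: Sequence[float]) -> str:
    """Format floats in pgvector's text input form, e.g. '[0.1,0.2,0.3]'."""
    return "[" + ",".join(repr(float(v)) for v in values) + "]"

TOP_K_SQL = """
SELECT id, body, 1 - (embedding <=> %(q)s::vector) AS similarity
FROM docs
ORDER BY embedding <=> %(q)s::vector
LIMIT %(k)s
"""

def top_k(conn, query_embedding: Sequence[float], k: int = 5):
    """Run the similarity query on an open psycopg 3 connection (assumed)."""
    params = {"q": to_vector_literal(query_embedding), "k": k}
    return conn.execute(TOP_K_SQL, params).fetchall()

literal = to_vector_literal([0.1, 2, -3.5])
```

Keeping the `ORDER BY embedding <=> ... LIMIT k` shape matters: that is the pattern the HNSW index can serve.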

Why this is usually the right answer:

Joins for free. Combine vector search with WHERE user_id = ?, JOIN articles, etc.
One database, one transaction model. Embeddings update atomically with the rest of your data.
No new operational story. Backups, replication, monitoring — same as your existing Postgres.
First-class on managed Postgres. AWS RDS / Supabase / Neon / GCP Cloud SQL all support it.

When to pick a dedicated vector DB

Once you genuinely outgrow pgvector. Signals:

Hundreds of millions of vectors and you need sub-10ms p99.
Sharding and rebalancing of the vector index becomes a pain in Postgres.
Multi-tenant filtering at scale (different tenants seeing different subsets) is slow because pgvector's filter happens after the ANN search.
GPU-accelerated search is needed (Milvus and FAISS offer GPU indexes; Postgres doesn't natively).
You want separate scaling for vector workloads vs OLTP.

The contenders:

Pinecone: managed, sharded, popular for production RAG.
Qdrant: Rust, open-source, great filtering.
Weaviate: open-source, multi-tenant, GraphQL.
Milvus: a graduated LF AI & Data Foundation project; designed for billion-scale.
Vespa, Vald: niche but proven in production.

The ops cost: another database to operate, monitor, back up, secure, and pay for.

Common pitfalls

Embedding dimension mismatch. A vector(1536) column can't store a 3072-dim vector. Pick your embedding model first and lock it in; migrating dimensions later is an expand-contract migration.

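Forgetting to normalise. Cosine similarity divides by the vector lengths, but some libraries implement "cosine" as a plain dot product and expect unit-length input; dot product only matches cosine when both vectors are unit length. Most modern embedding models return normalised vectors, but check.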

Filter-then-search vs search-then-filter. Filtering after ANN can return fewer than k results because the ANN top-k might not all match the filter. Pre-filtering (pushing the filter into the index traversal) needs a vector DB that supports it natively (Qdrant, Weaviate). pgvector applies the WHERE clause after the index scan, so either oversample (fetch LIMIT k * 5 and filter in the app), raise hnsw.ef_search, or, on pgvector 0.8.0 and later, enable iterative index scans (hnsw.iterative_scan).
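The shortfall is easy to reproduce. In this sketch an exhaustive NumPy scan stands in for the ANN index, and the data is contrived: 20 unit vectors whose similarity to the query falls as the ID rises, with tenants assigned round-robin. The function name and `oversample` parameter are illustrative.

```python
import numpy as np

angles = np.arange(20) * 0.05
vectors = np.stack([np.cos(angles), np.sin(angles)], axis=1)  # unit vectors
tenant_of = np.arange(20) % 4                                 # ids 3, 7, 11, ... belong to tenant 3
query = np.array([1.0, 0.0])

def search_then_filter(query, k, tenant, oversample=1):
    """Take the top k * oversample by similarity, then drop other tenants' rows."""
    sims = vectors @ query
    candidates = np.argsort(-sims)[: k * oversample]   # stand-in for the ANN top-k
    kept = [int(i) for i in candidates if tenant_of[i] == tenant]
    return kept[:k]

starved = search_then_filter(query, k=5, tenant=3)               # only 1 of 5 survives
filled = search_then_filter(query, k=5, tenant=3, oversample=4)  # all 5 slots filled
```

Oversampling works, but the factor needed depends on how selective the filter is: a tenant holding 1% of the rows needs far more than 5×.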

Storing vectors in JSON. Always use the native vector column type. JSON-encoded floats take more space, can't be served by an ANN index, and have to be parsed before every comparison, which makes queries 5–10× slower.

Stale embeddings. When you update the source text, you must re-embed and re-store. Build a "needs re-embedding" flag and a background job.

Mixing embedding models. Comparing a query embedding from model A against stored vectors from model B is meaningless — they live in different spaces.
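One way to guard against both stale embeddings and mixed models is to store the model name and a hash of the source text next to each vector. The sketch below is a minimal version of that idea; the record shape and function names are hypothetical, not part of pgvector or any library.

```python
import hashlib
from dataclasses import dataclass

@dataclass
class StoredEmbedding:
    """Metadata kept alongside the vector (hypothetical record shape)."""
    model: str          # which embedding model produced the vector
    content_hash: str   # hash of the text that was embedded

def content_hash(text: str) -> str:
    """Stable fingerprint of the source text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def needs_reembedding(stored: StoredEmbedding, current_text: str, current_model: str) -> bool:
    """True when the text changed or the vector came from a different model."""
    return (stored.model != current_model
            or stored.content_hash != content_hash(current_text))

record = StoredEmbedding(model="text-embedding-3-small",
                         content_hash=content_hash("hopeful sci-fi"))

unchanged = needs_reembedding(record, "hopeful sci-fi", "text-embedding-3-small")
edited = needs_reembedding(record, "grim sci-fi", "text-embedding-3-small")
new_model = needs_reembedding(record, "hopeful sci-fi", "text-embedding-3-large")
```

A background job can then select rows where this check is true and re-embed them in batches.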

A reasonable architecture

For most RAG / semantic-search use cases:

Embed at write time in a background job. Don't block the user-facing write path.
Store in pgvector alongside the source text and metadata.
At query time: embed the query (cached if repeated), HNSW search top-k, optionally re-rank with a cross-encoder for the top 50.
Cache the top results aggressively — the same query embeds to the same vector, hits the same top-k.
Monitor recall + latency. When either degrades past the SLO, investigate index params before swapping databases.

That stack handles tens of millions of vectors with sub-100ms p99 on a modest Postgres instance. Most production AI features don't need more.
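The "embed the query, cached if repeated" step can be sketched with the standard library alone. Here `embed_remote` is a hypothetical stand-in for a real embedding API call, and the call counter exists only to show the cache working.

```python
from functools import lru_cache

calls = {"embed": 0}

def embed_remote(text: str) -> tuple:
    """Hypothetical stand-in for a network call to an embedding model."""
    calls["embed"] += 1
    return tuple(float(ord(ch) % 7) for ch in text[:4])   # fake vector

def normalise_query(text: str) -> str:
    """Collapse case and whitespace so trivially different queries share a key."""
    return " ".join(text.lower().split())

@lru_cache(maxsize=10_000)
def embed_query(text: str) -> tuple:
    """Cached wrapper: repeated queries skip the remote call."""
    return embed_remote(text)

first = embed_query(normalise_query("Hopeful  Sci-Fi"))
second = embed_query(normalise_query("hopeful sci-fi"))
```

Returning a tuple keeps the cached value immutable. An in-process cache like this is per worker, so a shared cache would be needed to get the same effect across several application instances.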

When the answer is "no vector DB at all"

For small datasets (<10k vectors), even an in-memory exhaustive scan with NumPy is fine:

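import numpy as np

def topk(query: np.ndarray, vectors: np.ndarray, k: int = 5) -> np.ndarray:
    """Return the indices of the k rows of `vectors` most similar to `query`."""
    sims = vectors @ query          # dot product == cosine when both are unit-normalised
    return np.argsort(-sims)[:k]    # highest similarity first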

For ~100k vectors, FAISS in-process gives sub-ms query times with no database at all. Add a database only when persistence + concurrency demand it.

Tools in the wild

pgvector (library, free tier): Postgres extension; vector columns, HNSW + IVFFlat indexes; cosine / L2 / inner-product.
Pinecone (service): Managed vector DB; sharded HNSW; metadata filtering at scale.
Qdrant (library, free tier): Open-source Rust vector DB; can self-host or use managed.
Weaviate (library, free tier): Open-source vector DB with native multi-tenancy + GraphQL API.
Milvus (library, free tier): Graduated LF AI & Data Foundation vector DB; supports very large-scale workloads.
OpenAI embeddings (service): text-embedding-3-small / -large; 1536 / 3072 dimensions.