system-design · level 3

Sharding & Partitioning

Hash, range, geo — and the keys that decide where every row lives.

150 XP

Sharding & Partitioning

Once a single database can't hold the dataset (storage), can't keep up with the writes (throughput), or can't return queries fast enough (latency), you split it. Sharding is splitting one logical table across many physical machines. Every row lives on exactly one shard. The router decides which.

The decision is simple to state and brutal to get wrong. Pick the right strategy + the right shard key and the system scales horizontally for the next decade. Pick wrong and you'll spend three engineer-years undoing it.

The four strategies

Hash sharding

Hash the shard key into a number, take that number % N, and route to that shard.

shard = fnv1a(user_id) % 4

Strengths:

  • Even distribution, basically by definition. A good hash function spreads keys uniformly.
  • Even load as long as the workload doesn't have hot keys (one celebrity user with a million rows).
  • Point lookups are O(1) — single shard hit.

Weaknesses:

  • Range queries are scatter-gather. WHERE created_at BETWEEN ... needs to hit every shard, merge, sort. Cross-shard queries are 10–100× slower.
  • Resharding is catastrophic. Going from 4 → 5 shards re-hashes ~80% of every row. Hours of downtime or weeks of dual-writing.

When to use it: point-lookup-heavy workloads where rows are looked up by their primary key. Users-by-id, sessions-by-token, items-by-sku. The default if you don't have a strong reason to pick something else.

Range sharding

Each shard owns a contiguous range of the key space.

S0: id 0..1M       S1: id 1M..2M       S2: id 2M..3M       S3: id 3M..∞

Strengths:

  • Range scans are efficientBETWEEN queries hit one or two shards.
  • Hot/cold tiering is natural — old ranges go on cheap storage, new ranges on fast NVMe.
  • Easy to add a new shard — split the most-loaded range or pre-create future ranges.

Weaknesses:

  • Hot shards by default. If your shard key is created_at, every new write lands on the most-recent shard. The other shards are idle.
  • Skew amplifies — celebrities, viral content, top customers all funnel into whichever shard owns their slice.

When to use it: time-series, append-mostly logs, anything you query "the last N days of." Pair with sub-sharding (range-then-hash) when one tier gets too hot.

Geographic sharding

Route by user geography. EU users → EU cluster. US users → US cluster. APAC → APAC.

geo = lookup(user.country)
shard = geo + ":" + (user_id % SHARDS_PER_REGION)

Strengths:

  • Latency — users hit a regional cluster, not a transatlantic one.
  • Compliance — GDPR, data residency requirements often force this.
  • Blast radius — a US-region outage doesn't take down EU users.

Weaknesses:

  • Cross-region queries are expensive. A multi-region report joins data from 4 regions across 4 continents.
  • Migration is hard — when a user moves countries, their data has to move with them.
  • You're now running N times the infrastructure.

When to use it: latency-critical user-facing apps, legal data-residency requirements, B2C where ~95% of queries stay regional.

Consistent hashing

Hash the shard key onto a ring (often 0 .. 2^32 - 1). Hash each shard onto the same ring (with virtual nodes — usually 100–500 per shard for even distribution). For a key, walk clockwise to the next shard.

ring positions:
  S0:  450  890  1240  ...    (virtual nodes)
  S1:  120  670  1500  ...
  S2:  330  780  1100  ...
  S3:  210  920  1380  ...

key hash = 700  →  next clockwise virtual node = S0@890  →  shard S0

Strengths:

  • Adding a shard moves only ~1/N of keys. Going from 4 → 5 shards re-routes ~20% of data, not 80%.
  • Removing a shard is gentle — its keys redistribute across the surviving shards, ~1/(N-1) each.
  • Resharding is incremental — you can add a shard without taking the system down.

Weaknesses:

  • More complex routing. Every client needs the ring topology.
  • Range queries are still scatter-gather — you've solved elasticity, not query patterns.
  • Without virtual nodes, distribution is uneven — you absolutely need vnodes (200+) to smooth it out.

When to use it: any time you expect to grow the shard count over time. Caches (memcached, Cassandra, DynamoDB partitions all use it), CDN edge selection, distributed key-value stores.

Picking the shard key

This is the single most important decision once you commit to sharding. The shard key is the input to the routing function — the column whose value decides which shard the row lives on.

A good shard key has three properties:

Property Why it matters
High cardinality Millions of distinct values. country is bad (200 values). user_id is good.
Even distribution No celebrity row that takes 90% of traffic. If you have one, sub-shard inside it.
Query alignment Most queries should know the shard key. If 90% of queries are WHERE user_id = ?, shard on user_id.

Common mistakes:

  • Sharding on created_at. All new writes hit one shard. Hot shard, idle fleet.
  • Sharding on a low-cardinality enum. tenant_id with 12 tenants → at most 12 shards, regardless of fleet size.
  • Sharding on the wrong column for the query. If shard key is user_id but 80% of queries are WHERE order_id = ?, every query becomes a scatter-gather.

Hot shards & celebrities

Even with a good shard key, you'll have outliers. The Justin Bieber problem: one celebrity has 100M followers, their writes are 100× a normal user. Whichever shard owns Justin gets melted.

Mitigations:

  • Sub-sharding (split-and-spread). Detect the hot row, split its data across multiple shards keyed by (user_id, sub_partition). Costs you a fanout on reads but distributes the writes.
  • Replication for reads. If the hot row is read-heavy more than write-heavy, replicate it to read replicas across shards.
  • Caching. Hot reads should rarely hit the DB at all — Redis in front, with very short TTLs.
  • Write coalescing. Batch writes to hot rows in the application before persisting (counter increments are the canonical case).

Resharding

The day comes when N shards is no longer enough. The pain is proportional to how much you planned for it on day one.

Naive userid % N: ~80% of rows move when you go from 4 → 5 shards. Plan for hours of downtime or a complex dual-write migration.

Consistent hashing: ~20% of rows move when you go from 4 → 5. Migration can be done online without dual-writes.

Range sharding: Pre-create future ranges. When S3 gets full, split it into S3a (old range) and S3b (new range). No rewrite of existing data.

Vitess-style topology: Logical keyspaces decoupled from physical shards. You add a physical shard, the keyspace router redistributes incrementally in the background.

The hard-earned rule: resharding is a constant tax at scale. Either pay it incrementally with consistent-hashing/Vitess from day one, or pay it in giant traumatic migrations every few years.

Cross-shard joins

If your query needs to join two tables that are sharded on different keys, you have three options:

  1. Co-locate. Shard both tables on the same key (e.g. orders + users both sharded on user_id). The join stays single-shard.
  2. Denormalise. Pre-join at write time. Store user.name on every order row. Saves a join, costs storage + invalidation pain.
  3. Scatter-gather + merge. The router asks every shard, collects, joins in memory. 10–100× slower; OK for batch jobs, not for hot paths.

The common architectural choice: co-locate the hot-path joins, scatter-gather for the analytics-y stuff (or pump it to a separate analytical store entirely).

Common tools

  • Vitess — MySQL sharding control plane. Powers Slack, GitHub, YouTube. Topology, routing, online resharding.
  • Citus — Postgres extension. Hash + range partitioning, distributed transactions, columnar tables.
  • MongoDB — Built-in sharded cluster. Hashed shards, ranged shards, zone sharding for geo.
  • Cassandra / ScyllaDB — Consistent hashing under the hood. Tunable consistency, multi-region built in.
  • DynamoDB — Managed AWS, partition-key + sort-key. Hides the consistent hashing entirely.
  • PlanetScale — Managed Vitess. Schema branching, zero-downtime migrations.

Decision tree

A reasonable progression:

  1. Don't shard. A single Postgres can take you very, very far. Do you actually need it?
  2. Vertical scale first. A bigger box buys you 5–10×.
  3. Read replicas next. If reads are the bottleneck, fan out reads.
  4. Sharding last — and when you do it, use consistent hashing or a Vitess-style abstraction so future resharding is incremental.

Almost no team that thinks they "need to shard" actually does, until quite late. But the ones that genuinely do need it — and pick the wrong strategy — pay for it forever.

Tools in the wild

4 tools
  • Vitessfree tier

    MySQL sharding control plane. Powers YouTube, Slack, GitHub. Topology-aware routing.

    service
  • Citusfree tier

    Postgres extension that turns it into a sharded distributed DB. Range + hash partitioning.

    service
  • Managed Vitess. Sharding, branching, schema diffs without downtime.

    service
  • Built-in sharded cluster: shard keys, balancer, hashed/ranged shards, zone sharding for geo.

    service