Scaling
Read replicas, sharding, pools, and caches — when one server is not enough.
A single database server has limits: CPU, RAM, disk I/O, and network bandwidth. Real applications eventually hit at least one of them. Scaling is the set of techniques you reach for when that happens, presented here in roughly the order you should try them.
Analogy
A single database is like one post office branch in a growing town. At first the one clerk handles everything; eventually the line is out the door. Read replicas are satellite branches that keep a slightly delayed copy of the master records; most people just want to check whether a parcel has arrived (a read), and those branches handle that fine. A connection pool is the row of numbered PO boxes along the wall: a fixed set of slots that customers rent and reuse instead of forming a new queue at the counter every time. A cache is the "Will Call" shelf behind the desk where the same few frequently requested items sit pre-pulled. Sharding is giving up on one branch entirely: the town is split by zip code, each with its own post office handling its own mail.
The latency numbers you should know
| Operation | Approximate latency |
|---|---|
| L1 cache read | 0.5 ns |
| L2 cache read | 4 ns |
| RAM read | 100 ns |
| SSD random read | 100 µs |
| Network round trip (same datacenter) | 500 µs |
| Database query (simple, indexed) | 1–5 ms |
| Network round trip (across region) | 10–100 ms |
These numbers explain why caching works: a cache hit served from local memory costs about 100 ns, while a simple indexed query costs 1–5 ms (1,000,000–5,000,000 ns), roughly four orders of magnitude slower. Even a remote cache such as Redis, which pays a same-datacenter round trip of about 500 µs per hit, still comes back several times faster than the query it replaces.
Read replicas
Most applications read far more than they write; 95% reads is common. A read replica is a copy of the primary database that stays current via replication: the primary handles all writes and streams them to the replicas, which serve reads directly.
Benefits: read throughput scales horizontally and the primary carries less load. Limits: replicas lag the primary, typically by milliseconds but sometimes by seconds under heavy write load, so reads from replicas can be stale. Do not route balance checks or payment-confirmation reads to a replica; anything that must see its own writes belongs on the primary.
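A minimal routing sketch in Python with asyncpg, assuming hypothetical DSNs for the primary and the replica; the caller states whether staleness is acceptable, rather than the router trying to parse SQL:

```python
import asyncpg

PRIMARY_DSN = "postgres://app@db-primary/app"  # hypothetical DSN
REPLICA_DSN = "postgres://app@db-replica/app"  # hypothetical DSN

class RoutingDB:
    """Writes (and read-your-own-writes reads) go to the primary;
    staleness-tolerant reads go to a replica."""

    async def connect(self):
        self.primary = await asyncpg.create_pool(PRIMARY_DSN)
        self.replica = await asyncpg.create_pool(REPLICA_DSN)

    async def execute(self, sql, *args):
        # All writes go to the primary.
        return await self.primary.execute(sql, *args)

    async def fetch(self, sql, *args, fresh=False):
        # fresh=True forces the primary, e.g. for a balance check
        # right after a payment.
        pool = self.primary if fresh else self.replica
        return await pool.fetch(sql, *args)
```

The `fresh` flag is the important design choice: correctness-sensitive reads are opted in explicitly, so the default path scales out while payments and balances never see replica lag.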
Sharding (horizontal partitioning)
Sharding splits a large table across multiple database nodes. Each node holds a shard — a subset of rows.
Range sharding
Rows are split by key range: user IDs 1 to 1,000,000 go to shard 1, 1,000,001 to 2,000,000 to shard 2, and so on. Simple to reason about, and range queries stay local to one shard. Hot spots form if one range gets disproportionate traffic.
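A sketch of the lookup, assuming the hypothetical shard boundaries below; `bisect` finds the first boundary that covers the key:

```python
import bisect

# Hypothetical boundaries: shard i holds keys <= BOUNDS[i]
# (and greater than the previous bound).
BOUNDS = [1_000_000, 2_000_000, 3_000_000]

def range_shard(user_id: int) -> int:
    return bisect.bisect_left(BOUNDS, user_id)

assert range_shard(1) == 0
assert range_shard(1_000_000) == 0
assert range_shard(1_000_001) == 1
```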
Hash sharding
`shard = hash(user_id) % num_shards`. Distributes load evenly, but range queries are inefficient: you must query all shards and merge. Changing `num_shards` also remaps almost every key, which is the problem consistent hashing solves.
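A sketch of that remapping problem, using Python's `zlib.crc32` as a stand-in for a stable hash (the built-in `hash` is salted per process, so it is unusable for sharding):

```python
import zlib

def shard(user_id: int, num_shards: int) -> int:
    # Stable hash of the key, then modulo the shard count.
    return zlib.crc32(str(user_id).encode()) % num_shards

# Growing from 4 to 5 shards moves almost every key.
moved = sum(shard(uid, 4) != shard(uid, 5) for uid in range(100_000))
print(f"{moved / 100_000:.0%} of keys moved")  # ~80%
```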
Consistent hashing
A ring-based mapping of keys to nodes. Adding or removing a node remaps only a fraction of the keys (on average 1/n of the keyspace with n nodes), not all of them. This is why it underpins distributed caches and stores: Cassandra's token ring is a direct implementation, and Redis Cluster gets a similar effect from a fixed set of 16,384 movable hash slots.
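A minimal ring sketch with virtual nodes, again using `crc32` as the hash; real systems use more vnodes and a stronger hash, but the lookup logic is the same:

```python
import bisect
import zlib

class Ring:
    def __init__(self, nodes, vnodes=100):
        # Hash each node vnodes times onto the ring so keys spread
        # evenly and a removal rebalances across all survivors.
        self._points = sorted(
            (zlib.crc32(f"{node}#{i}".encode()), node)
            for node in nodes
            for i in range(vnodes)
        )
        self._hashes = [h for h, _ in self._points]

    def node_for(self, key: str) -> str:
        h = zlib.crc32(key.encode())
        # First point clockwise from the key, wrapping past the end.
        i = bisect.bisect(self._hashes, h) % len(self._points)
        return self._points[i][1]

ring = Ring(["cache-a", "cache-b", "cache-c"])
print(ring.node_for("user:42"))
```

Dropping `cache-b` leaves every key that mapped to `cache-a` or `cache-c` untouched; only `cache-b`'s share moves.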
Directory sharding
A lookup table maps each key to the shard that holds it. Maximum flexibility, but the directory itself becomes a bottleneck and a single point of failure.
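A sketch of the directory idea, with a hypothetical in-process dict standing in for a metadata store; production systems replicate the directory and cache it aggressively, since every query consults it first:

```python
# Hypothetical directory: key -> shard. Moving one tenant is a
# single directory update, the flexibility range/hash sharding lack.
DIRECTORY = {
    "tenant-acme": "shard-1",
    "tenant-globex": "shard-2",
    "tenant-initech": "shard-1",
}

def shard_for(tenant_id: str) -> str:
    return DIRECTORY[tenant_id]  # one extra lookup on every query
```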
Connection pools
Every open database connection consumes memory and file descriptors — on PostgreSQL, roughly 5–10 MB per connection. An application that opens a new connection per HTTP request can exhaust the database at a few hundred concurrent users.
A connection pool maintains a fixed number of long-lived connections and loans them to application requests. Common choices:
| Tool | Language/Context |
|---|---|
| PgBouncer | PostgreSQL sidecar proxy |
| Pgpool-II | Full proxy with load balancing |
| Prisma's pool | Built into the Prisma ORM |
| HikariCP | JVM |
| asyncpg built-in pool | Python async |
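A minimal pooling sketch with asyncpg's built-in pool, assuming a hypothetical DSN; the connections are created once and reused across requests:

```python
import asyncio
import asyncpg

async def main():
    # 5-20 long-lived connections shared by all request handlers.
    pool = await asyncpg.create_pool(
        "postgres://app@localhost/app",  # hypothetical DSN
        min_size=5,
        max_size=20,
    )
    # acquire() borrows a connection and returns it on exit,
    # so there is no per-request connect/teardown cost.
    async with pool.acquire() as conn:
        row = await conn.fetchrow("SELECT count(*) AS n FROM users")
        print(row["n"])
    await pool.close()

asyncio.run(main())
```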
Pool modes (PgBouncer's pool_mode settings): session (one server connection per client for the life of the session), transaction (the connection returns to the pool when each transaction ends, by commit or rollback), statement (returned after every statement, which is incompatible with multi-statement transactions).
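A sketch of a PgBouncer configuration in transaction mode, with hypothetical names and sizes:

```ini
; pgbouncer.ini (hypothetical names and sizes)
[databases]
app = host=127.0.0.1 port=5432 dbname=app

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
pool_mode = transaction   ; release server conn at transaction end
default_pool_size = 20    ; server connections per database/user pair
max_client_conn = 1000    ; clients multiplexed onto those 20
```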
Caching layers
A cache stores query results in memory, serving repeated reads without touching the database.
Common patterns:
- Cache-aside: the application checks the cache first, falls back to the database on a miss, and writes the result back into the cache (sketched below).
- Write-through: writes go to cache and database simultaneously.
- Read-through: cache handles the fallback transparently.
TTL (time-to-live) controls freshness: a short TTL means fresher data but more cache misses; a long TTL means fewer misses but a higher risk of serving stale data.
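A cache-aside sketch using redis-py, assuming a hypothetical `query_db` function and a 60-second TTL:

```python
import json
import redis

r = redis.Redis()  # localhost:6379 by default

def get_user(user_id: int) -> dict:
    key = f"user:{user_id}"
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)       # hit: skip the database
    user = query_db(user_id)            # miss: hypothetical DB call
    r.setex(key, 60, json.dumps(user))  # cache for 60 s (the TTL)
    return user
```

Stale reads are bounded by the TTL: after a write, `user:{id}` can be wrong for at most 60 seconds, or you can delete the key on write to tighten that window.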
Redis and Memcached are the dominant choices. Redis supports richer data structures (sorted sets, pub/sub) and persistence; Memcached is simpler and faster for pure key-value.
When to reach for what
| Problem | First tool to try |
|---|---|
| Too many reads | Read replica |
| Rows too numerous for one node | Sharding |
| Too many connections | Connection pool |
| Same data queried repeatedly | Cache |
| Write throughput limited | Sharding or write batching |
Scale vertically first (bigger machine). Horizontal scaling adds operational complexity. Each step up is a trade-off: you gain throughput and lose simplicity.
The playground
Design a scaling plan for a seeded read-heavy application. Enable or disable components (replica, pool, cache, shard) and see how the architecture diagram morphs and how the validator scores your choices for that workload profile.