Capacity Planning
Knowing what breaks first, before it does.
A system that runs out of capacity does not gracefully slow down. It falls off a cliff. Capacity planning is the discipline of knowing where the cliff is, when you'll arrive at it, and what your options are when you do.
The work is not "buy more servers." The work is knowing what breaks first under projected load, and choosing between four options well before the wall: scale up, rate-limit, redesign, or accept the SLO breach.
Analogy
Picture a coffee shop with three baristas, two espresso machines, and one credit-card reader. On a quiet morning the limits don't matter — the baristas are idle, the machines are warm, the reader sits there. As the morning rush builds, you discover which limit hits first. Maybe the credit-card reader is slow and a queue forms while customers pay. The baristas could pour faster but they're waiting. Adding a fourth barista doesn't help. Adding a second reader does.
Capacity planning is the same. Every system has multiple capacity dimensions, and only one is the binding constraint at any given load. Adding capacity to a non-binding dimension changes nothing.
Little's law
The single most useful equation in capacity planning is Little's law:
L = λ × W
Where L is the average number of items in the system, λ is the arrival rate (items per second), and W is the average time each item spends in the system. It is exact for any stable queueing system in steady state, regardless of arrival distribution or service distribution.
Worked example: a service handles 100 requests per second with an average response time of 200 milliseconds. Then on average there are 100 × 0.2 = 20 requests in flight at any moment. The function-concurrency budget must be at least 20, plus a safety margin.
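The arithmetic is small enough to script. A minimal TypeScript sketch, using the numbers from the worked example (the 1.5× margin is an assumption here, echoing the headroom default later in this lesson):

```ts
// Little's law: L = λ × W
const arrivalRate = 100;     // λ: requests per second
const avgTimeInSystem = 0.2; // W: 200 ms, expressed in seconds
const inFlight = arrivalRate * avgTimeInSystem; // L: 20 requests in flight

// Provision the concurrency budget above the floor.
// The 1.5 multiplier is an assumed safety margin, not a standard.
const concurrencyBudget = Math.ceil(inFlight * 1.5);
console.log({ inFlight, concurrencyBudget }); // { inFlight: 20, concurrencyBudget: 30 }
```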
This sets a floor. The next question is: what's the ceiling?
Three capacity dimensions
Every service has at least three independent capacity ceilings that scale differently:
Compute — CPU, RAM, function invocations, container slots. Bursty (cold starts cost time). Often the easiest to scale by adding instances, but hardest to scale instantly.
Connections — database connection pool, in-flight HTTP requests, websocket count. Hard cap. Adding instances doesn't help if every instance pulls from the same pool.
Storage and IOPS — KV ops/sec, database writes/sec, network bandwidth. Often vendor-tier-bounded — you don't add capacity by deploying more code, you upgrade a tier (and write a check).
The binding constraint at low load is usually compute. As load grows, connections often become the bottleneck. At very high load, IOPS or bandwidth wins.
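To find the binding constraint, put every ceiling in the same unit (the request rate at which it saturates) and take the minimum. A sketch with illustrative numbers, none of them from a real system:

```ts
// Convert each capacity dimension to "req/s at saturation".
// Connections convert via Little's law in reverse: λmax = L / W.
const ceilings = {
  compute: 1000 / 1,     // 1000 invocation slots, 1 per request → 1000 req/s
  connections: 50 / 0.2, // 50 pooled connections, held 200 ms each → 250 req/s
  kvOps: 3000 / 10,      // 3000 KV ops/s, 10 ops per request → 300 req/s
};

const [name, maxRate] = Object.entries(ceilings)
  .reduce((min, entry) => (entry[1] < min[1] ? entry : min));
console.log(`binding constraint: ${name} at ${maxRate} req/s`);
// → binding constraint: connections at 250 req/s
```

Here connections bind first, so adding compute instances would be the fourth barista: it changes nothing.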
Headroom
Headroom is the gap between current load and capacity-at-which-bad-things-happen. Industry default for critical paths is around 50% headroom — meaning if your peak observed load is X, you provision for 2X.
Why so much? Two reasons.
The first is variance. Peak observed load is not the same as peak possible load. A 7-day measurement window misses the once-a-month event.
The second is queueing. Utilisation above 80% causes wait time to grow hyperbolically. We cover the math in lesson 08, but the upshot is that running a service at 95% utilisation makes every request slow, even though no individual capacity ceiling has been hit.
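Lesson 08 derives this properly. As a preview, here is the simplest queueing model, M/M/1 (an assumption; real traffic is burstier), where the mean time in system is W = 1 / (μ − λ):

```ts
const mu = 100; // μ: service rate, 100 req/s (illustrative)
for (const rho of [0.5, 0.8, 0.9, 0.95, 0.99]) {
  const lambda = rho * mu;     // arrival rate at this utilisation
  const w = 1 / (mu - lambda); // M/M/1 mean time in system, in seconds
  console.log(`utilisation ${rho} → ${(w * 1000).toFixed(0)} ms`);
}
// 0.5 → 20 ms, 0.8 → 50 ms, 0.9 → 100 ms, 0.95 → 200 ms, 0.99 → 1000 ms
```

Going from 80% to 95% utilisation quadruples the time in system, with no individual ceiling hit.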
Growth projection
Three growth shapes drive different planning postures:
Linear — steady acquisition. Plan for current load plus growth rate × planning horizon. Easy.
Exponential — viral or compounding. Plan for the next doubling, not the current rate. Hard, because you can't out-provision exponential growth indefinitely. At some point you must redesign.
Step function — a marketing campaign, a product launch, a viral mention. The hardest to plan for because the step is short-lived, and over-provisioning permanently is wasteful. This is the case where rate-limiting often beats scaling.
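Each shape turns into time-to-ceiling arithmetic. A sketch with assumed numbers:

```ts
// Weeks until load reaches the binding ceiling, per growth shape.
function weeksUntilLinear(load: number, cap: number, addedPerWeek: number): number {
  return (cap - load) / addedPerWeek;
}
function weeksUntilExponential(load: number, cap: number, weeklyGrowth: number): number {
  // load × (1 + g)^t = cap  →  t = ln(cap / load) / ln(1 + g)
  return Math.log(cap / load) / Math.log(1 + weeklyGrowth);
}

// Illustrative: 200 req/s today, binding ceiling at 500 req/s.
console.log(weeksUntilLinear(200, 500, 10));        // 30 weeks at +10 req/s per week
console.log(weeksUntilExponential(200, 500, 0.10)); // ≈9.6 weeks at 10% per week
```

The step function gets no formula; the step arrives when marketing says it does.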
The capacity decision tree
When projection shows tight capacity, four options:
- Scale up — add instances or upgrade machine size. Quickest. Costs money, and you have to remember to scale back down after the spike.
- Rate-limit — accept some load, reject the rest with a clear error. Cheaper than scaling, costs you customers (or just slows them down) during the limit window. Right answer for short, predictable spikes.
- Redesign — queueing layer, async processing, sharding, caching. Highest cost, longest to build, only option that makes the next 10× sustainable.
- Accept the SLO breach — burn error budget. Right answer when the error budget is healthy and the projected overrun is small (you have budget; spend it on the spike).
The wrong answer is panicking and choosing without doing the math.
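The tree is simple enough to write down. The thresholds below are illustrative assumptions, not industry constants:

```ts
type Action = "scale-up" | "rate-limit" | "redesign" | "accept-breach";

function decide(p: {
  growthIsExponential: boolean;  // redesign territory: you can't out-provision it
  projectedOverrunPct: number;   // e.g. 0.03 = 3% over capacity
  errorBudgetRemainingPct: number;
  spikeDurationHours: number;    // how long the excess load lasts
}): Action {
  if (p.growthIsExponential) return "redesign";
  if (p.projectedOverrunPct <= 0.05 && p.errorBudgetRemainingPct >= 0.5)
    return "accept-breach";      // small overrun, healthy budget: spend it
  if (p.spikeDurationHours <= 24)
    return "rate-limit";         // short spike: don't buy permanent capacity
  return "scale-up";             // sustained, linear growth: provision for it
}
```

Whatever thresholds you choose, choose them before the spike. That is the math the panicking team skips.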
In the playground
The Capacity Lab playground asks you to provision compute, DB connections, and KV ops before you know the traffic shape. Then a scenario plays out — Tuesday-normal, marketing spike, or Black Friday — and you watch your service handle (or drop) the load. Win by surviving all three under budget.
Tools in the wild
- Kubecost (service, free tier): Cost + capacity allocation per Kubernetes namespace, deployment, and pod.
- Karpenter (library, free tier): Fast node autoscaler for EKS — picks optimal instance types by pending pod requirements.
- KEDA (library, free tier): Event-driven autoscaler for Kubernetes — scale on queue depth, Kafka lag, cron, etc.
- Goldilocks (library, free tier): Fairwinds tool that recommends right-sized CPU/memory requests from VPA history.
- AWS Compute Optimizer (service, free tier): Analyzes EC2/RDS/Lambda usage and recommends instance-size + family changes.
- Vegeta (CLI, free tier): HTTP load tester — drives constant request rate to find the knee in the latency curve.