Queueing Theory Basics
Why running hot makes everything worse.
If you've ever wondered "we're only at 90% utilisation, why are things so slow?", queueing theory has the answer. At 90% the wait isn't slightly worse than at 80%: it's more than double, and 9× the wait at 50%.
The math is unforgiving. Wait time grows hyperbolically as utilisation approaches 100%. Run hot at your peril.
Analogy
Coffee shop, 3 baristas, each pouring a drink in 3 minutes. So the shop can serve at most one customer per minute. If customers arrive at one per minute, you're at 100% utilisation — and the queue grows without bound. Get there at the wrong moment and you wait forever.
If they arrive at one every 2 minutes, you're at 50% utilisation. The queue empties between arrivals. Wait time is roughly one service time.
If they arrive at one every 1.25 minutes, you're at 80%. Wait time is now 4× the service time. Same baristas, same machines — just less slack.
Now imagine you're the manager looking at "average wait." It went from 3 minutes to 12 minutes. The variance is huge. Some customers waited 30+ minutes. They left.
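The whole analogy fits in a short discrete-event simulation (a sketch: `simulate` and its parameters are made up here, assuming Poisson arrivals and exponential service times; the exact averages differ a little from the rounded numbers above, but the shape is identical):

```python
import heapq
import random

def simulate(arrival_rate, n_servers=3, service_mean=3.0,
             n_customers=100_000, seed=1):
    """FIFO coffee shop: Poisson arrivals, exponential service times."""
    rng = random.Random(seed)
    free_at = [0.0] * n_servers          # min-heap: when each barista frees up
    t = total_wait = 0.0
    for _ in range(n_customers):
        t += rng.expovariate(arrival_rate)        # next arrival time
        soonest = heapq.heappop(free_at)          # first barista available
        start = max(t, soonest)
        total_wait += start - t                   # time spent queueing
        heapq.heappush(free_at, start + rng.expovariate(1 / service_mean))
    return total_wait / n_customers

# 3 baristas x 3-minute drinks => capacity of 1 customer/minute.
for rate, label in [(0.5, "50%"), (0.8, "80%"), (0.95, "95%")]:
    print(f"{label} load: average wait {simulate(rate):.1f} min")
```

Pushing the arrival rate from half of capacity to 95% of capacity blows the average wait up by an order of magnitude, with the same three baristas.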
Queues are everywhere
Load balancer wait queues. Database connection pools. OS run queues. Vercel function concurrency. Every place where work arrives faster than a single worker can handle is a queue, and every queue has the same fundamental dynamics.
When you see "wait time spikes when load is high," the load itself isn't the problem. The utilisation passing 80% is the problem.
Little's law in depth
L = λ × W
L is the average concurrency (items in the system). λ is the arrival rate. W is the mean time-in-system (wait + service).
This is exact for any stable queueing system in steady state. It doesn't care about the arrival distribution, the service distribution, the queue discipline. Universal.
Two practical uses:
- Forward: I know λ and W; how many concurrent slots do I need? L = λW.
- Backward: I see L = 20 in flight and λ = 100/sec arrivals; what's the average response time? W = L/λ = 0.2 sec.
If you only remember one equation from queueing theory, remember this one.
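Both directions fit in a few lines (a sketch; `littles_law` is a made-up helper):

```python
def littles_law(l=None, lam=None, w=None):
    """Solve L = lam * W for whichever argument is left as None."""
    if l is None:
        return lam * w      # forward: concurrency needed
    if lam is None:
        return l / w
    return l / lam          # backward: mean time-in-system

# Forward: 100 req/sec arriving, 0.25 sec mean time-in-system.
print(littles_law(lam=100, w=0.25))   # 25.0 concurrent slots
# Backward: L = 20 in flight at 100/sec arrivals.
print(littles_law(l=20, lam=100))     # 0.2 sec average response time
```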
M/M/1 closed form
For the simplest case — Poisson arrivals (memoryless interarrival), exponential service (memoryless service times), single server — wait time has a clean closed form:
Wq = ρ / (μ × (1 − ρ))
Where ρ = λ/μ (utilisation) and μ is service rate.
The (1 − ρ) in the denominator is what makes wait time go hyperbolic as ρ → 1. The closer to fully-loaded, the smaller the denominator, the larger the wait.
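Plugged into code (a sketch; `mm1_wait` is a made-up helper):

```python
def mm1_wait(lam, mu):
    """Mean time in queue for M/M/1: Wq = rho / (mu * (1 - rho))."""
    rho = lam / mu
    if rho >= 1:
        raise ValueError("unstable: arrivals outpace service")
    return rho / (mu * (1 - rho))

# Service rate 10/sec, i.e. a 100 ms service time:
print(round(mm1_wait(lam=5, mu=10), 4))   # 0.1 sec: 1x service time at 50%
print(round(mm1_wait(lam=9, mu=10), 4))   # 0.9 sec: 9x service time at 90%
```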
The utilisation hockey stick
Concrete numbers for M/M/1:
| ρ (utilisation) | Wait time multiplier |
|---|---|
| 0.5 | 1× service time |
| 0.7 | 2.3× |
| 0.8 | 4× |
| 0.9 | 9× |
| 0.95 | 19× |
| 0.99 | 99× |
Plot it and you get the famous hockey stick. Below 80% the curve is gentle. Above 80% it explodes.
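The table regenerates itself from the closed form: Wq measured in units of the service time is ρ/(1 − ρ).

```python
for rho in (0.5, 0.7, 0.8, 0.9, 0.95, 0.99):
    # M/M/1: Wq = rho / (mu * (1 - rho)) = rho/(1-rho) service times
    print(f"rho = {rho:<5} wait = {rho / (1 - rho):5.1f} x service time")
```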
Why 80% is the heuristic
The 80% number isn't from a textbook; it's roughly where the curve turns from gentle to steep. Below 80%, extra load costs you low single-digit multiples of the service time. Above 80%, each halving of the remaining headroom roughly doubles the wait: 4× at 80%, 9× at 90%, 19× at 95%.
Industries that take queueing seriously (call centres, hospitals, datacentres) target 70-80% utilisation on critical paths, and will compromise on almost anything else to hold that target. It's that important.
Multiple servers (M/M/c)
With c servers sharing a single queue at the same per-server utilisation, wait time drops fast as c grows. The Erlang-C formula gives the exact result; at 90% per-server utilisation it works out to roughly:
- 1 server: 9× the service time.
- 2 servers: ≈ 4.3×.
- 4 servers: ≈ 2×.
- 8 servers: ≈ 0.9×.
Each doubling of the pool cuts the wait by more than half. That pooling gain is exactly what you give up when you split one big shared queue into many small per-server queues (sharding), which is why sharding a queue often makes things worse.
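For exact numbers, the Erlang-C formula is a few lines (a sketch; `erlang_c_wait` is a made-up name, using the standard iterative Erlang-B recurrence to avoid factorials):

```python
def erlang_c_wait(lam, mu, c):
    """Mean queueing delay Wq for an M/M/c queue via Erlang C."""
    a = lam / mu                      # offered load in Erlangs
    rho = a / c                       # per-server utilisation
    if rho >= 1:
        raise ValueError("unstable: offered load >= c")
    b = 1.0                           # Erlang B, built up iteratively
    for k in range(1, c + 1):
        b = a * b / (k + a * b)
    p_wait = b / (1 - rho * (1 - b))  # Erlang C: P(arrival must queue)
    return p_wait / (c * mu - lam)    # mean wait in queue

# Same 90% per-server utilisation, more servers sharing one queue:
for c in (1, 2, 4, 8):
    wq = erlang_c_wait(lam=0.9 * c, mu=1.0, c=c)
    print(f"c={c}: Wq = {wq:.2f} service times")
```

At 90% utilisation this prints roughly 9.00, 4.26, 1.97 and 0.88 service times: each doubling of the pool more than halves the wait.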
Implications for capacity planning
Tie this back to lesson 05: when you provision capacity, you're really provisioning utilisation. Adding 50% headroom on a critical path means targeting 50-67% utilisation. Adding 100% headroom means targeting 33-50%. The numbers feel wasteful — half the machine sitting idle? — until you've watched a service melt down at 92% utilisation and remembered the hockey stick.
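The headroom arithmetic is one line (a sketch; `target_utilisation` is a made-up helper, assuming load peaks at exactly the planned capacity):

```python
def target_utilisation(headroom):
    """Peak utilisation when you provision `headroom` extra capacity
    (0.5 = 50% headroom) and load reaches the planned peak."""
    return 1 / (1 + headroom)

for h in (0.25, 0.5, 1.0):
    rho = target_utilisation(h)
    # M/M/1 wait multiplier at that utilisation: rho / (1 - rho)
    print(f"{h:.0%} headroom -> {rho:.0%} utilisation, "
          f"{rho / (1 - rho):.1f}x service-time wait")
```

50% headroom lands you at the 67% end of the range; 100% headroom buys you the flat part of the curve.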
In the playground
Run a Coffee Shop: staff N baristas at a chosen price ($2-$8, which affects the arrival rate). A 4-hour shift simulates customers arriving Poisson-style, getting served, and sometimes walking out. Win condition: average wait < 5 min AND profit > 0.
Tools in the wild
- k6 (CLI, free tier): JS-scripted load tester from Grafana Labs; run constant-arrival-rate scenarios for queueing experiments.
- Vegeta (CLI, free tier): constant-rate HTTP attacker; perfect for measuring p99 vs utilisation curves.
- wrk2 (CLI, free tier): Gil Tene's load tester that fixes the coordinated-omission problem in latency reporting.
- Locust (library, free tier): Python-scripted, distributed load generator with a real-time web UI.
- Apache JMeter (CLI, free tier): long-standing GUI/CLI load tool, ubiquitous in enterprise performance teams.