sre · level 8

Queueing Theory Basics

Why running hot makes everything worse.

200 XP

If you've ever wondered why "we're only at 90% utilisation, why are things so slow," queueing theory has the answer. At 90% utilisation, the wait is 9× the service time: more than double the wait at 80%, and nine times the wait at 50%.

The math is unforgiving. Wait time grows hyperbolically as utilisation approaches 100%. Run hot at your peril.

Analogy

Coffee shop, 3 baristas, each pouring a drink in 3 minutes. So the shop can serve at most one customer per minute. If customers arrive at one per minute, you're at 100% utilisation — and the queue grows without bound. Get there at the wrong moment and you wait forever.

If they arrive at one every 2 minutes, you're at 50% utilisation. The queue empties between arrivals. Wait time is roughly one service time.

If they arrive at one every 1.25 minutes, you're at 80%. Wait time is now 4× the service time. Same baristas, same machines — just less slack.

Now imagine you're the manager looking at "average wait." It went from 3 minutes to 12 minutes. The variance is huge. Some customers waited 30+ minutes. They left.

Queues are everywhere

Load balancer wait queues. Database connection pools. OS run queues. Vercel function concurrency. Every place where work can arrive faster than it's served is a queue, and every queue has the same fundamental dynamics.

When you see "wait time spikes when load is high," the load itself isn't the problem. The utilisation passing 80% is the problem.

Little's law in depth

L = λ × W

L is the average concurrency (items in the system). λ is the arrival rate. W is the mean time-in-system (wait + service).

This is exact for any stable queueing system in steady state. It doesn't care about the arrival distribution, the service distribution, the queue discipline. Universal.

Two practical uses:

  • Forward: I know λ and W; how many concurrent slots do I need? L = λW.
  • Backward: I see L=20 in flight and λ=100/sec arrival; what's the average response time? W = L/λ = 0.2 sec.

If you only remember one equation from queueing theory, remember this one.
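Both directions of Little's law fit in a couple of lines of Python; the function names here are illustrative, not from any library:

```python
def concurrency(arrival_rate, time_in_system):
    """Little's law, forward: L = lambda * W."""
    return arrival_rate * time_in_system

def time_in_system(in_flight, arrival_rate):
    """Little's law, backward: W = L / lambda."""
    return in_flight / arrival_rate

# Forward: 100 req/sec arriving, each spending 0.2 sec in the system
# -> need ~20 concurrent slots on average.
print(concurrency(100, 0.2))    # 20.0

# Backward: 20 requests in flight at 100/sec arrival
# -> mean time-in-system is 0.2 sec.
print(time_in_system(20, 100))  # 0.2
```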

M/M/1 closed form

For the simplest case — Poisson arrivals (memoryless interarrival), exponential service (memoryless service times), single server — wait time has a clean closed form:

Wq = ρ / (μ × (1 − ρ))

Where ρ = λ/μ (utilisation) and μ is service rate.

The (1 − ρ) in the denominator is what makes wait time go hyperbolic as ρ → 1. The closer to fully-loaded, the smaller the denominator, the larger the wait.
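A quick sketch of the formula in Python, using the coffee-shop numbers (3-minute service time, so μ = 1/3 per minute); the function name is mine:

```python
def mm1_wait(arrival_rate, service_rate):
    """Mean time waiting in queue for M/M/1: Wq = rho / (mu * (1 - rho))."""
    rho = arrival_rate / service_rate
    if rho >= 1:
        raise ValueError("unstable: utilisation must be < 1")
    return rho / (service_rate * (1 - rho))

mu = 1 / 3  # one 3-minute drink per barista-minute
for rho in (0.5, 0.8, 0.9):
    wq = mm1_wait(rho * mu, mu)
    print(f"rho={rho}: wait {wq:.0f} min = {wq * mu:.0f}x service time")
```

At ρ = 0.5 the wait is 3 minutes (1× the service time); at 0.8 it is 12 minutes (4×); at 0.9 it is 27 minutes (9×), the same numbers as the coffee-shop analogy.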

The utilisation hockey stick

Concrete numbers for M/M/1:

ρ (utilisation)   Wait time multiplier
0.50              1× service time
0.70              2.3×
0.80              4×
0.90              9×
0.95              19×
0.99              99×

Plot it and you get the famous hockey stick. Below 80% the curve is gentle. Above 80% it explodes.

Why 80% is the heuristic

The 80% number isn't from a textbook. It's roughly where the wait curve stops being forgiving: below 80%, added load increases wait gradually; past it, each extra point of utilisation multiplies the wait. Doubling load from 40% to 80% makes the wait about 6× worse; doubling from 45% to 90% makes it about 11× worse.

Industries that take queueing seriously (call centres, hospitals, datacentres) target 70-80% utilisation on critical paths, and will compromise on almost anything else to stay there.

Multiple servers (M/M/c)

With c servers and the same total μ, wait time goes down — diminishing returns above 3-4 servers per queue. The Erlang-C formula gives the exact result; for practical work, just remember:

  • 2 servers ≈ 2× better than 1 at the same per-server utilisation.
  • 4 servers ≈ 1.5× better than 2.
  • 8 servers ≈ 1.2× better than 4.

The same effect cuts the other way: splitting one big queue into many small queues (sharding) often makes things worse, because a sharded setup can have customers waiting in one queue while a server in another sits idle.
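The pooling effect can be checked with the standard Erlang-C formula. This sketch (helper names are mine) computes the mean queue wait for M/M/c at the same 80% per-server utilisation; each shard of a sharded setup behaves like the c=1 line, while the pooled queue's wait keeps shrinking:

```python
import math

def erlang_c(c, a):
    """Erlang-C: probability an arrival must wait, given c servers
    and offered load a = lambda/mu (in erlangs)."""
    rho = a / c
    top = (a ** c / math.factorial(c)) * (1 / (1 - rho))
    bottom = sum(a ** k / math.factorial(k) for k in range(c)) + top
    return top / bottom

def mmc_wait(c, arrival_rate, service_rate):
    """Mean queue wait for M/M/c: Wq = C(c, a) / (c*mu - lambda)."""
    a = arrival_rate / service_rate
    return erlang_c(c, a) / (c * service_rate - arrival_rate)

mu = 1.0
for c in (1, 2, 4, 8):
    lam = 0.8 * c * mu  # hold per-server utilisation at 80%
    print(f"c={c}: wait = {mmc_wait(c, lam, mu):.2f}x service time")
```

At 80% utilisation the c=1 wait is 4× the service time; pooling two servers drops it to about 1.8×, and larger pools drop it further.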

Implications for capacity planning

Tie this back to lesson 05: when you provision capacity, you're really provisioning utilisation. Adding 50% headroom on a critical path means targeting 50-67% utilisation. Adding 100% headroom means targeting 33-50%. The numbers feel wasteful — half the machine sitting idle? — until you've watched a service melt down at 92% utilisation and remembered the hockey stick.
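The headroom-to-utilisation arithmetic is one line; a hypothetical helper to make it explicit:

```python
def target_utilisation(headroom):
    """Provision (1 + headroom) x expected load, so steady-state
    utilisation at expected load is 1 / (1 + headroom)."""
    return 1 / (1 + headroom)

print(f"{target_utilisation(0.5):.0%}")   # 50% headroom -> 67% utilisation
print(f"{target_utilisation(1.0):.0%}")   # 100% headroom -> 50% utilisation
```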

In the playground

The Run a Coffee Shop playground staffs N baristas at a price you choose ($2-$8, which drives the arrival rate). A 4-hour shift simulates customers arriving Poisson, getting served, and sometimes walking out. Win condition: average wait < 5 min AND profit > 0.
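A stripped-down version of that kind of simulation, assuming Poisson arrivals and exponential service and omitting pricing and walkouts (all names here are illustrative, not the playground's actual code):

```python
import heapq
import random
import statistics

def simulate_shift(baristas, arrival_rate, service_time_mean=3.0,
                   shift_minutes=240, seed=1):
    """Toy M/M/c shift: Poisson arrivals, exponential service, FIFO.
    Returns the mean customer wait in minutes."""
    rng = random.Random(seed)
    free_at = [0.0] * baristas   # min-heap: when each barista is next free
    heapq.heapify(free_at)
    t, waits = 0.0, []
    while True:
        t += rng.expovariate(arrival_rate)       # next Poisson arrival
        if t > shift_minutes:
            break
        start = max(t, heapq.heappop(free_at))   # earliest free barista
        waits.append(start - t)
        heapq.heappush(free_at, start + rng.expovariate(1 / service_time_mean))
    return statistics.mean(waits)

# 3 baristas, 3-minute drinks: capacity is 1 customer/min.
print(simulate_shift(3, arrival_rate=0.5))  # 50% utilisation: short waits
print(simulate_shift(3, arrival_rate=0.9))  # 90% utilisation: much longer
```

Averaging a few seeded shifts reproduces the hockey stick: sub-minute waits at 50% utilisation, several minutes at 90%.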

Tools in the wild

5 tools

  • k6 (free tier, cli)
    JS-scripted load tester from Grafana Labs; run constant-arrival-rate scenarios for queueing experiments.
  • Vegeta (free tier, cli)
    Constant-rate HTTP attacker; perfect for measuring p99 vs utilisation curves.
  • wrk2 (free tier, cli)
    Gil Tene's load tester that fixes the coordinated-omission problem in latency reporting.
  • Locust (free tier, library)
    Python-scripted, distributed load generator with a real-time web UI.
  • Apache JMeter (free tier, cli)
    Long-standing GUI/CLI load tool, ubiquitous in enterprise performance teams.