Load Balancing
Round-robin, least-connections, weighted, hash — and why each exists.
Load Balancing
A load balancer sits in front of a pool of backend servers and decides where each incoming request goes. Get the algorithm right and the system scales horizontally with predictable latency. Get it wrong and one server melts while the others stay idle.
Why load balancers exist
Three jobs, in roughly this order of importance:
- Distribute load so no single server becomes the bottleneck.
- Hide failures from the client — when a backend dies, route around it.
- Decouple client connections from backend connections so you can add/remove servers without anyone reconnecting.
The third one is the under-appreciated reason. The LB owns the public address. Backends can scale, restart, deploy, get replaced — the client never knows.
The four algorithms you must know
Round-robin
Each request goes to the next server in the list, cycling forever. req1 → A, req2 → B, req3 → C, req4 → A, ...
| When it works | When it doesn't |
|---|---|
| Servers are identical in capacity | Servers vary (one box has more cores) |
| Requests are roughly the same cost | Some requests take 10ms, others take 5s |
| You want predictable, debuggable distribution | Workload is bursty per-server |
It's the default for a reason — it's stupid simple and performs fine when its assumptions hold.
Least-connections
Each request goes to whichever server is currently holding the fewest in-flight requests.
This is the algorithm you want when request cost varies wildly. If /api/search takes 5s and /api/healthcheck takes 2ms, round-robin will randomly clobber whichever box happens to receive 3 search requests in a row. Least-connections naturally smooths this out: the box currently bogged down in long requests stops getting new ones until it catches up.
The variant least-outstanding-requests (used by AWS ALB) is the same idea — it counts requests that haven't returned a response yet.
Weighted round-robin (or weighted least-connections)
You assign each server a numeric weight. The algorithm picks proportionally:
upstream api {
server api-1.local weight=4; # new box, 32 cores → 4× share
server api-2.local weight=4;
server api-3.local weight=1; # old box, 8 cores → 1× share
}
Use this when your fleet is heterogeneous — old boxes mixed with new, on-demand instances mixed with spot, multiple instance sizes. Set weights proportional to capacity.
Consistent hashing (a.k.a. sticky sessions, session affinity)
You hash a key — user id, session cookie, file id — into the ring of servers. The same key always maps to the same server, even as servers join or leave the pool.
Use this when state lives on the server:
- A multi-part file upload — chunks 1–5 must hit the same box.
- An in-memory cache — same user → same box → cache hits, not misses.
- A WebSocket connection — once the connection is up, all messages on it stay there.
The clever trick (the actual "consistent" part) is that adding a 5th server only re-routes ~1/5 of the keys, not all of them. Naive hash(key) % n would shuffle everyone whenever n changes.
Health checks
Every load balancer worth using continuously probes its backends:
GET /healthz HTTP/1.1
Three failed probes → mark the backend "unhealthy" → stop sending traffic. Three successful probes after that → mark it back to healthy.
Two pitfalls:
- Make the probe deep enough to catch real failure. A
/healthzthat returns 200 even when the database is down is a lie. Better: probe a real endpoint that touches the dependencies you care about. - But not so deep that healthchecks cause load. A probe that runs a 500ms aggregation every 5 seconds across 100 LBs is a serious cost.
The sweet spot: a /healthz that pings the DB with a SELECT 1, returns 200 if it works, 503 if it doesn't. Cheap, honest, fast.
Layer 4 vs Layer 7
There are two places an LB can sit:
- Layer 4 (TCP/UDP) — the LB sees source/dest IP+port and forwards bytes. It doesn't read the HTTP request. Faster, simpler, can balance any TCP protocol.
- Layer 7 (HTTP) — the LB parses the HTTP request and can route based on path, header, cookie, method. Slower, but enables path-based routing, host-based routing, header rewrites, TLS termination.
Most modern stacks run L7 at the edge (NGINX, ALB, Envoy) for HTTP smarts and L4 internally (kube-proxy, IPVS) for raw throughput.
What can go wrong
Thundering herd on restart. When you restart 50% of a fleet at once, the surviving 50% takes 100% of the load. If they were already at 70%, they'll hit 140% and tip over. Roll deploys carefully (10% at a time, wait for steady state, then continue).
Slow start. A freshly started backend has empty caches, JIT not warm, connection pool not yet established. Sending it full traffic immediately tips it over. Most production LBs have a "slow start" feature — ramp traffic from 1% → 100% over the first 30 seconds.
Sticky-session traps. If you bind a user to a server and that server dies, the user has a bad time — their session vanishes. Combine sticky sessions with session storage in Redis so any server can pick up the work.
Picking an algorithm
A reasonable default progression:
- Start with round-robin + identical instance type. Fine for most services.
- Move to least-connections when you see uneven load (
p99on box A is 3× box B). - Move to weighted when the fleet stops being uniform.
- Add consistent hashing only when you need server affinity — almost always for state.
Avoid premature optimisation. The number of services that "need" anything fancier than round-robin is smaller than people think — but for the ones that do, pick correctly the first time.
Common tools in production
- NGINX / HAProxy — open-source L7/L4. Run them yourself.
- AWS ALB / GCP HTTPS LB / Azure App Gateway — managed L7 with autoscaling, TLS, WAF integration.
- AWS NLB — managed L4, used in front of ALBs or for non-HTTP services.
- Envoy — modern proxy with xDS dynamic config; what Istio and Consul Connect use under the hood.
- kube-proxy / IPVS — Kubernetes' internal L4 LB, used for
Serviceresources.
Diagram conventions
When you draw a load balancer in an architecture diagram:
┌────────────┐
Client ──▶│ LB │──┬──▶ Server 1
│ │ ├──▶ Server 2
└────────────┘ └──▶ Server 3
A box with three forks. Label the LB with the algorithm (RR, LC, Weighted) if it matters for the discussion. In Mermaid:
flowchart LR
Client --> LB[Load Balancer<br/>least-conn]
LB --> S1[Server 1]
LB --> S2[Server 2]
LB --> S3[Server 3]
That's the diagram you'll draw in 90% of system-design interviews. Get fluent at it.
Tools in the wild
4 tools- serviceNGINXfree tier
The reference open-source L7 load balancer. Weighted, least-conn, hash, fair.
- serviceHAProxyfree tier
Higher-throughput L4/L7 LB; the choice when nginx can't keep up.
- service
Managed L7 load balancing — health checks, autoscaling, TLS termination.
- serviceEnvoyfree tier
Modern proxy used by Istio/Consul; xDS-driven dynamic config.