Load Balancer Anatomy

L4 vs L7 internals — conntrack, hashing, slow start, DSR.

The "Load Balancing" lesson in System Design covers algorithms (round-robin, least-conn, weighted, hash). This lesson is about what the LB actually does at the OS level — connections, kernel tables, MAC rewrites, and the operational quirks that bite production.

L4 vs L7 — what each sees

A Layer-4 LB sits at the transport layer. It sees:

  • The 5-tuple: (src_ip, src_port, dst_ip, dst_port, protocol).
  • TCP flags (SYN, ACK, FIN, RST).
  • Bytes flying past, but it doesn't parse them.

It cannot do path-based routing, host-based routing, header rewrites, or HTTP retries — those are application-level. It CAN do raw throughput at hundreds of Gbps with cheap hardware.

A Layer-7 LB terminates TLS, parses HTTP, and routes by:

  • Host header (api.example.com → users-svc, static.example.com → CDN bucket).
  • Path prefix (/v1/users → users-svc, /v1/orders → orders-svc).
  • Method (GET → cache-friendly path, POST → durable path).
  • Custom headers (A/B testing, canary routing).

It can rewrite requests, retry on 5xx, shed load with circuit breakers, and modify response headers: all of HTTP's semantics are on the table. That costs CPU and limits throughput compared to L4.
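To make the L7 routing decision concrete, here is a minimal sketch of host plus path-prefix routing. The `Request` type, the `ROUTES` table, and the pool names are illustrative assumptions built from the examples above, not any particular proxy's configuration model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Request:
    host: str
    path: str
    method: str = "GET"


# Hypothetical routing table: (Host header, path prefix, backend pool).
ROUTES = [
    ("api.example.com", "/v1/users", "users-svc"),
    ("api.example.com", "/v1/orders", "orders-svc"),
    ("static.example.com", "/", "cdn-bucket"),
]


def route_l7(req: Request) -> Optional[str]:
    """Return the pool with the longest matching path prefix for this host."""
    matches = [
        (prefix, pool)
        for host, prefix, pool in ROUTES
        if host == req.host and req.path.startswith(prefix)
    ]
    if not matches:
        return None  # a real proxy would answer 404 or use a default pool
    # Longest prefix wins, so more specific routes shadow general ones.
    return max(matches, key=lambda m: len(m[0]))[1]
```

None of the inputs to `route_l7` are visible to an L4 LB: the Host header and path only exist after TLS is terminated and the request is parsed.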

conntrack — the kernel table that bites

On Linux, the netfilter conntrack subsystem tracks every active flow through a NAT/load-balancing host. Each row holds the flow's 5-tuple, NAT translation, and timeout. A single host's table has a hard ceiling (net.netfilter.nf_conntrack_max).

When that table fills, new connections silently fail. Not with a helpful error — packets are just dropped. Symptoms in production:

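  • p99 latency spikes for cold connections (a dropped SYN is only retried after the client's retransmit timer fires, typically about a second).
  • Kernel log (dmesg) lines reading nf_conntrack: table full, dropping packet. The message is rate-limited, so a handful of lines can stand for many dropped packets.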
  • TCP RST in random places.
  • Random services intermittently unreachable.

The defaults are sized for a desktop, not an LB. Production tuning:

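# Inspect:
sudo sysctl net.netfilter.nf_conntrack_max         # current ceiling
sudo sysctl net.netfilter.nf_conntrack_count       # current usage
sudo conntrack -C                                  # same count via conntrack-tools (cheaper than listing every flow)

# Bump (runtime only; add the sysctl to /etc/sysctl.d/ to survive a reboot):
sudo sysctl -w net.netfilter.nf_conntrack_max=1048576
# /proc/sys/net/nf_conntrack_max is a legacy alias of the same ceiling, so writing
# 524288 there would undo the line above. The separate knob is the hash-table size:
echo 524288 | sudo tee /sys/module/nf_conntrack/parameters/hashsize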

Raise the ceiling before the table fills, and tune the timeouts (nf_conntrack_tcp_timeout_*) so dead flows are reaped quickly. A full conntrack table is a recurring root cause of production outages on NAT and load-balancing hosts.
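Since a full table fails silently, it helps to alert on utilisation well before 100%. The sketch below reads the two counters from /proc and classifies the ratio; the 80% warning threshold and the function names are assumptions for illustration, not kernel or vendor recommendations.

```python
from pathlib import Path

# Standard procfs location of the conntrack sysctls on Linux.
PROC_DIR = Path("/proc/sys/net/netfilter")


def read_int(path: Path) -> int:
    """Read a single integer from a procfs file."""
    return int(path.read_text().strip())


def utilization(count: int, ceiling: int) -> float:
    """Fraction of the conntrack table currently in use."""
    if ceiling <= 0:
        raise ValueError("ceiling must be positive")
    return count / ceiling


def status(count: int, ceiling: int, warn_at: float = 0.8) -> str:
    """Classify usage; warn_at=0.8 is an assumed threshold, tune it per host."""
    used = utilization(count, ceiling)
    if used >= 1.0:
        return "FULL"
    return "WARN" if used >= warn_at else "OK"


def check_host() -> str:
    """Read the live counters (requires the nf_conntrack module to be loaded)."""
    count = read_int(PROC_DIR / "nf_conntrack_count")
    ceiling = read_int(PROC_DIR / "nf_conntrack_max")
    return f"{status(count, ceiling)} {count}/{ceiling}"
```

Calling `check_host()` from a cron job or metrics exporter on the LB host gives a one-line status such as a count over the ceiling with an OK/WARN/FULL prefix.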

Hashing modes for stickiness

When you need the same client to land on the same backend (sticky sessions, stateful uploads), the LB hashes some part of the request. Modes:

  • 5-tuple hash (L4 default): hashes the full (src_ip, src_port, dst_ip, dst_port, proto). Different connections from the same client will hash differently — useful for spreading, useless for stickiness across reconnects.
  • Source-IP hash (L4): hashes only src_ip. Same client → same backend across all connections. Breaks if the client is behind a NAT shared by many users (everyone hashes to the same backend).
  • Header hash (L7): hashes a specific HTTP header — Authorization, X-User-ID, a session cookie. The right tool for sticky sessions because it doesn't depend on the network topology.
  • Cookie-based stickiness (L7): the LB injects a cookie containing the backend ID; subsequent requests carry the cookie. Maximum control, but the cookie leaks the backend identifier.

Pick the mode that matches what you actually need: source-IP for naive stickiness, header hash for application-level stickiness, cookie for explicit affinity.
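The difference between the modes is only in what goes into the hash key. The sketch below uses SHA-256 and modulo placement over a hypothetical four-backend pool; real LBs use faster hashes and often consistent hashing, so treat this as an illustration of the key choice rather than of any product's algorithm.

```python
import hashlib

BACKENDS = ["be-0", "be-1", "be-2", "be-3"]  # hypothetical pool


def pick(key: str, backends=BACKENDS) -> str:
    """Map a key to a backend with a stable hash (Python's hash() is salted per process)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return backends[int.from_bytes(digest[:8], "big") % len(backends)]


def five_tuple_hash(src_ip, src_port, dst_ip, dst_port, proto) -> str:
    """L4 default: every new source port produces a new key."""
    return pick(f"{src_ip}|{src_port}|{dst_ip}|{dst_port}|{proto}")


def source_ip_hash(src_ip) -> str:
    """L4 stickiness: all connections from one IP (or one shared NAT) collide."""
    return pick(src_ip)


def header_hash(header_value) -> str:
    """L7 stickiness: the key is an application identity, e.g. a user ID header."""
    return pick(header_value)


def cookie_affinity(cookie_value, fallback_key) -> str:
    """Honour an LB-issued cookie naming a known backend; otherwise assign one."""
    if cookie_value in BACKENDS:
        return cookie_value
    return pick(fallback_key)
```

With modulo placement, changing the pool size remaps most keys, which is one reason a backend failure can break stickiness for clients that were never on the failed backend.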

Direct Server Return (DSR)

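The default L4 LB topology keeps the LB in the path in both directions. An L4 LB in NAT mode does not terminate the connection; it rewrites addresses on the way in and again on the way out:

Client → LB → Backend → LB → Client

Fine for HTTP APIs. Awful for high-bandwidth flows (video, large downloads): responses are usually far larger than requests, and every response byte has to cross the LB a second time on its way out, so the LB's NICs become the bottleneck.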

DSR rewrites only the destination MAC address; the destination IP stays as the LB's. Backends are configured with the LB's IP as a loopback alias (and must be set not to answer ARP for it, or they will fight the LB for the address) and reply directly to the client:

Client → LB (rewrites MAC) → Backend (sees its loopback IP, replies directly to client)
                                        ↓
Client ◀────────────────────── Backend (return path bypasses LB)

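The LB sees only the inbound half of the conversation; the return path goes straight from backend to client, so the LB's bandwidth needs scale with request size rather than response size. DSR-style return paths are used by GitHub's GLB Director and Cloudflare Magic Transit, and reportedly inside AWS NLB.

Limits: with MAC rewriting, backends and LB must be on the same L2 segment (you're rewriting at L2), so it doesn't work across subnets, let alone cloud regions. Tunnel-based variants (IP-in-IP or GUE encapsulation, which is how GLB Director does it) lift that restriction at the cost of encapsulation overhead. In either form the LB never sees responses, so it cannot act on them.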
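The packet handling in MAC-rewrite DSR can be modelled in a few lines. This is a toy model, not a packet processor: the `Frame` type, the addresses, and the function names are all hypothetical, and only the fields relevant to the argument are represented.

```python
from dataclasses import dataclass, replace

VIP = "203.0.113.10"               # hypothetical service IP advertised by the LB
CLIENT_IP = "198.51.100.7"         # hypothetical client
BACKEND_MAC = "02:00:00:00:00:b1"  # hypothetical backend NIC


@dataclass(frozen=True)
class Frame:
    dst_mac: str
    src_ip: str
    dst_ip: str


def lb_forward(frame: Frame, backend_mac: str) -> Frame:
    """DSR forwarding: change only the destination MAC; both IPs are untouched."""
    return replace(frame, dst_mac=backend_mac)


def backend_accepts(frame: Frame, local_ips: set) -> bool:
    """The backend keeps the packet only if the VIP is one of its own addresses."""
    return frame.dst_ip in local_ips


def reply_addresses(frame: Frame) -> tuple:
    """(src_ip, dst_ip) of the reply: sent from the VIP straight to the client."""
    return (frame.dst_ip, frame.src_ip)
```

The model also shows the configuration-drift failure: a backend whose `local_ips` lacks the VIP (loopback alias missing) fails `backend_accepts` and drops traffic that the LB forwarded correctly.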

Slow start

A freshly-restarted backend is cold:

  • Empty connection pool to the database.
  • JIT not warm.
  • In-memory caches empty.
  • DNS cache empty: resolutions that were cached for their TTL are lost on restart.

Sending it 100% of fair-share load the moment it joins the pool will tip it over: response times spike, the LB marks it unhealthy, removes it, the surviving backends now take MORE traffic, the cycle repeats.

Slow start ramps a new backend's effective weight from a small fraction (for example 10%) to 100% over a window (typically 30-60 seconds):

  • NGINX: slow_start=30s on the server line (a commercial NGINX Plus feature).
  • HAProxy: slowstart 30s on the server line.
  • AWS ALB: slow_start.duration_seconds on the target group.
  • Envoy: slow_start_config in the cluster's round-robin or least-request LB config.

Turn it on. The cost is one fewer warm backend during a deploy; the benefit is no cold-start meltdown.
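The ramp itself is simple to express. The sketch below assumes a linear ramp from a 10% floor over a 30-second window; real implementations differ in curve shape and parameters, and the function and argument names here are hypothetical.

```python
def slow_start_weight(age_s: float,
                      window_s: float = 30.0,
                      full_weight: float = 100.0,
                      floor: float = 0.10) -> float:
    """Effective weight of a backend that joined the pool age_s seconds ago.

    Assumes a linear ramp from floor * full_weight up to full_weight.
    """
    if window_s <= 0 or age_s >= window_s:
        return full_weight            # ramp finished (or slow start disabled)
    if age_s <= 0:
        return full_weight * floor    # just joined: smallest share
    fraction = floor + (1.0 - floor) * (age_s / window_s)
    return full_weight * fraction
```

Under these assumptions a backend halfway through the window carries roughly 55% of its fair share, which gives connection pools and caches time to fill before full load arrives.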

Health checks — what good looks like

Three properties:

  1. Cheap: it runs every 5 seconds from each of N LBs against each of M backends. With 10 LBs and 200 backends that is 2 checks per second per backend and 400 per second across the fleet. At 100 ms per check, each backend spends about 20% of one worker's time on health checks; at 500 ms it is a full worker doing nothing else.
  2. Honest: it touches the actual dependencies. A /healthz that returns 200 even when the database is down is a lie. The right shape: SELECT 1 on the DB, PING on the cache, return 200/503 based on the result.
  3. Doesn't lie about itself: it distinguishes "unhealthy because my dependency is down" (removing me from the pool helps) from "unhealthy because I'm overloaded" (removing me makes the others MORE overloaded). Some stacks split the signal into /readyz (should this instance receive traffic right now?) and /livez (is the process itself alive, or does it need a restart?), so that an overloaded-but-alive instance is not treated like a dead one.
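An honest check can be sketched as a function that runs one probe per dependency and derives the status code from the results. The probe callables are assumptions standing in for a real SELECT 1 and cache PING; wiring them to an actual driver and HTTP framework is left out.

```python
from typing import Callable, Dict, Tuple

Probe = Callable[[], bool]


def readiness(probes: Dict[str, Probe]) -> Tuple[int, Dict[str, str]]:
    """Run every dependency probe; return (HTTP status, per-dependency result)."""
    results: Dict[str, str] = {}
    for name, probe in probes.items():
        try:
            ok = bool(probe())
        except Exception:
            # A probe that raises (timeout, refused connection) counts as a failure.
            ok = False
        results[name] = "ok" if ok else "fail"
    status = 200 if all(r == "ok" for r in results.values()) else 503
    return status, results
```

Returning the per-dependency map alongside the status keeps the check cheap for the LB, which only reads the code, while giving an operator the reason for a 503. Each probe should carry its own short timeout so a hung dependency cannot make the check slow.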

What can go wrong, summarised

  • Conntrack-full drops packets silently.
  • Cold-start meltdown when slow start is off.
  • Sticky-session loss when a backend dies and the client gets bound to a new backend mid-session.
  • DSR breakage when backend configuration drifts (loopback alias missing).
  • Async health checks lying — health endpoint returns 200 from a static check while the real handler can't reach the DB.
  • Keepalive (idle-timeout) mismatch between LB and backend: the LB still considers a pooled connection alive after the backend has timed it out, so the next request is sent into a void.

Each of these is a distinct lesson in the operations folder of someone's run-book. Knowing them all is what separates "I configure an LB" from "I run an LB".

What you should remember

L4 sees packets and 5-tuples; L7 sees HTTP. The kernel tracks every flow in conntrack — bump the ceiling. Use header-hash stickiness, not source-IP. Turn on slow start. Make health checks cheap and honest. Configure DSR if your throughput justifies it.

The "load balancing" you encounter as a developer is the algorithm. The "load balancing" you encounter as a sysadmin is everything in this file.

Tools in the wild

  • HAProxy (free tier; service)

    Mature open-source L4/L7 LB; the choice when raw throughput matters.

  • Envoy (free tier; service)

    Modern proxy with xDS dynamic config; the data plane behind Istio + Consul Connect.

  • Managed L4 LB: flow-tuple hash, source-ip stickiness, DSR-style mode. (service)

  • Anycast L4 LB across Cloudflare's network. (service)

  • Kubernetes' L4 LB for ClusterIP services. IPVS mode handles 100k+ rules. (library)