Rate Limiting Algorithms
Prompt
You need to cap an API at roughly N requests per second per client. Walk me through how a token bucket decides whether to admit a request, how that differs from a leaky bucket, and which one you'd pick if clients are bursty.
How this round runs
This is a conversation that escalates — I'll keep asking "and what happens when…" until I find the edge of what you know. I want the admission mechanism for each algorithm and the trade-off between them, then I'll push it into the distributed case where the interesting failures live.
Model answer
I'll describe the admission rule for each, then the trade-off, then the consequence that
picks between them. Token bucket: a bucket holds up to capacity tokens and refills at a
steady rate, say r per second. Each request removes a token; if the bucket is empty the
request is rejected (or queued). The key behavior is that idle time accumulates tokens up
to the cap, so a client that's been quiet can spend a burst all at once — admission is
"average rate r, burst up to capacity."
Leaky bucket inverts that: requests go into a queue drained at a fixed rate, so output is perfectly smooth and bursts are flattened — they wait in the queue or overflow and drop. So the trade-off is burst tolerance vs. smoothness: token bucket admits bursts up to its cap (good when traffic is naturally spiky and the downstream can absorb a short surge); leaky bucket guarantees a constant outflow (good when the downstream has a hard throughput ceiling you must never exceed). For bursty clients on an API I'd reach for token bucket — it lets a client that was idle catch up without me having to over-provision, while still bounding the long-run average.
The honest edge is where this stops being a single-process problem. In a distributed fleet, each node enforcing its own bucket means the real global limit is roughly N × the node count, so the counter has to live in a shared store (Redis) and the decrement has to be atomic — a read-then-write race lets two nodes both admit on the last token. And the refill math depends on time, so clock skew between nodes, or a coarse window, smears the boundary. I can reason about those; the exact tuning (sliding-window-log vs. sliding-window-counter to kill the fixed-window edge spike) is where I'd benchmark rather than assert.
- Gave the admission rule for each (token: refill + spend; leaky: fixed-rate drain) before comparing
- Named the precise trade-off: burst tolerance up to capacity vs. perfectly smooth output
- Picked token bucket for bursty clients with a reason tied to a consequence, not a preference
- Escalated unprompted to the distributed case: per-node buckets multiply the real limit
- Honest boundary: atomic decrement in a shared store and clock skew, deferring exact window-algorithm tuning to benchmarking
- And what happens when you run this across ten API nodes each with its own bucket? → the effective limit becomes ~N×10; move the counter to a shared store with an atomic decrement
- Two nodes read 'one token left' at the same instant — what stops a double-admit? → the check-and-decrement must be atomic (Lua script / INCR-with-expiry), not read-then-write
- Why does a naive fixed window let a client get ~2× the limit at the boundary? → it resets at a hard edge; a sliding window or token bucket smooths it