← Bank
System Design

Design a Rate Limiter

System DesignMid~40m
system-designalgorithmsdistributed

Prompt

Our public API is getting hammered — a couple of clients are sending bursts that knock over a downstream service, and one customer's runaway script ran up a huge bill last week. Product wants us to "limit how much any one client can call us." Design the thing that does that.

How this round runs

The brief says "limit how much" and stops there — it doesn't say how much, per what, single box or fleet, or what should happen to a request that's over the line. You drive: surface those, then design it, and I'll make you pick one algorithm and defend its cost, and I'll inject a failure once you have a design.

Model answer

1. Requirements I'd surface first.

  • Limit per what? Per API key, per IP, per (key, endpoint)? It matters: per-IP punishes everyone behind a NAT; per-key is what a billing-abuse story actually wants. I'd go per API key, optionally per-endpoint for expensive routes.
  • What's the limit and shape? "100/min steady" and "1000/min steady but no more than 50 in any second" are different products. I need to know if bursts are allowed.
  • Single node or a fleet? This is the whole crux. One process can hold counters in memory; a fleet of N gateway nodes cannot — a per-node count of "100" means the real limit is 100×N. I'd assume a fleet, because that's the only version that's hard.
  • What happens on reject? 429 with a Retry-After and X-RateLimit-* headers so well-behaved clients back off instead of hot-looping.
  • Fail open or closed? If the limiter's datastore is down, do we let traffic through (protect availability) or block it (protect the downstream)? A real decision I'd force product to make.

Non-functional: the check is on every request, so it must add single-digit-ms latency and itself never become the bottleneck or single point of failure.

2. High-level design.

client --> API gateway (limiter middleware) --> upstream service
                  |
            shared counter store (Redis), keyed by api-key

The gateway runs the limiter as middleware; the counters live in a shared store (Redis) so all gateway nodes see one global count per key. The store has to support an atomic read-modify-write — that's the part people skip.

3. Deep-dive: the distributed-counter race, and the algorithm. Naively, each node does GET count; if under limit, INCR. Two nodes interleave that between the GET and the INCR and both pass the check — the limit leaks. So the check-and-increment must be atomic at the store. I'd commit to a token bucket evaluated atomically in a single Redis Lua script (or, for a fixed window, an INCR whose EXPIRE is set in the same atomic step — otherwise a crash between the INCR and the EXPIRE leaks a TTL-less key that blocks that caller forever): the script reads tokens, refills based on elapsed time, decrements, and writes back as one indivisible operation, so concurrent requests across nodes can't both spend the last token.

Algorithm choice, committed: token bucket. Why and what it costs —

  • Token bucket allows controlled bursts (good: real clients are bursty) and is cheap (two numbers per key: tokens, last-refill). Cost: it permits a burst up to the bucket size, so it does not enforce a strict "never more than N in any window" — a client can drain the bucket instantly.
  • I'd reject fixed window as the primary because of its boundary problem: 100 at 59.9s and 100 at 60.1s is 200 requests in 0.2s — a thundering herd at every window edge.
  • Sliding-window log is the strictest but stores a timestamp per request — too much memory at scale. Sliding-window counter is the middle ground I'd mention as the upgrade if strictness matters more than the burst tolerance.

4. A committed trade-off and its cost. Token bucket, and the cost I accept out loud: I am explicitly allowing short bursts above the steady rate, which means a client can briefly exceed the nominal per-second rate up to the bucket size. For an API protecting a downstream that can absorb brief spikes that's the right trade — burst tolerance keeps legitimate bursty clients happy. If the downstream genuinely cannot take a burst, I'd shrink the bucket toward 1 (approaching a strict rate) and pay the cost of rejecting legitimate bursts.

5. Operational concerns. Failure mode you're about to hand me: Redis goes down. Now every request can't reach the counter. I'd detect it on limiter-store error rate and decide fail-open vs fail-closed per the product call from step 1 — for a public API I'd fail open with a per-node local fallback bucket (so we still shed the worst abusers using stale per-node limits) rather than 500-ing every request and taking the whole API down with the limiter. I'd alert loudly because fail-open means limits are temporarily soft. Rollback: limiter is config-driven (limits in config, not code), so loosening or disabling a misbehaving rule is a config push, not a deploy.

Signals — what a strong answer shows
  • Surfaced limit-key, burst shape, single-vs-fleet, and reject behavior before designing
  • Identified the check-then-increment race across nodes and made it atomic at the store
  • Committed to token bucket and named its cost (it tolerates bursts, not a strict per-window cap)
  • Decided fail-open vs fail-closed deliberately instead of ignoring the limiter's own failure
  • Specified 429 + Retry-After + rate-limit headers so clients can back off
Follow-ups — where it goes next
  • 'Two requests hit different nodes at the same instant — how do you not double-spend the last token?' → atomic check-and-decrement in one Lua script, not GET-then-INCR
  • 'What breaks at 10x gateway nodes?' → the shared Redis becomes the hot spot; shard counters by key, or move to local buckets that sync periodically
  • 'Redis is down' → the fail-open/closed call, plus a per-node local fallback
  • 'Why token bucket not sliding-window log?' → the log is strictest but stores a timestamp per request; commit and own the memory trade