system-design · level 6

Rate Limiting

Token bucket, leaky bucket, sliding window — protecting services from overload.

150 XP

Rate Limiting

A rate limiter caps how fast a client (or a class of clients) can hit your service. Without one, a single misbehaving caller — or an honest caller with a runaway loop — can starve every other user. With a good one, your fleet stays healthy under any traffic shape, and abusive clients hit a clear, polite ceiling instead of taking down the system.

Why rate limiters exist

The point isn't punishment. It's three things, in order:

Protect the backend from being overwhelmed when input outpaces capacity.
Protect other users from one user (or one bot) hogging shared resources.
Surface a clear contract — clients know exactly what their budget is and how to back off.

A rate-limited 429 is infinitely better than a timeout. The client can react, retry intelligently, or buy a higher-tier plan. A timeout teaches the client nothing.

The four algorithms you must know

Token bucket

A bucket holds up to C tokens. Tokens refill at rate R per second. Every request consumes 1 token. If the bucket has zero, the request is rejected.

capacity C = 50, refill R = 10/s
  bucket: [#####.....] = 5 tokens
  request → consume 1 → 4 tokens left
  no request for 1s → +10 tokens, capped at 50

Token bucket allows bursts up to capacity, then bleeds at the steady rate. This matches how real clients behave — they wake up, fire a flurry of requests, then settle into steady state. AWS, GitHub, Stripe all use token-bucket variants.

It's the right default for almost every public API.

Leaky bucket

A queue with a constant drain rate. Requests enter the queue; the system services them at a fixed pace. If the queue is full, new requests are dropped.

The output rate is perfectly smooth — no bursts ever, even if the input is bursty. Use this when downstream systems can't handle uneven load (legacy SAP integrations, slow third-party APIs, hardware controllers).

The trade-off: latency is unpredictable. A request that arrives when the queue is half full waits longer than one that arrives when the queue is empty.

Fixed window

The simplest algorithm. Count requests in the current window (e.g. minute), reset to zero at the boundary.

INCR rate:user:42:202604281137
EXPIRE rate:user:42:202604281137 60
if count > 100: reject

The flaw: a client can send 100 requests at 11:37:59 and another 100 at 11:38:00 — 200 requests in 2 seconds, both technically within the 100-per-minute limit. For most APIs, this edge-doubling is fine. For ones where 2× the limit briefly is genuinely dangerous, fixed-window is the wrong choice.

Sliding window (counter or log)

Keep a window that slides as time advances, not one that resets at boundaries.

Sliding-log: store a timestamp per request, drop entries older than the window. Most accurate, most memory-expensive (one entry per request).
Sliding-counter: weighted blend of the current and previous fixed-window counts. Cheap, correct enough for nearly every use case. The Cloudflare blog post on this is the reference.

Sliding window fixes the boundary-doubling bug at the cost of slightly more state per key. Worth it for paying-customer-facing APIs.

Distributed rate limiting

Single-process counters are easy. Across a fleet of N servers, you have a synchronisation problem: two requests on different boxes arriving at the same instant must agree on the global count.

The two patterns:

Redis INCR

Cheap, fast, racy. Each box does:

count = redis.INCR("rl:user:42:202604281137")
redis.EXPIRE("rl:user:42:202604281137", 60)   // first one wins
if count > 100: reject

This works for fixed-window counters where being slightly over (~1%) on the boundary is fine.

Lua script (atomic CAS)

For token bucket or sliding window, you need atomic read-modify-write. Redis' Lua scripting runs the script as a single atomic operation:

local key = KEYS[1]
local cap = tonumber(ARGV[1])
local rate = tonumber(ARGV[2])
local now = tonumber(ARGV[3])
local bucket = redis.call('HMGET', key, 'tokens', 'ts')
local tokens = tonumber(bucket[1]) or cap
local ts = tonumber(bucket[2]) or now
local delta = math.max(0, now - ts) / 1000
tokens = math.min(cap, tokens + delta * rate)
if tokens < 1 then return 0 end
tokens = tokens - 1
redis.call('HMSET', key, 'tokens', tokens, 'ts', now)
return 1

This is what redis-cell, node-rate-limiter-flexible, and most production rate limiters do under the hood. The script runs in tens of microseconds and gives you correct token-bucket semantics across an arbitrary fleet.

What to limit on

Three keys, applied as a stack:

Per IP — coarse, defends against floods from a single source. Easy to defeat (rotating proxies) but cheap.
Per user / per API key — the right granularity for paying customers. Different tiers can have different buckets.
Per route — the expensive endpoint (/search) gets a tighter limit than the cheap one (/healthz).

In practice you stack them: a request must pass IP-level, user-level, and route-level limiters. Any one rejecting → 429.

The headers

When you reject (or even when you accept), tell the client where they stand:

HTTP/1.1 429 Too Many Requests
RateLimit-Limit: 100
RateLimit-Remaining: 0
RateLimit-Reset: 23
Retry-After: 23

429 Too Many Requests — the canonical status. Distinct from 503 Service Unavailable, which means you are overloaded; 429 means they exceeded their share.
Retry-After: 23 — seconds until they can try again. Polite clients will respect it; libraries (axios, requests-with-retries, AWS SDK) read it automatically.
RateLimit-* headers (an IETF draft) — let well-behaved clients see their budget without trial-and-error.

A rate limiter without these headers is hostile UX.

429 vs 503

A common confusion:

429 Too Many Requests — the client is over their share. Other clients are still being served normally. Throttle yourself.
503 Service Unavailable — the server is overloaded. Everyone is being throttled. The fleet itself is in trouble.

Get this right and your monitoring dashboards make sense. A 429 spike means a noisy neighbour. A 503 spike means a capacity problem.

Burst handling — why it matters

Real clients aren't smooth. A mobile app wakes up after a notification, fires 30 requests in 200ms to refresh the feed, then idles for a minute. A fixed leaky bucket would reject most of those even though the long-run rate is well under quota.

Token bucket handles this gracefully:

Bucket capacity 50, refill 10/s.
Long idle → bucket fills to 50.
Burst of 30 → all 30 pass, bucket drops to 20.
Steady state at 5 req/s → bucket refills, no rejections.

The burst capacity is the UX. Set it too low and you reject legitimate traffic; set it too high and you defeat the purpose.

What can go wrong

Counters that never expire. A fixed-window counter without EXPIRE accumulates one Redis key per minute per user forever. Always set the TTL.

Clock skew. Token-bucket math depends on now(). If two boxes disagree by 5 seconds, one will think the bucket has 50 extra tokens. Use the Redis server's clock (TIME), not the app server's.

Hot-key contention. Every request from one celebrity user hits the same Redis key. Shard by user-id within the rate-limit cluster, or move to a near-cache layer with eventual reconciliation.

Race on first request. Two boxes both INCR a freshly-expired key, both get count=1, both pass — but they should have shared the limit. Lua-script the read-modify-write.

Permissive clients ignoring Retry-After. Some libraries retry immediately. Combine 429 with a denyset on the rejecting LB layer if you have abusive clients (NGINX limit_req, AWS WAF rate-based rules).

Picking an algorithm

A reasonable progression:

Start with fixed-window + Redis INCR at the gateway. Trivial, handles 95% of cases.
Move to token bucket via Lua script when bursts matter (mobile clients, paying customers).
Reach for sliding window when the fixed-window edge-doubling is genuinely a problem (security-sensitive APIs).
Add leaky bucket in front of fragile downstream systems that genuinely cannot handle bursts.

Avoid the trap of building a "hybrid sliding-window-with-token-bucket-burst-credits-per-route" until you have evidence you need it.

Common tools in production

Redis — the canonical backend. INCR, EXPIRE, Lua scripts. redis-cell ships token-bucket as a single command.
Envoy rate-limit service — gRPC service that Envoy/Istio call out to; centralised limits across a service mesh.
AWS WAF rate-based rules — IP-based limiting at the edge, in front of the LB. Cheap protection from scrapers.
Kong / Tyk / Apigee — API gateways with built-in rate-limit plugins (local, cluster, Redis-backed).
NGINX limit_req — leaky-bucket directive that ships in OSS NGINX. Surprisingly capable.

Diagram conventions

A rate limiter sits between the LB and the service:

Client ─▶ LB ─▶ [Rate Limit] ─▶ Service
                  │
                  └─▶ Redis (counters / token buckets)

In Mermaid:

flowchart LR
    C[Client] --> LB
    LB --> RL[Rate Limit<br/>token-bucket]
    RL -->|allow| Svc[Service]
    RL -->|429| C
    RL -.-> R[(Redis)]

That's the diagram you want in your head every time someone asks "what protects this endpoint from abuse?"

Tools in the wild

4 tools

Redisfree tier
The canonical distributed rate-limit backend. INCR + EXPIRE, or Lua scripts for atomic token buckets.
service
Envoy rate-limit servicefree tier
Centralised rate-limit gRPC service used by Envoy/Istio service meshes.
service
AWS WAF rate-based rules
L7 IP-based rate limiting at the edge — protect origin from scraper/bot floods.
service
Kong rate-limiting pluginfree tier
API-gateway plugin with local, cluster, or Redis-backed counters.
service