system-design · level 7

Idempotency & Retries

Idempotency keys, exponential backoff with jitter, circuit breakers.

Distributed systems lie. Networks drop packets, peers crash mid-write, ACKs vanish silently. Every long-running production service is built around the assumption that the same request might be delivered more than once. The two patterns that turn that assumption from terrifying to boring are idempotency and disciplined retries.

Get them right and your system shrugs off transient failure. Get them wrong and you double-charge customers, send the same email three times, or thunder-herd a recovering backend into a fresh outage.

The exactly-once illusion

You can't have exactly-once delivery in a distributed system. The proof is short: any acknowledgement can itself be lost, so the sender can't know whether the receiver got the message. So real systems run at-least-once delivery on the wire and make the consumer idempotent — same effect whether the message arrives once or seven times.

That combination — at-least-once transport + idempotent consumer — is what every "exactly-once" claim in marketing material actually means. Kafka exactly-once, AWS SQS FIFO, Stripe payments — all built on this pattern.

Idempotency keys

The standard solution: the client generates a unique ID and sends it on the request. The server stores (key → response) for some TTL. Identical retries return the cached response without re-running the work.

POST /v1/charges
Idempotency-Key: 9f8b7e2c-1a4d-4e6f-9b8a-7c1d2e3f4a5b
Content-Type: application/json
{ "amount": 1999, "currency": "usd", "source": "tok_visa" }

→ 200 OK { "id": "ch_abc", ... }

# Network drops the response. Client retries with the same key.
POST /v1/charges
Idempotency-Key: 9f8b7e2c-1a4d-4e6f-9b8a-7c1d2e3f4a5b
{ "amount": 1999, "currency": "usd", "source": "tok_visa" }

→ 200 OK { "id": "ch_abc", ... }    # SAME response, no double charge

Stripe's API keeps idempotency keys for at least 24 hours: long enough that a client crashing and restarting still hits the cache, short enough that storage doesn't grow unbounded.

The race that bites people: two requests with the same key arrive simultaneously, before either has stored a response. Solve it with SET NX (set if not exists) on a lock key around the work, so that only one request runs the handler; the other waits or gets a 409.
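A minimal sketch of the key → response cache and the lock around the handler. It assumes an in-memory store standing in for Redis or a database; `IdempotencyStore`, `run_once` and `IdempotencyConflict` are hypothetical names chosen for illustration, and TTL expiry is left out to keep it short.

```python
import threading
from typing import Any, Callable, Dict, Set


class IdempotencyConflict(Exception):
    """Another request with the same key is still running (an HTTP 409 in an API)."""


class IdempotencyStore:
    """In-memory stand-in for a shared store; TTL expiry is omitted in this sketch."""

    def __init__(self) -> None:
        self._guard = threading.Lock()
        self._in_flight: Set[str] = set()
        self._responses: Dict[str, Any] = {}

    def run_once(self, key: str, handler: Callable[[], Any]) -> Any:
        with self._guard:
            if key in self._responses:
                return self._responses[key]      # retry: replay the stored response
            if key in self._in_flight:
                raise IdempotencyConflict(key)   # concurrent duplicate
            self._in_flight.add(key)             # plays the role of a successful SET NX
        try:
            response = handler()                 # the work runs outside the guard
            with self._guard:
                self._responses[key] = response
            return response
        finally:
            with self._guard:
                self._in_flight.discard(key)     # release the key even if the handler raised


charges = []


def charge() -> dict:
    charges.append(1999)
    return {"id": "ch_abc"}


store = IdempotencyStore()
first = store.run_once("9f8b7e2c", charge)
retry = store.run_once("9f8b7e2c", charge)   # same response, handler not re-run
```

Because the key is released when the handler raises, a failed attempt does not poison the key: the next retry runs the handler again.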

Server-side dedupe

Sometimes the client can't supply a key (legacy systems, third-party webhook callers). Then the server has to dedupe. Two patterns:

Hash the canonical body: compute SHA-256 of the normalised request body plus a few header fields. Identical hash within a TTL → duplicate. Cheap, but it works only if retries produce the same body after normalisation.
Transactional outbox: when the API handler writes to its database, it also writes a row to an outbox table in the same transaction. A separate worker reads the outbox and publishes to downstream systems, marking each row processed=true once it is published. The flag stops rows from being re-published on later passes, but a worker that crashes between publishing and marking will publish that row again, so downstream consumers still dedupe on the row's ID.

Server-side dedupe is strictly worse than client-supplied keys (the server can't know if two requests are the same logical operation if the bodies happen to differ). Reach for it only when the client can't help.
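The body-hash pattern can be sketched as follows. This is an illustration under assumptions: the body is JSON, normalisation means sorted keys and compact separators, and `x-account-id` is a made-up header chosen to scope the hash per caller. `request_fingerprint` and `DedupeWindow` are hypothetical names.

```python
import hashlib
import json
import time
from typing import Callable, Dict, Mapping, Sequence


def request_fingerprint(
    body: Mapping,
    headers: Mapping[str, str],
    header_fields: Sequence[str] = ("x-account-id",),  # assumed header, for illustration
) -> str:
    """SHA-256 over the normalised JSON body plus selected header fields."""
    canonical_body = json.dumps(body, sort_keys=True, separators=(",", ":"))
    lowered = {name.lower(): value for name, value in headers.items()}
    selected = [f"{name}={lowered.get(name, '')}" for name in header_fields]
    material = "\n".join([canonical_body, *selected])
    return hashlib.sha256(material.encode("utf-8")).hexdigest()


class DedupeWindow:
    """Remembers fingerprints for `ttl_seconds`; in-memory, single process only."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._seen: Dict[str, float] = {}

    def is_duplicate(self, fingerprint: str) -> bool:
        now = self._clock()
        # Forget fingerprints older than the TTL before checking.
        self._seen = {fp: t for fp, t in self._seen.items() if now - t < self._ttl}
        if fingerprint in self._seen:
            return True
        self._seen[fingerprint] = now
        return False
```

Two webhook deliveries whose JSON differs only in key order produce the same fingerprint; a delivery with a regenerated timestamp field does not, which is the weakness the paragraph above describes.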

When you don't need any of this

Idempotency is only interesting for state-mutating operations. Pure reads (GET /users/42) are idempotent by definition. Commutative operations such as INCR view_count are not idempotent, because a retried increment counts twice; you can skip the key only if you tolerate the count being slightly off. Most of the time you don't tolerate it, and you do want a key.

Operations that aren't naturally idempotent and that you don't bother to make idempotent are the ones that bite. "Send notification" — without a key, a retry sends the email twice. "Charge card" — without a key, you double-bill. "Create user" — without a key, you have two users with the same email.

The rule: anything that costs the user money, sends them a message, or creates a record they'll see — needs an idempotency key.

Retries: the algorithm matters

The naive approach — retry once, immediately, on any error — is the worst possible behaviour. It doubles the transient load on a struggling backend at the moment it's least able to handle it.

The right approach is exponential backoff with jitter.

Exponential backoff

Wait progressively longer between retries:

attempt 1: wait 1s  → retry
attempt 2: wait 2s  → retry
attempt 3: wait 4s  → retry
attempt 4: wait 8s  → retry
attempt 5: wait 16s → retry

Total elapsed waiting: 1+2+4+8 = 15 seconds before the 5th attempt. Each retry gives the failing system more time to recover, and limits the number of retries you'll fire in any short window.
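The schedule above is a geometric sequence, which a few lines can reproduce. The `cap` parameter is an addition not present in the table: an assumed upper bound so that late retries don't wait unboundedly long.

```python
from typing import List


def backoff_delays(retries: int, base: float = 1.0, factor: float = 2.0, cap: float = 60.0) -> List[float]:
    """Wait before each retry: base * factor**n, never more than `cap` seconds."""
    return [min(cap, base * factor ** n) for n in range(retries)]


delays = backoff_delays(5)                  # [1.0, 2.0, 4.0, 8.0, 16.0]
waited_before_fifth = sum(delays[:4])       # 15.0 seconds, matching the total above
```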

Why pure exponential isn't enough

Imagine a brief outage at t=0. 1000 clients all see the failure and start retrying. Without jitter, every client retries at exactly t=1, then t=3 (1+2), then t=7 (1+2+4), then t=15...

This is a retry storm. The recovering backend is hit by 1000 simultaneous requests at every backoff boundary. It fails again, every client backs off again to the same coordinated schedule, the cycle repeats until something gives up or the schedule drifts apart by chance.
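To put numbers on the storm, the sketch below computes the shared retry instants and then compares peak load for the first retry: 1000 clients in lockstep versus the same clients spreading that retry uniformly over the one-second window (the fix described under Jitter). The 0.1-second bucket size and the helper names are assumptions for illustration.

```python
import random
from collections import Counter
from itertools import accumulate
from typing import List


def retry_instants(delays: List[float]) -> List[float]:
    """Seconds after the outage at which a client with fixed delays retries."""
    return list(accumulate(delays))


def peak_requests(instants: List[float], bucket: float = 0.1) -> int:
    """Largest number of requests landing in any single `bucket`-second slice."""
    return max(Counter(int(t / bucket) for t in instants).values())


shared_schedule = retry_instants([1, 2, 4, 8])      # [1, 3, 7, 15] for every client

rng = random.Random(42)                             # seeded so the run is repeatable
clients = 1000
in_lockstep = [1.0 for _ in range(clients)]         # everyone retries at t=1
spread_out = [rng.uniform(0.0, 1.0) for _ in range(clients)]

lockstep_peak = peak_requests(in_lockstep)          # all 1000 land in one slice
spread_peak = peak_requests(spread_out)             # roughly a tenth of that per slice
```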

Jitter

Add randomness so the retries spread out:

Equal jitter: wait = backoff/2 + random(0, backoff/2). Each retry happens in the second half of its window.
Full jitter: wait = random(0, backoff). Each retry happens uniformly anywhere in [0, backoff].
Decorrelated jitter: wait = min(cap, random(base, prev_wait * 3)). Each wait is drawn from a range that grows with the previous wait rather than with the attempt number; it is one of the variants compared in AWS's backoff analysis.

The Marc Brooker / AWS analysis is unambiguous: full jitter is strictly better than no jitter for cascading-failure prevention. Equal jitter is fine. Anything without jitter is dangerous.
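The three formulas translate almost directly into code. This is a sketch, assuming a base of 1 second and a 60-second cap (neither value comes from the text); the `rng` parameter exists only so a seeded generator can be passed in for tests.

```python
import random


def exponential(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Un-jittered backoff for a 0-based attempt number, bounded by `cap`."""
    return min(cap, base * 2 ** attempt)


def equal_jitter(attempt: int, base: float = 1.0, cap: float = 60.0, rng=random) -> float:
    """Half the backoff is fixed, the other half is random."""
    backoff = exponential(attempt, base, cap)
    return backoff / 2 + rng.uniform(0, backoff / 2)


def full_jitter(attempt: int, base: float = 1.0, cap: float = 60.0, rng=random) -> float:
    """Uniform anywhere between 0 and the backoff."""
    return rng.uniform(0, exponential(attempt, base, cap))


def decorrelated_jitter(prev_wait: float, base: float = 1.0, cap: float = 60.0, rng=random) -> float:
    """Next wait depends on the previous wait, not on the attempt number."""
    return min(cap, rng.uniform(base, prev_wait * 3))


wait = 1.0                                   # decorrelated jitter starts from `base`
for attempt in range(4):
    wait = decorrelated_jitter(wait)
    print(attempt, round(full_jitter(attempt), 2), round(wait, 2))
```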

When NOT to retry

Some failures should never be retried because retrying can't possibly help — and might make things worse:

400 Bad Request: the request is malformed. Sending it again won't fix the malformation.
401 Unauthorized: your credentials are missing or invalid. Retrying with the same credentials won't help; refresh them first.
403 Forbidden: you're not allowed. Same answer next time.
404 Not Found: the resource doesn't exist. (Unless you have evidence of read-after-write lag.)
409 Conflict: the request conflicts with current state. Resending it unchanged gets the same answer; re-read the state and decide what to do, or wait if the conflict is an in-flight request with the same idempotency key.

Status codes that ARE worth retrying:

408 Request Timeout: the server gave up waiting for your request to arrive in full. Try again.
429 Too Many Requests: rate limited. Honour Retry-After.
502 / 503 / 504: server-side transient. Try again with backoff.
Other 5xx: usually transient (a plain 500 often is), but 501 Not Implemented and 505 HTTP Version Not Supported will fail the same way every time.

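A retry library that retries on every status is a footgun. Mature libraries (the AWS SDKs, axios-retry, urllib3's Retry) limit retries to a short list of conditions such as network errors, throttling, and transient 5xx responses. The exact defaults differ between them, so check what yours does before relying on it, and retry a non-idempotent request only when it carries an idempotency key.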
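The two lists above reduce to a small allow-list plus a Retry-After check. The sketch below is one reading of them, not the behaviour of any particular library: it treats a plain 500 as retryable, and it parses only the delta-seconds form of Retry-After (the HTTP-date form is ignored). The function names are hypothetical.

```python
from typing import Mapping, Optional

# Assumption: 500 is treated as transient; 501 and 505 are deliberately excluded.
RETRYABLE_STATUSES = frozenset({408, 429, 500, 502, 503, 504})


def should_retry(status: int) -> bool:
    """True only for statuses where sending the same request again can help."""
    return status in RETRYABLE_STATUSES


def retry_after_seconds(headers: Mapping[str, str]) -> Optional[float]:
    """Delta-seconds value of Retry-After, or None if absent or in HTTP-date form."""
    value = next((v for k, v in headers.items() if k.lower() == "retry-after"), None)
    if value is None or not value.strip().isdigit():
        return None
    return float(value.strip())


def next_wait(status: int, headers: Mapping[str, str], backoff: float) -> Optional[float]:
    """Seconds to wait before retrying, or None when the request should not be retried."""
    if not should_retry(status):
        return None
    hinted = retry_after_seconds(headers)
    # Never retry sooner than the server asked for.
    return backoff if hinted is None else max(backoff, hinted)
```

For example, `next_wait(429, {"Retry-After": "30"}, backoff=2.0)` waits the 30 seconds the server requested, while `next_wait(404, {}, backoff=2.0)` returns `None`.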

Circuit breakers

Even disciplined retry has a flaw: if a backend is durably down (not transient), every client retries until exhausted, generating a wave of pointless load. The circuit breaker fixes this with a state machine.

closed (passing)  ──fail count threshold──▶  open (failing fast, no calls)
                                                   │
                                            timeout elapsed
                                                   ▼
                                          half-open (one trial call)
                                                 │  │
                                            success │  │ failure
                                                 ▼  ▼
                                              closed  open

Closed: traffic flows. Each failure increments a counter. When the counter exceeds N consecutive failures (or a failure rate threshold), trip to open.
Open: no traffic flows. Every call returns an immediate error without hitting the backend. After a configured timeout (often 30s), transition to half-open.
Half-open: let exactly one trial call through. If it succeeds, transition to closed. If it fails, go back to open.

The half-open state is what distinguishes a circuit breaker from a simple killswitch — it's the recovery probe.
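The state machine is small enough to sketch in full. This version trips on consecutive failures, takes an injectable clock so the timeout can be tested, and is single-threaded: a real breaker needs a lock so that only one caller becomes the half-open probe. The class and exception names are hypothetical.

```python
import time
from typing import Any, Callable


class CircuitOpenError(Exception):
    """The breaker is open: the call was rejected without touching the backend."""


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 5,
        reset_timeout: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self.state = "closed"
        self._failures = 0
        self._opened_at = 0.0

    def call(self, func: Callable[..., Any], *args: Any, **kwargs: Any) -> Any:
        if self.state == "open":
            if self._clock() - self._opened_at < self.reset_timeout:
                raise CircuitOpenError("failing fast")
            self.state = "half-open"            # timeout elapsed: allow one trial call
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_success(self) -> None:
        self.state = "closed"                   # a success closes the circuit
        self._failures = 0                      # and resets the consecutive count

    def _on_failure(self) -> None:
        self._failures += 1
        # A failed probe reopens immediately; otherwise wait for the threshold.
        if self.state == "half-open" or self._failures >= self.failure_threshold:
            self.state = "open"
            self._opened_at = self._clock()
```

Usage is `breaker.call(fetch_user, 42)` around each outbound call, where `fetch_user` stands for whatever RPC is being protected.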

In production: resilience4j (Java), opossum (Node), Polly (.NET), gobreaker (Go). Wrap any RPC: run the retry loop while the breaker is closed, let a single probe through when it is half-open, and fail fast when it is open.

Tying it all together

A request that hits a flaky backend looks like:

client → idempotency middleware → circuit breaker → retry-with-jitter → backend
            │                          │                  │
            └─ key in cache → return   └─ open → 503      └─ retryable status?
                                       └─ closed → call    └─ exponential backoff

Each layer composes:

Idempotency makes retries safe to perform.
Backoff with jitter makes retries safe for the recipient.
Circuit breaker stops retries from trying a hopeless backend.

Without idempotency, retries duplicate work. Without backoff, retries thunder-herd. Without a circuit breaker, clients keep hammering a backend that is down, wasting their own resources and denying it room to recover.
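The layering in the diagram can be expressed as three wrappers composed in the same order. This is a deliberately simplified sketch: the breaker has no half-open state or timeout, the cache is a plain dict, and `BackendUnavailable` is a hypothetical exception standing in for a retryable status such as 503.

```python
import random
import time
from typing import Any, Callable, Dict


class BackendUnavailable(Exception):
    """Hypothetical retryable failure, e.g. a 503 from the backend."""


class CircuitOpen(Exception):
    """The breaker rejected the call without trying the backend."""


def with_retries(call, max_attempts: int = 4, base: float = 0.1, cap: float = 2.0,
                 sleep: Callable[[float], None] = time.sleep):
    def wrapped(request: Any) -> Any:
        for attempt in range(max_attempts):
            try:
                return call(request)
            except BackendUnavailable:
                if attempt == max_attempts - 1:
                    raise                                   # retries exhausted
                sleep(random.uniform(0, min(cap, base * 2 ** attempt)))  # full jitter
    return wrapped


def with_breaker(call, failure_threshold: int = 3):
    state = {"failures": 0}

    def wrapped(request: Any) -> Any:
        if state["failures"] >= failure_threshold:
            raise CircuitOpen("failing fast")               # simplified: never half-opens
        try:
            result = call(request)
        except BackendUnavailable:
            state["failures"] += 1                          # one exhausted retry loop = one failure
            raise
        state["failures"] = 0
        return result
    return wrapped


def with_idempotency(call, cache: Dict[str, Any]):
    def wrapped(key: str, request: Any) -> Any:
        if key not in cache:
            cache[key] = call(request)                      # only successful results are cached
        return cache[key]
    return wrapped


def make_client(backend, cache: Dict[str, Any], sleep=time.sleep):
    """idempotency → circuit breaker → retry-with-jitter → backend."""
    return with_idempotency(with_breaker(with_retries(backend, sleep=sleep)), cache)
```

Because the breaker wraps the retry loop, it counts one failure per exhausted loop rather than one per attempt, which keeps a single flaky request from opening the circuit.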

What can go wrong

Idempotency key TTL too short. A 1-hour TTL with a client that retries after a 2-hour outage = duplicate work. 24h is a sensible default.

Idempotency key collisions. UUIDs are fine; sequential integers are not. Collisions silently merge unrelated requests.

Retrying non-idempotent operations. "Send SMS" without a key + retry = two SMS. The cost is delivered to the user.

Storing idempotency cache in process memory. A pod restart wipes it; the next retry duplicates. Use Redis or the database.

Naive retry inside a loop. If your code retries each call AND the caller retries the whole flow, you've multiplied retries (5 × 5 = 25 attempts). Push retries to one layer.

Circuit breaker too sensitive. Trip-on-1-failure means a single network hiccup opens the circuit. Trip-on-failure-rate (e.g. >50% in a 30s window) is more robust.
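A failure-rate trip condition over a sliding window might look like the sketch below. The 50% threshold and 30-second window come from the example above; the `min_calls` floor is an added assumption, there so that one failure out of one call does not count as a 100% failure rate. `FailureRateWindow` is a hypothetical name.

```python
import time
from collections import deque
from typing import Callable, Deque, Tuple


class FailureRateWindow:
    """Tracks call outcomes over a sliding time window and reports when to trip."""

    def __init__(
        self,
        window_seconds: float = 30.0,
        threshold: float = 0.5,
        min_calls: int = 10,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._window = window_seconds
        self._threshold = threshold
        self._min_calls = min_calls
        self._clock = clock
        self._events: Deque[Tuple[float, bool]] = deque()   # (timestamp, failed)

    def record(self, failed: bool) -> None:
        now = self._clock()
        self._events.append((now, failed))
        self._evict(now)

    def should_trip(self) -> bool:
        self._evict(self._clock())
        total = len(self._events)
        if total < self._min_calls:
            return False                        # too little traffic to judge
        failures = sum(1 for _, failed in self._events if failed)
        return failures / total > self._threshold

    def _evict(self, now: float) -> None:
        # Drop outcomes that have aged out of the window.
        while self._events and now - self._events[0][0] > self._window:
            self._events.popleft()
```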

Common tools in production

Stripe Idempotency-Key header: the reference pattern for HTTP APIs.
AWS SDK retries: built-in retry modes (standard and adaptive) that apply exponential backoff with jitter; check which mode your SDK version defaults to before relying on it.
resilience4j / opossum / Polly: circuit breaker libraries for the major language ecosystems.
Temporal / AWS Step Functions: workflow engines that own retries + idempotency for you. Reach for these when retry logic spans hours/days.
Kafka transactional producer + idempotent consumer: the canonical exactly-once-illusion for messaging.

Diagram conventions

The reliability stack between two services:

A ──▶ Idempotency ──▶ Circuit Breaker ──▶ Retry ──▶ B
       (dedupe key)    (closed/open)      (backoff
                                           + jitter)

In Mermaid:

sequenceDiagram
  A->>+B: POST (Idempotency-Key=K)
  B-->>-A: 200 OK
  Note over A: response lost
  A->>+B: Retry POST (Idempotency-Key=K)
  B-->>-A: 200 OK (cached)

Memorise the sequence. It's the answer to half the "how do you make this reliable?" interview questions.

Tools in the wild

Stripe Idempotency-Key header (spec): the reference idempotency-key pattern. 24h TTL, returns cached response on retry.
AWS SDK retry config (library): built-in retry modes, standard and adaptive, with exponential backoff and jitter.
resilience4j (library, free): Java circuit breaker, retry, bulkhead, time-limiter. Lightweight, modular, ubiquitous on the JVM.
opossum (library, free): Node.js circuit breaker. Wraps any async function; trips when the failure rate crosses a configured threshold.