Idempotency & Retries
Idempotency keys, exponential backoff with jitter, circuit breakers.
Distributed systems lie. Networks drop packets, peers crash mid-write, ACKs vanish silently. Every long-running production service is built around the assumption that the same request might be delivered more than once. The two patterns that turn that assumption from terrifying to boring are idempotency and disciplined retries.
Get them right and your system shrugs off transient failure. Get them wrong and you double-charge customers, send the same email three times, or thunder-herd a recovering backend into a fresh outage.
The exactly-once illusion
You can't have exactly-once delivery in a distributed system. The proof is short: any acknowledgement can itself be lost, so the sender can't know whether the receiver got the message. So real systems run at-least-once delivery on the wire and make the consumer idempotent — same effect whether the message arrives once or seven times.
That combination — at-least-once transport + idempotent consumer — is what every "exactly-once" claim in marketing material actually means. Kafka exactly-once, AWS SQS FIFO, Stripe payments — all built on this pattern.
Idempotency keys
The standard solve: the client generates a unique ID and sends it on the request. The server stores (key → response) for some TTL. Identical retries return the cached response without re-running the work.
POST /v1/charges
Idempotency-Key: 9f8b7e2c-1a4d-4e6f-9b8a-7c1d2e3f4a5b
Content-Type: application/json
{ "amount": 1999, "currency": "usd", "source": "tok_visa" }
→ 200 OK { "id": "ch_abc", ... }
# Network drops the response. Client retries with the same key.
POST /v1/charges
Idempotency-Key: 9f8b7e2c-1a4d-4e6f-9b8a-7c1d2e3f4a5b
{ "amount": 1999, "currency": "usd", "source": "tok_visa" }
→ 200 OK { "id": "ch_abc", ... } # SAME response, no double charge
Stripe's reference implementation uses a 24-hour TTL — long enough that a client crashing and restarting still hits the cache, short enough that storage doesn't grow unbounded.
The race that bites people: two requests with the same key arrive simultaneously. Solve it by taking a lock with SET NX (set if not exists) on a lock key before doing the work, so only one request runs the handler and the other waits or gets a 409.
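A minimal sketch of the key lookup and the lock, assuming an in-memory store as a stand-in for Redis. The names `IdempotencyStore`, `run_once`, and `InFlightError` are hypothetical, and the TTL is left out to keep the example short.

```python
import threading
from typing import Any, Callable, Dict, Set


class InFlightError(Exception):
    """Another request with the same key is still running (would map to HTTP 409)."""


class IdempotencyStore:
    """In-memory stand-in for a shared store; a real one needs Redis or a database, plus a TTL."""

    def __init__(self) -> None:
        self._responses: Dict[str, Any] = {}
        self._in_flight: Set[str] = set()
        self._mutex = threading.Lock()

    def run_once(self, key: str, handler: Callable[[], Any]) -> Any:
        with self._mutex:
            if key in self._responses:
                return self._responses[key]   # replay the cached response
            if key in self._in_flight:
                raise InFlightError(key)      # lost the race for the lock
            self._in_flight.add(key)          # plays the role of SET NX succeeding
        try:
            response = handler()              # the real work runs outside the mutex
            with self._mutex:
                self._responses[key] = response
            return response
        finally:
            with self._mutex:
                self._in_flight.discard(key)  # release the lock, success or failure
```

A failed handler releases the lock without caching anything, so a later retry with the same key runs the work again. Whether failures should be cached too is a design choice this sketch does not settle.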
Server-side dedupe
Sometimes the client can't supply a key (legacy systems, third-party webhook callers). Then the server has to dedupe. Two patterns:
- Hash the canonical body: compute SHA-256 of the normalised request body + a few header fields. Identical hash within a TTL → duplicate. Cheap; works only if the body is genuinely identical.
- Transactional outbox: when the API handler writes to its database, it also writes a row to an outbox table in the same transaction. A separate worker reads the outbox and publishes to downstream systems. If the worker retries, the outbox row's processed=true flag dedupes.
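The body-hash pattern from the list above can be sketched as follows. The header name `x-account-id`, the function `request_fingerprint`, and the class `DedupeWindow` are assumptions for illustration; the window lives in memory here, where a real service would use a shared store.

```python
import hashlib
import json
import time
from typing import Callable, Dict, Mapping, Sequence


def request_fingerprint(body: Mapping, headers: Mapping[str, str],
                        header_fields: Sequence[str] = ("x-account-id",)) -> str:
    """SHA-256 over a canonical JSON body plus selected headers (header names are hypothetical)."""
    canonical_body = json.dumps(body, sort_keys=True, separators=(",", ":"))
    lowered = {name.lower(): value for name, value in headers.items()}
    parts = [canonical_body] + [f"{name}={lowered.get(name, '')}" for name in header_fields]
    return hashlib.sha256("\n".join(parts).encode("utf-8")).hexdigest()


class DedupeWindow:
    """Remembers fingerprints for ttl_seconds; in-memory only, for illustration."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._seen: Dict[str, float] = {}

    def is_duplicate(self, fingerprint: str) -> bool:
        now = self._clock()
        first_seen = self._seen.get(fingerprint)
        if first_seen is not None and now - first_seen < self._ttl:
            return True
        self._seen[fingerprint] = now   # first sighting, or the old entry expired
        return False
```

Sorting keys and fixing separators is what makes two differently ordered but equal JSON bodies hash the same. Any field that changes per attempt (timestamps, trace IDs) has to be excluded before hashing, or no retry will ever match.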
Server-side dedupe is strictly worse than client-supplied keys (the server can't know if two requests are the same logical operation if the bodies happen to differ). Reach for it only when the client can't help.
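For the transactional outbox, here is a rough sketch using SQLite from the standard library. The table layout and the function names (`create_schema`, `place_order`, `drain_outbox`) are hypothetical; the point is only that the business row and the outbox row commit together, and that the worker skips rows already flagged as processed.

```python
import sqlite3
from typing import Callable


def create_schema(db: sqlite3.Connection) -> None:
    db.executescript("""
        CREATE TABLE orders (id INTEGER PRIMARY KEY, item TEXT NOT NULL);
        CREATE TABLE outbox (
            id INTEGER PRIMARY KEY,
            payload TEXT NOT NULL,
            processed INTEGER NOT NULL DEFAULT 0
        );
    """)


def place_order(db: sqlite3.Connection, item: str) -> None:
    # One transaction: both rows commit, or neither does.
    with db:
        db.execute("INSERT INTO orders (item) VALUES (?)", (item,))
        db.execute("INSERT INTO outbox (payload) VALUES (?)", (f"order_placed:{item}",))


def drain_outbox(db: sqlite3.Connection, publish: Callable[[str], None]) -> int:
    """Publish every unprocessed row, mark it processed, and return how many were sent."""
    rows = db.execute(
        "SELECT id, payload FROM outbox WHERE processed = 0 ORDER BY id"
    ).fetchall()
    for row_id, payload in rows:
        publish(payload)
        with db:
            db.execute("UPDATE outbox SET processed = 1 WHERE id = ?", (row_id,))
    return len(rows)
```

If the worker crashes after `publish` but before the `UPDATE`, the row is published again on the next run. The outbox therefore gives at-least-once publishing, and the downstream consumer still has to be idempotent.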
When you don't need any of this
Idempotency is only interesting for state-mutating operations. Pure reads (GET /users/42) are idempotent by definition. Commutative operations such as INCR view_count are not idempotent, since every retry increments again, but you can skip the key if you tolerate the count being slightly off. Most of the time you don't tolerate it, and you do want a key.
Operations that aren't naturally idempotent and that you don't bother to make idempotent are the ones that bite. "Send notification" — without a key, a retry sends the email twice. "Charge card" — without a key, you double-bill. "Create user" — without a key, you have two users with the same email.
The rule: anything that costs the user money, sends them a message, or creates a record they'll see — needs an idempotency key.
Retries: the algorithm matters
The naive approach — retry once, immediately, on any error — is the worst possible behaviour. It doubles the transient load on a struggling backend at the moment it's least able to handle it.
The right approach is exponential backoff with jitter.
Exponential backoff
Wait progressively longer between retries:
attempt 1: wait 1s → retry
attempt 2: wait 2s → retry
attempt 3: wait 4s → retry
attempt 4: wait 8s → retry
attempt 5: wait 16s → retry
Total elapsed waiting: 1+2+4+8 = 15 seconds before the 5th attempt. Each retry gives the failing system more time to recover, and limits the number of retries you'll fire in any short window.
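The schedule above is a doubling sequence, which a few lines can reproduce. The function name `backoff_delays` and the `cap` parameter are assumptions; the text does not mention a cap, but an uncapped schedule grows without bound, so most implementations add one.

```python
from typing import List


def backoff_delays(retries: int, base: float = 1.0, factor: float = 2.0,
                   cap: float = 60.0) -> List[float]:
    """Wait before each retry: base, base*factor, base*factor^2, ..., never above cap."""
    return [min(cap, base * factor ** n) for n in range(retries)]


delays = backoff_delays(5)
print(delays)           # [1.0, 2.0, 4.0, 8.0, 16.0]
print(sum(delays[:4]))  # 15.0 seconds of waiting before the 5th attempt
```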
Why pure exponential isn't enough
Imagine a brief outage at t=0. 1000 clients all see the failure and start retrying. Without jitter, every client retries at exactly t=1, then t=3 (1+2), then t=7 (1+2+4), then t=15...
This is a retry storm. The recovering backend is hit by 1000 simultaneous requests at every backoff boundary. It fails again, every client backs off again to the same coordinated schedule, the cycle repeats until something gives up or the schedule drifts apart by chance.
Jitter
Add randomness so the retries spread out:
- Equal jitter: wait = backoff/2 + random(0, backoff/2). Each retry happens in the second half of its window.
- Full jitter: wait = random(0, backoff). Each retry happens uniformly anywhere in [0, backoff].
- Decorrelated jitter: wait = random(base, prev_wait * 3). Each wait is drawn relative to the previous wait rather than the attempt count; it is one of the variants compared in the AWS analysis.
The Marc Brooker / AWS analysis is unambiguous: full jitter is strictly better than no jitter for cascading-failure prevention. Equal jitter is fine. Anything without jitter is dangerous.
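The three formulas translate almost directly into code. This is a sketch under a couple of assumptions: the function names are made up, the random source is passed in so runs can be reproduced, and the decorrelated variant takes a `cap` argument that the formula above does not show.

```python
import random


def equal_jitter(backoff: float, rng: random.Random) -> float:
    """Half the backoff is fixed, the other half is random."""
    return backoff / 2 + rng.uniform(0, backoff / 2)


def full_jitter(backoff: float, rng: random.Random) -> float:
    """Uniform anywhere in [0, backoff]."""
    return rng.uniform(0, backoff)


def decorrelated_jitter(base: float, prev_wait: float, cap: float,
                        rng: random.Random) -> float:
    """Drawn from [base, prev_wait * 3], then clamped to cap (the cap is an assumption)."""
    return min(cap, rng.uniform(base, prev_wait * 3))


# Example: a 5-retry schedule with full jitter on top of 1, 2, 4, 8, 16 second backoffs.
rng = random.Random()
schedule = [full_jitter(2.0 ** n, rng) for n in range(5)]
```

With 1000 clients each drawing their own `schedule`, the retries land spread across each window instead of stacking up at t=1, t=3, t=7.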
When NOT to retry
Some failures should never be retried because retrying can't possibly help — and might make things worse:
- 400 Bad Request — the request is malformed. Sending it again won't fix the malformation.
- 401 Unauthorized — your credentials are wrong. Retrying with the same credentials is silly.
- 403 Forbidden — you're not allowed. Same answer next time.
- 404 Not Found — the resource doesn't exist. (Unless you have evidence of read-after-write lag.)
- 409 Conflict — there's a state conflict. The server needs human/business resolution, not a retry.
Status codes that ARE worth retrying:
- 408 Request Timeout: the server gave up waiting for your request. Try again.
- 429 Too Many Requests: rate limited. Honour Retry-After.
- 502 / 503 / 504: server-side transient. Try again with backoff.
- 5xx in general: assume transient unless proven otherwise.
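A retry library that retries on every status is a footgun. The AWS SDK, axios-retry, and similar retry wrappers default to a narrow list of retryable conditions (roughly network errors, 5xx, and throttling responses such as 429, with the details varying by library). That narrow shape is the right one.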
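The two lists above fit in a small classifier. The names `is_retryable` and `retry_after_seconds` are hypothetical, and the Retry-After parser handles only the delta-seconds form; the HTTP-date form is deliberately left out of this sketch.

```python
from typing import Mapping, Optional

NON_RETRYABLE = {400, 401, 403, 404, 409}
RETRYABLE_4XX = {408, 429}


def is_retryable(status: int) -> bool:
    """True for 408, 429, and any 5xx; False for everything else."""
    if status in NON_RETRYABLE:
        return False
    return status in RETRYABLE_4XX or 500 <= status <= 599


def retry_after_seconds(headers: Mapping[str, str]) -> Optional[float]:
    """Return Retry-After as seconds if it is a plain integer, else None."""
    for name, value in headers.items():
        if name.lower() == "retry-after":
            stripped = value.strip()
            return float(stripped) if stripped.isdigit() else None
    return None
```

When `retry_after_seconds` returns a value on a 429, waiting at least that long takes priority over whatever the backoff schedule would have picked.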
Circuit breakers
Even disciplined retry has a flaw: if a backend is durably down (not transient), every client retries until exhausted, generating a wave of pointless load. The circuit breaker fixes this with a state machine.
closed (passing) ──fail count threshold──▶ open (failing fast, no calls)
                                                │
                                         timeout elapsed
                                                ▼
                                    half-open (one trial call)
                                         │            │
                                 success │            │ failure
                                         ▼            ▼
                                      closed         open
- Closed: traffic flows. Each failure increments a counter. When the counter exceeds N consecutive failures (or a failure rate threshold), trip to open.
- Open: no traffic flows. Every call returns an immediate error without hitting the backend. After a configured timeout (often 30s), transition to half-open.
- Half-open: let exactly one trial call through. If it succeeds, transition to closed. If it fails, go back to open.
The half-open state is what distinguishes a circuit breaker from a simple killswitch — it's the recovery probe.
In production: resilience4j (Java), opossum (Node), Polly (.NET), gobreaker (Go). Wrap any RPC, retry inside the closed/half-open states, fail fast when open.
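To make the state machine concrete, here is a minimal single-threaded sketch. It is not how any of the libraries above implement it: the class and parameter names are made up, it trips when the consecutive-failure count reaches the threshold, and it does not enforce "exactly one trial call" under concurrency.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(Exception):
    """Raised instead of calling the backend while the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, reset_timeout: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None
        self.state = "closed"

    def call(self, fn: Callable[[], T]) -> T:
        if self.state == "open":
            if self._clock() - self._opened_at < self.reset_timeout:
                raise CircuitOpenError("failing fast")
            self.state = "half-open"          # timeout elapsed: allow a trial call
        try:
            result = fn()
        except Exception:
            self._on_failure()
            raise
        self._failures = 0                    # any success closes the circuit
        self.state = "closed"
        return result

    def _on_failure(self) -> None:
        self._failures += 1
        if self.state == "half-open" or self._failures >= self.failure_threshold:
            self.state = "open"               # trip, or re-trip after a failed probe
            self._opened_at = self._clock()
```

The clock is injectable so the timeout can be tested without sleeping. A production breaker would also need a lock around the state transitions.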
Tying it all together
A request that hits a flaky backend looks like:
client → idempotency middleware → circuit breaker → retry-with-jitter → backend
         │                        │                 │
         └─ key in cache → return └─ open → 503     └─ retryable status?
                                  └─ closed → call  └─ exponential backoff
Each layer composes:
- Idempotency makes retries safe to perform.
- Backoff with jitter makes retries safe for the recipient.
- Circuit breaker stops retries from trying a hopeless backend.
Without idempotency, retries duplicate work. Without backoff, retries thunder-herd. Without a circuit breaker, clients keep hammering a backend that is down, and it never gets a quiet window to recover.
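A client-side sketch of how two of these layers meet: one idempotency key reused across every attempt, with full-jitter backoff between attempts. The `send` callable is a hypothetical transport hook that returns an HTTP status; a circuit breaker would wrap `send` and is left out to keep the example short.

```python
import random
import time
from typing import Callable, Optional, Tuple

RETRYABLE = {408, 429, 500, 502, 503, 504}


def call_with_retries(send: Callable[[str], int], idempotency_key: str,
                      max_attempts: int = 5, base: float = 1.0, cap: float = 30.0,
                      sleep: Callable[[float], None] = time.sleep,
                      rng: Optional[random.Random] = None) -> Tuple[int, int]:
    """Return (final_status, attempts_made). All names here are illustrative."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    rng = rng or random.Random()
    attempt = 0
    while True:
        attempt += 1
        status = send(idempotency_key)        # same key on every attempt
        if status not in RETRYABLE or attempt == max_attempts:
            return status, attempt
        backoff = min(cap, base * 2 ** (attempt - 1))
        sleep(rng.uniform(0, backoff))        # full jitter
```

Because the key never changes between attempts, the server can treat every retry as the same logical request, which is what makes the retry loop safe in the first place.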
What can go wrong
Idempotency key TTL too short. A 1-hour TTL with a client that retries after a 2-hour outage = duplicate work. 24h is a sensible default.
Idempotency key collisions. UUIDs are fine; sequential integers are not. Collisions silently merge unrelated requests.
Retrying non-idempotent operations. "Send SMS" without a key + retry = two SMS. The cost is delivered to the user.
Storing idempotency cache in process memory. A pod restart wipes it; the next retry duplicates. Use Redis or the database.
Naive retry inside a loop. If your code retries each call AND the caller retries the whole flow, you've multiplied retries (5 × 5 = 25 attempts). Push retries to one layer.
Circuit breaker too sensitive. Trip-on-1-failure means a single network hiccup opens the circuit. Trip-on-failure-rate (e.g. >50% in a 30s window) is more robust.
Common tools in production
- Stripe Idempotency-Key header — the reference pattern for HTTP APIs.
- AWS SDK retries — built-in adaptive mode with full jitter; turn it on, walk away.
- resilience4j / opossum / Polly — circuit breaker libraries for the major language ecosystems.
- Temporal / AWS Step Functions — workflow engines that own retries + idempotency for you. Reach for these when retry logic spans hours/days.
- Kafka transactional producer + idempotent consumer — the canonical exactly-once-illusion for messaging.
Diagram conventions
The reliability stack between two services:
A ──▶ Idempotency ──▶ Circuit Breaker ──▶ Retry ──▶ B
      (dedupe key)    (closed/open)       (backoff
                                           + jitter)
In Mermaid:
sequenceDiagram
A->>+B: POST (Idempotency-Key=K)
B-->>-A: 200 OK
Note over A: response lost
A->>+B: POST (Idempotency-Key=K) // retry
B-->>-A: 200 OK (cached)
Memorise the sequence. It's the answer to half the "how do you make this reliable?" interview questions.
Tools in the wild
- spec: The reference idempotency-key pattern. 24h TTL, returns cached response on retry.
- library: Built-in adaptive retry with full jitter. The standard mode for production AWS workloads.
- library: resilience4j (free tier). Java circuit breaker, retry, bulkhead, time-limiter. Lightweight, modular, ubiquitous on the JVM.
- library: opossum (free tier). Node.js circuit breaker. Wraps any async function; trips on consecutive failures.