Design a Notification System

System Design · Senior · ~45m
system-design · queues · idempotency · delivery-guarantees

Prompt

Right now every team in our company that needs to ping a user — password resets, "your order shipped," fraud alerts, marketing blasts — hand-rolls its own email or push code, and it's a mess: people get duplicate alerts, sometimes ten in a row, and a fraud alert once got stuck behind a marketing batch. Leadership wants "one service that sends notifications" that every team calls. Design it.

How this round runs

"One service that sends notifications" hides almost everything: which channels, what happens when a send fails, and whether a duplicate alert is annoying or dangerous. You drive: surface those, then design it, and I'll push you hard on delivery guarantees and idempotency, and inject a downstream-provider failure.

Model answer

1. Requirements I'd surface first.

  • Which channels, and is the channel chosen by the caller or by us? Push, email, and SMS at least. I'd let the caller specify a notification type and have the service map type → channels, filtered by user preferences, rather than have callers hard-code channels.
  • Delivery guarantee per type. A fraud alert and a marketing blast want opposite things: the alert must arrive (at-least-once, retried hard); the blast must not spam (rate-capped, droppable). So "guarantee" is per-type, not global.
  • Dedup: the complaint is "ten in a row" — so the service needs idempotency (same logical event sent twice = one notification) and rate caps per (user, type).
  • Ordering / priority: the fraud-alert-stuck-behind-marketing story means priority lanes — a critical notification must not queue behind a million marketing ones.
  • User preferences + quiet hours + unsubscribe (legal requirement for marketing).
  • Scale: say 10M users, bursts of millions of notifications (a marketing send), plus a steady trickle of transactional ones.

Non-functional: a transactional notification must be effectively never-lost; the system must not double-send; one team's million-message blast must not delay another team's fraud alert.
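
To make "guarantee is per-type, not global" concrete, here is a minimal sketch of a per-type policy table. The type names, field names, and numbers are all hypothetical, chosen only to show the shape; a real service would load these from config.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum


class Lane(Enum):
    CRITICAL = 0
    TRANSACTIONAL = 1
    BULK = 2


@dataclass(frozen=True)
class TypePolicy:
    channels: tuple[str, ...]   # candidate channels, before user preferences
    lane: Lane                  # which priority lane the message joins
    max_attempts: int           # how hard to retry before dead-lettering
    hourly_cap: int | None      # per (user, type) cap; None means uncapped
    respects_quiet_hours: bool


# Illustrative values only (assumptions, not numbers from the prompt).
POLICIES = {
    "fraud_alert": TypePolicy(("push", "sms", "email"), Lane.CRITICAL, 8, None, False),
    "order_shipped": TypePolicy(("push", "email"), Lane.TRANSACTIONAL, 5, 10, False),
    "marketing": TypePolicy(("email",), Lane.BULK, 1, 1, True),
}


def resolve_channels(notification_type: str, user_opt_outs: set[str]) -> list[str]:
    """Map type -> channels, then drop channels the user has opted out of."""
    policy = POLICIES[notification_type]
    return [c for c in policy.channels if c not in user_opt_outs]
```

Callers then pass only a type such as "order_shipped"; channel choice, lane, and retry aggressiveness all follow from the table.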

2. High-level design.

callers --> ingest API (validates, assigns dedup key) --> priority queues
                                                              |  (critical / transactional / bulk lanes)
                                          per-channel workers (push / email / SMS)
                                                |             |
                                        provider adapters   retry w/ backoff --> DLQ
                                                |
                                   preference + rate-cap + dedup store
  • Ingest API validates, applies user preferences, and stamps a dedup key.
  • Separate priority queues / lanes so critical never waits behind bulk — this directly fixes the "fraud alert stuck behind marketing" complaint.
  • Per-channel workers pull from the queue and call provider adapters (an abstraction over SendGrid/Twilio/APNs so a provider swap is a config change).
  • Retry with exponential backoff, and a dead-letter queue for sends that exhaust retries, so nothing silently vanishes and ops can inspect failures.
  • A dedup + rate-cap store the workers consult before sending.
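
The priority lanes can be sketched as one FIFO per lane with strict-priority dequeue. This is an in-memory illustration under assumed names (`LaneQueues`, the three lane labels); a real deployment would more likely use separate broker queues or topics, possibly with a dedicated worker pool per lane.

```python
from collections import deque

LANES = ("critical", "transactional", "bulk")  # highest priority first


class LaneQueues:
    """One FIFO per lane; dequeue always serves the highest non-empty lane."""

    def __init__(self):
        self._queues = {lane: deque() for lane in LANES}

    def enqueue(self, lane, message):
        self._queues[lane].append(message)

    def dequeue(self):
        for lane in LANES:
            if self._queues[lane]:
                return lane, self._queues[lane].popleft()
        return None  # every lane is empty


q = LaneQueues()
for i in range(3):
    q.enqueue("bulk", f"promo-{i}")
q.enqueue("critical", "fraud-alert-1")
print(q.dequeue())  # ('critical', 'fraud-alert-1')
```

Strict priority can starve the bulk lane under sustained critical load, which is usually acceptable here; if it is not, weighted dequeue or per-lane worker pools are the usual alternatives.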

3. Deep-dive: delivery guarantees + idempotency. This is the crux, and the two problems are joined at the hip.

The guarantee. End-to-end exactly-once is not achievable — providers can accept a message and then time out before we learn they did, so we cannot know whether to retry. So I commit to at-least-once delivery + idempotent send, which is the honest, achievable version of "don't lose it, don't duplicate it."

Idempotency, concretely. Two duplication sources, handled separately:

  • Caller-side duplicates ("ten in a row"): the caller supplies (or the ingest API derives) a dedup key per logical event — e.g. order-4711-shipped. The ingest API records "seen this key" atomically; a second submission of the same key is a no-op. That collapses the ten-in-a-row at the front door.
  • Our-own retry duplicates: when a worker retries a send, it might be retrying something the provider actually delivered (the ack was lost). I defend against this by passing a provider-level idempotency token (where the provider supports one) tied to the dedup key, so a retried send the provider already saw is dropped provider-side. Where a provider has no idempotency support, I accept that the channel is plain at-least-once, so the user might occasionally see a duplicate, and I say that out loud.
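
Both halves can be sketched in a few lines. In this sketch an in-memory SQLite table stands in for the shared dedup store (an assumption for illustration), and `accept_once` and `provider_token` are hypothetical helper names. The primary-key constraint makes "record this key if unseen" a single atomic statement, and the token is derived deterministically so every retry of the same event on the same channel carries the same value.

```python
import hashlib
import sqlite3

# In-memory SQLite stands in for the shared dedup store (sketch assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE seen_keys (dedup_key TEXT PRIMARY KEY)")


def accept_once(dedup_key: str) -> bool:
    """Return True only for the first submission of a given dedup key."""
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "INSERT OR IGNORE INTO seen_keys (dedup_key) VALUES (?)",
            (dedup_key,),
        )
    return cur.rowcount == 1  # 0 rows changed means the key already existed


def provider_token(dedup_key: str, channel: str) -> str:
    """Stable token: retries of the same (event, channel) reuse the same value."""
    return hashlib.sha256(f"{dedup_key}:{channel}".encode()).hexdigest()


# Ten submissions of the same logical event collapse to one accepted send.
results = [accept_once("order-4711-shipped") for _ in range(10)]
print(results.count(True))  # 1
```

How the token is actually passed (header, request field, or not at all) depends on the specific provider, so the adapter layer is where that detail would live.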

Rate caps live next to dedup: per (user, type) caps ("no more than N price-drop alerts/hour") so even distinct events don't become spam.
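
A per (user, type) cap can be sketched as a sliding-window log. The `RateCap` class and its parameters are assumptions for illustration, and the state is kept in process memory here; in the real service it would sit in the shared store alongside the dedup keys.

```python
import time
from collections import defaultdict, deque


class RateCap:
    """Allow at most `limit` sends per (user, type) within a sliding window."""

    def __init__(self, limit: int, window_seconds: float):
        self.limit = limit
        self.window = window_seconds
        self._sent = defaultdict(deque)  # (user_id, type) -> send timestamps

    def allow(self, user_id, notification_type, now=None) -> bool:
        now = time.monotonic() if now is None else now
        stamps = self._sent[(user_id, notification_type)]
        # Forget sends that have aged out of the window.
        while stamps and now - stamps[0] >= self.window:
            stamps.popleft()
        if len(stamps) >= self.limit:
            return False  # over the cap: drop or defer according to type policy
        stamps.append(now)
        return True
```

The optional `now` argument exists only to make the behavior easy to test with explicit timestamps.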

4. A committed trade-off and its cost. I'd commit to at-least-once with a dedup key + provider idempotency token, accepting that on a channel whose provider can't honor an idempotency token, a user may very rarely get a duplicate after a lost-ack retry. The cost I name out loud: I am choosing "occasionally one duplicate" over "occasionally zero notifications," because for a fraud alert a missed message is far worse than a repeated one. For marketing, where a duplicate is more embarrassing than a miss, I'd flip the dial — fewer retries, drop on doubt — and I'd make that retry aggressiveness a per-type policy rather than one global setting.
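
The per-type retry dial might look like the sketch below: exponential backoff with jitter, an attempt budget looked up by notification type, and a dead-letter list for anything that exhausts its budget. `ProviderTimeout`, `RETRY_POLICY`, and the attempt counts are hypothetical, and `send` is whatever callable the provider adapter exposes.

```python
import random
import time


class ProviderTimeout(Exception):
    """Raised by a provider adapter when the outcome of a send is unknown."""


# Max attempts per type; illustrative numbers, not a recommendation.
RETRY_POLICY = {"fraud_alert": 6, "marketing": 1}


def backoff_delay(attempt, base=0.5, cap=60.0):
    """Exponential backoff with full jitter; `attempt` starts at 1."""
    return random.uniform(0, min(cap, base * 2 ** (attempt - 1)))


def send_with_retry(message, send, dead_letters, sleep=time.sleep):
    """Try `send` up to the type's attempt budget, then dead-letter the message."""
    max_attempts = RETRY_POLICY[message["type"]]
    for attempt in range(1, max_attempts + 1):
        try:
            send(message)
            return True
        except ProviderTimeout:
            if attempt < max_attempts:
                sleep(backoff_delay(attempt))
    dead_letters.append(message)  # nothing silently vanishes
    return False
```

With these numbers a fraud alert is retried up to six times, while a marketing message gets a single attempt and is parked in the dead-letter list on the first ambiguous failure.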

5. Operational concerns / injected failure. Failure you're about to hand me: a downstream provider (say the SMS gateway) goes down or starts erroring. If workers naively keep retrying, they hammer a dead provider and the queue backs up across all channels. I defend with a circuit breaker per provider — trip it on elevated error rate, stop sending to that provider, and either fail over to a secondary provider or park those messages (not the whole queue) until it recovers. Messages that exhaust retries go to the DLQ, not the void, so they can be replayed once the provider heals. I'd detect it on per-provider success-rate and queue-depth-by-lane dashboards, alerting when the critical lane's depth climbs. Rollback: provider routing is config, so I can reroute a channel to a backup provider without a deploy.
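
A per-provider circuit breaker can be sketched as follows. For brevity this version trips on a count of consecutive failures rather than on a measured error rate, which is a simplification of what I described; the class, thresholds, and provider names are all hypothetical.

```python
import time


class CircuitBreaker:
    """Opens after N consecutive failures; allows probes again after a cooldown."""

    def __init__(self, failure_threshold=5, cooldown_seconds=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown = cooldown_seconds
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the breaker is closed

    def allow_request(self):
        if self._opened_at is None:
            return True
        # Once the cooldown has elapsed, let traffic through again as a probe.
        return self._clock() - self._opened_at >= self.cooldown

    def record_success(self):
        self._failures = 0
        self._opened_at = None

    def record_failure(self):
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._opened_at = self._clock()  # (re)open and restart the cooldown


def pick_provider(providers, breakers):
    """Return the first provider whose breaker allows traffic, else None (park)."""
    for name in providers:
        if breakers[name].allow_request():
            return name
    return None
```

Because each provider has its own breaker, a dead SMS gateway only affects SMS routing: `pick_provider` fails over to the secondary if one is healthy, and returns `None` when the messages for that channel should be parked.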

Signals — what a strong answer shows
  • Made delivery guarantees per-type (fraud alert retried hard, marketing droppable) instead of one global promise
  • Used priority lanes to keep critical notifications from queuing behind bulk sends
  • Distinguished caller-side dedup (dedup key) from retry-side dedup (provider idempotency token)
  • Committed to at-least-once and named the cost (rare duplicate over a missed alert), tunable per type
  • Added a circuit breaker + DLQ for provider failure unprompted, so a dead provider doesn't stall everything
Follow-ups — where it goes next
  • 'A worker retries a send the provider already delivered — how do you not double-send?' → provider idempotency token tied to the dedup key; lost-ack retries are dropped provider-side
  • 'A team sends a million marketing messages — how does a fraud alert still go out instantly?' → separate priority lanes, so bulk never blocks critical
  • 'The SMS provider is down' → circuit-break that provider, fail over or park its messages, DLQ exhausted ones — don't stall other channels
  • 'Fan-out-on-write vs fan-out-on-read for a broadcast to all users?' → enqueue the broadcast once and let workers expand the audience, rather than having the caller enqueue 10M rows synchronously
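
For the last follow-up, worker-side expansion can be sketched as a generator that turns one broadcast job into per-user batches. The job fields, the `expand_audience` helper, and the batch size are assumptions for illustration; the per-user dedup key reuses the same idea as the ingest-side key.

```python
from itertools import islice


def expand_audience(user_ids, batch_size):
    """Yield the audience in batches so no step materializes all users at once."""
    it = iter(user_ids)
    while batch := list(islice(it, batch_size)):
        yield batch


# The caller enqueues one small job; a worker expands it later.
job = {"type": "marketing", "campaign": "spring-sale"}
for batch in expand_audience(range(10), batch_size=4):
    per_user = [
        {"user_id": uid, "type": job["type"], "dedup_key": f"{job['campaign']}:{uid}"}
        for uid in batch
    ]
    # Each per-user message would be enqueued on the bulk lane here.
```

Deriving the dedup key from the campaign and user ID means a crashed worker can re-expand the same job without sending anyone the campaign twice.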