Design a Chat Application
Prompt
We're adding messaging to our product. People should be able to message each other one-to-one and in group threads, see who's online, and — this is the part that keeps coming up in complaints from the prototype — actually receive messages reliably, including ones sent while they were offline. Design the messaging backend.
How this round runs
The brief names the symptoms (reliable delivery, offline, presence) but not the guarantees behind them. You drive: pin down ordering, delivery, and group sizes, then design it, and I'll push you hard on the connection/fan-out layer and on what happens when a server — or a region — goes down.
Model answer
1. Requirements I'd surface first.
- Delivery guarantee? "Reliably" has to become a contract: at-least-once delivery with client-side dedup, ordered per conversation. I'd explicitly not promise exactly-once across the wire (it's a myth without idempotency at the edge) and instead promise at-least-once + idempotent apply.
- Ordering scope? Global order is impossible and unnecessary; per-conversation ordering is what users actually perceive. I'd commit to a monotonic per-thread sequence.
- Group size? A 5-person DM and a 50,000-member channel are different fan-out problems. I'd ask for the cap — it decides fan-out-on-write vs on-read.
- Offline + multi-device? The complaint is offline delivery, so the message store is the source of truth and a connection is just a fast path. Multi-device means per-device read cursors, not per-user.
- Scale: say 50M users, 5M concurrent connections, peak 100k msgs/sec.
Non-functional: a message must survive a server crash (durable before ack), in-order per thread, and reach every device eventually even if it was offline at send time.
2. High-level design.
clients <==WebSocket==> connection/gateway tier (holds the live sockets)
|
message service --> durable log per conversation (source of truth)
| |
presence (Redis, TTL) fan-out to recipients' devices
|
routing: which gateway holds device D? (session registry)
The spine: a client sends over a persistent WebSocket to a connection gateway; the message service writes the message to a durable per-conversation log (assigning the sequence number) before acking the sender; then it fans out to recipients. The log is the source of truth — delivery is "has this device's read cursor advanced past this sequence number," which makes offline delivery just "catch up from your cursor on reconnect."
3. Deep-dive: the connection/fan-out layer and offline delivery. This is where chat is actually hard.
Routing to a live socket. Recipient B's socket lives on exactly one of N gateway nodes. So the message service needs a session registry (B's device → which gateway) — typically Redis, updated on connect/disconnect — and it pushes the message to that gateway, which writes it down the socket. If B has three devices on three gateways, that's three pushes.
Fan-out, committed by group size. For 1:1 and small groups I'd fan-out-on-write: on send, look up each recipient's gateway and push. Simple, low read-latency. But for a 50k-member channel, fanning out 50k pushes per message is a write amplification disaster. So above a threshold I switch that channel to fan-out-on-read: the message lands in the channel's log once, and members pull new messages from the log when their client asks (or via a single subscribe to the channel's stream). The cost of on-read is a little more read-side latency; the cost of on-write at huge fan-out is melting your gateways — so the group-size threshold is the real design knob.
Offline delivery, which is the actual complaint. Because the log is the source of truth and each device has a read cursor, "offline" is not special: when B's client reconnects, it sends its last-seen sequence per conversation and the message service streams everything after it, in order. No message is "lost in transit" because nothing was ever only in transit — it was durably in the log before the sender got its ack. The live socket push is purely a latency optimization on top of the cursor-based catch-up.
Presence. Each connected device heartbeats into Redis with a TTL; "online" = "key present." Presence is best-effort and intentionally not on the message durability path — a stale presence entry must never cause a dropped message, because delivery is driven by the log + cursor, not by "are they online."
4. A committed trade-off and its cost. I'd commit to at-least-once delivery with client-side idempotent dedup (every message carries a stable client-generated id; the receiver ignores duplicates). The cost I name out loud: clients will occasionally receive a duplicate (a redelivery after a missed ack), so every client must dedup by message id — I'm pushing a small amount of complexity to the edge in exchange for never silently dropping a message. I deliberately reject chasing exactly-once on the network, because the honest version of exactly-once is at-least-once + idempotency anyway.
5. Operational concerns / injected failure. Failure you're about to hand me: a whole gateway (or region) goes down, dropping thousands of live sockets. Because durability lives in the log, not the socket, no acked message is lost. What the user sees is a dropped connection; clients reconnect (with backoff) to a healthy gateway, re-register in the session registry, and replay from their cursor — so the failure degrades to "a brief reconnect + catch-up," not data loss. I'd detect it on gateway connection-count cliffs and reconnect-storm metrics, and protect the message service from the reconnect thundering-herd with jittered backoff and connection rate-limiting. The presence layer will briefly show ghosts (TTL not yet expired) — acceptable, because presence is best-effort by design. Rollback for a bad gateway deploy: drain connections off the bad version (clients reconnect elsewhere) rather than hard-killing sockets.
- Turned 'reliable' into an explicit contract: at-least-once + per-conversation ordering + idempotent dedup
- Made the durable log the source of truth so offline delivery is just cursor catch-up, not a special case
- Switched fan-out strategy by group size and named the cost of each (write amplification vs read latency)
- Committed to at-least-once and pushed dedup to the client deliberately, rejecting network exactly-once
- Reasoned about a region/gateway loss unprompted and showed why no acked message is lost
- 'A message is sent to a 50k-member channel — do you fan out 50k pushes?' → no, switch that channel to fan-out-on-read above a size threshold
- 'How does an offline user get the message they missed?' → per-device read cursor; on reconnect, stream everything after the cursor from the durable log
- 'How do you guarantee order?' → monotonic per-conversation sequence assigned at write, before ack; clients apply in sequence order
- 'A gateway dies with 100k sockets on it — what's lost?' → nothing acked; clients reconnect, re-register, and replay from cursor
- 'Exactly-once?' → don't promise it on the wire; at-least-once + idempotent message ids is the honest version