system-design · level 5

Message Queues & Event Streaming

Work queues, pub/sub, partitioned logs — and the delivery semantics each gives you.

150 XP

Message Queues & Event Streaming

Once you have more than one service, you need a way to pass work between them that doesn't require both to be online simultaneously. Queues and event streams are the answer. They turn synchronous coupling ("call this service and wait") into asynchronous coupling ("drop this message; some service will handle it eventually").

The category contains three quite different beasts. Mixing them up is one of the top architectural mistakes.

Three topologies

Work queue (point-to-point)

One producer, one queue, N competing consumers. Each message is delivered to exactly one consumer.

producer → [ queue: msg1, msg2, msg3, ... ] → ┬→ worker A
                                              ├→ worker B
                                              └→ worker C

This is the right shape for work distribution — process this order, send this email, resize this image. The consumers are interchangeable; the queue load-balances naturally.

Examples: AWS SQS, RabbitMQ in default mode, Sidekiq/Celery backed by Redis.

Pub/sub (fanout)

One producer, N subscribers. Each message is delivered to every subscriber.

producer → ┬→ subscriber A (gets every msg)
           ├→ subscriber B (gets every msg)
           └→ subscriber C (gets every msg)

This is the right shape for event broadcasting — "user signed up", "order placed", "config changed" — where multiple downstream systems each need to react independently.

Examples: Redis pub/sub, Google Pub/Sub, Kafka with multiple consumer groups, AWS SNS.

Partitioned log

Producer writes to a topic; the topic is split into N partitions. Each partition is an ordered, durable log. Consumers read from one or more partitions, tracking their own position (the offset).

producer (key=user-1) → [topic: orders]
                          partition-0: msg1, msg5, msg9, ...   ← consumer-A reads here
                          partition-1: msg2, msg7, ...          ← consumer-B reads here
                          partition-2: msg3, msg4, msg6, ...    ← consumer-C reads here

Same key always lands on the same partition (so all messages for user-1 are ordered with respect to each other). Different keys can be processed in parallel across partitions.

Examples: Apache Kafka, AWS Kinesis, Redis Streams, Pulsar.

The three delivery semantics

At-most-once

Send and forget. If the network drops the message, it's gone. No retries, no acks.

  • Pros: simple, lowest latency, no duplicates.
  • Cons: you can lose data.
  • When: telemetry, metrics, logs where 100% delivery isn't worth the complexity. UDP-style fire-and-forget.

At-least-once

Producer retries until acknowledged. Consumer acks after processing. If anything fails between processing and ack, the message gets redelivered → duplicates.

  • Pros: no data loss, simple to implement correctly.
  • Cons: consumers must handle duplicates (idempotency).
  • When: the default for almost everything. Order processing, email sending, billing — losing data is worse than handling a duplicate.

Exactly-once

Each message processed exactly one time, no losses, no duplicates. Sounds great, mostly a lie.

In its strictest sense, exactly-once across a distributed boundary is impossible (the consumer can crash between "processed" and "acked"). What people call exactly-once is one of:

  • Kafka's exactly-once-in-Kafka — transactional writes that span topic + offset commit. Works inside Kafka pipelines, not when you write to an external system.
  • Idempotent at-least-once — the consumer dedupes by message id. Functionally exactly-once from the application's perspective.

The honest design: at-least-once + idempotent consumers. Cheaper, simpler, and survives every failure mode.

Backpressure

When producers send faster than consumers process, something has to give. The four responses:

  1. Block the producer. Producer's send() call blocks until the queue has space. Slows the producer down to consumer speed. Most queues do this implicitly via TCP backpressure.
  2. Buffer in memory. Queue grows in RAM. Fast, but if the queue dies you lose everything (and OOM is a risk).
  3. Buffer to disk. Queue is durable, can absorb bursts. Slower, bounded by disk size. Kafka, RabbitMQ persistent queues.
  4. Drop messages. Queue rejects newer messages, or drops oldest. Last-resort, only for replaceable data.

Production queues are usually buffer-to-disk + alarm-on-depth-threshold. When the queue depth alarms, you scale consumers, fix the bug slowing them down, or drop noncritical traffic.

Dead-letter queues (DLQ)

Some messages will never succeed — bad data, code bug, deleted referenced records. After N retries, they need to come out of the main queue or they'll block it forever.

The pattern:

  1. Worker fails to process a message.
  2. Queue increments the message's retry counter.
  3. After N retries (typically 3–5), the queue moves the message to a separate dead-letter queue.
  4. Engineers inspect the DLQ, fix the bug, manually re-enqueue (or write the message off as bad data).

Every production queue should have a DLQ + alarm on its depth. Without one, a poison message can take down the whole pipeline.

Ordering guarantees

Topology Ordering guarantee
Work queue (general) None — competing consumers race
Work queue (FIFO mode, e.g. SQS-FIFO) Per message-group-id
Pub/sub None across subscribers
Partitioned log Total order within a partition

The big practical point: if you need messages for the same key processed in order (e.g., updates to the same user's row), put them on a topology that gives you per-key ordering — Kafka with key-based partitioning, or SQS-FIFO with message-group-id.

Kafka vs SQS — the big mental model

Both are "queues" in casual conversation, but they're architecturally different:

Property Kafka (log) SQS (work queue)
Storage Durable log; messages stay until retention expires Removed after consumer acks
Re-read Yes — reset the offset, replay history No — once acked, gone forever
Ordering Per partition Per message-group (FIFO mode)
Multiple consumers Yes — each consumer group has own offsets No — competing consumers
Failure model Crash → restart from offset Crash → message reappears after visibility timeout
Throughput Millions/sec Tens of thousands/sec
Operational complexity High (you operate Zookeeper or KRaft) Low (managed)

Pick Kafka when: you need replayability, multiple independent consumers, or very high throughput. Event-sourcing, CDC pipelines, analytics ingestion.

Pick SQS when: you want a simple work queue, AWS-native, no operational burden. Background jobs, async API processing.

Idempotency keys

Every message should carry an idempotency key (often just a UUID). Consumers track which keys they've processed; duplicates are dropped silently.

{
  "id": "msg-2026-04-28-uuid-1234",
  "type": "OrderPlaced",
  "data": { ... }
}

// consumer logic:
if (await processedKeys.has(msg.id)) return;
await processOrder(msg.data);
await processedKeys.add(msg.id);

The dedupe store can be a Redis set, a Postgres table, or the consumer's own state (e.g., Kafka transactions). Whatever store you pick, expire the keys after a reasonable retention (24h, 7d) so it doesn't grow forever.

The visibility-timeout trick

How does a competing-consumer queue know a message is being worked on without locking it forever? Visibility timeout.

consumer reads msg → message becomes invisible to other consumers for 30s
consumer processes successfully → consumer deletes the message
consumer crashes → 30s later, the message reappears for someone else

The timeout has to be longer than the worst-case processing time, or two workers will end up processing the same message. Tune it to (max processing time) + a buffer, and have the worker periodically extend it for genuinely-long jobs.

Picking the right tool

A reasonable progression:

  1. Background jobs in a single app — your language's stdlib (BullMQ for Node, Sidekiq for Ruby, Celery for Python) backed by Redis. No external infrastructure.
  2. Cross-service work distribution — SQS or RabbitMQ. Simple work queue with retries + DLQ.
  3. Event broadcasting — SNS + SQS fanout, or Redis pub/sub for ephemeral events, or Kafka if you also need durability.
  4. High-volume streaming + replay — Kafka. The complexity tax is real but unavoidable past a certain scale.
  5. Strict ordering on a key — Kafka with key-based partitioning, or SQS-FIFO.

Resist the temptation to start with Kafka. It's the right answer at scale, and the wrong answer for a 100-msg/sec feature.

Common production tools

  • AWS SQS / SNS — managed work queue + pub/sub, AWS-native, very simple.
  • Apache Kafka / Confluent Cloud / AWS MSK — durable log, high throughput, replay.
  • RabbitMQ — flexible broker, exchanges + queues + bindings.
  • Redis Streams — lightweight log inside Redis. Good for moderate throughput.
  • Google Pub/Sub — managed pub/sub at-least-once with ordering keys.
  • AWS Kinesis — managed Kafka-alike on AWS, simpler than self-hosted Kafka.

The interview answer

When asked "should we use a queue here?", run through:

  1. Synchronous or async? If the caller can wait, you might not need a queue.
  2. Work distribution or event broadcasting? Different topologies.
  3. Replayability needed? That's a log, not a work queue.
  4. Ordering matters? Per-key (partitioned log) or globally (single FIFO queue)?
  5. What semantic? At-least-once + idempotent is almost always the answer.
  6. What's the failure mode? Visibility timeout + DLQ + alarms.

Get those right and the rest is plumbing.

Tools in the wild

4 tools
  • Apache Kafkafree tier

    Durable partitioned log. Consumer offsets, exactly-once-in-Kafka, the streaming standard.

    service
  • Managed work queue. Visibility timeouts, DLQ, FIFO mode for ordering. The default for AWS.

    service
  • RabbitMQfree tier

    Classic broker. Exchanges + queues + bindings. Work queues, pub/sub, RPC patterns.

    service
  • Redis Streamsfree tier

    Lightweight Kafka-style log inside Redis. Consumer groups, ack, DLQ via XCLAIM.

    service