Event-Driven Architectures

SQS vs SNS vs EventBridge vs Kinesis vs MSK — picking from the messaging zoo.

The messaging zoo. Five AWS services that look superficially similar — they all involve "send a message, somebody receives it" — but model fundamentally different shapes. Pick wrong and you'll spend a quarter rebuilding when traffic shifts.

Analogy

Imagine the post office. SQS is registered mail to a single PO box: someone collects each parcel exactly once, signs for it, and the post office removes it from the queue. SNS is a mailing-list broadcast: send one bulletin and every subscriber gets their own copy. EventBridge is the central sorting office: events arrive from many sources, get routed by stamps and labels to the right destinations. Kinesis / MSK is the newspaper morgue: every issue is filed in chronological order, and any researcher can come back later and read from any past date — the paper itself is never destroyed.

The three shapes

Most "messaging" reduces to one of these:

Queue (point-to-point)

One sender. A pool of consumers. Each message goes to exactly one consumer, gets ack'd, and is removed.

producer ──▶ [queue] ──▶ consumer

Use cases: background jobs, work distribution, "process this once and only once".
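As a rough illustration of the queue shape, here is a minimal in-memory sketch. `InMemoryQueue` and the job names are hypothetical, and the toy removes a message as soon as it is received, whereas a real queue such as SQS waits for an explicit acknowledgement.

```python
from collections import deque


class InMemoryQueue:
    """Toy point-to-point queue: each message is handed to one consumer."""

    def __init__(self):
        self._messages = deque()

    def send(self, body):
        self._messages.append(body)

    def receive(self):
        # Hand the oldest message to whichever consumer asks first and
        # remove it, so no other consumer can see it.
        return self._messages.popleft() if self._messages else None


queue = InMemoryQueue()
for job in ["resize-1", "resize-2", "resize-3"]:
    queue.send(job)

# Two workers pull from the same queue; no job is seen by both.
worker_a = [queue.receive(), queue.receive()]
worker_b = [queue.receive()]
```

The point to notice is that the work is split between the workers rather than copied to each of them.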

Pub/Sub (fan-out)

One sender. Many subscribers. Each subscriber gets its own copy of every message.

                 ┌──▶ subscriber-A
producer ──▶ [topic] ──▶ subscriber-B
                 └──▶ subscriber-C

Use cases: "user signed up — email, billing, analytics, audit-log all need to know."
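The fan-out shape can be sketched the same way. `Topic` and the subscriber names below are hypothetical; the only behaviour being modelled is that every subscriber receives its own copy of every message.

```python
class Topic:
    """Toy pub/sub topic: every subscriber gets a copy of every message."""

    def __init__(self):
        self._inboxes = {}

    def subscribe(self, name):
        inbox = []
        self._inboxes[name] = inbox
        return inbox

    def publish(self, message):
        for inbox in self._inboxes.values():
            # Copy the message so subscribers cannot affect each other.
            inbox.append(dict(message))


topic = Topic()
email = topic.subscribe("email")
billing = topic.subscribe("billing")
analytics = topic.subscribe("analytics")

topic.publish({"event": "user_signed_up", "user_id": 42})
```

Contrast this with the queue sketch: one publish results in three deliveries, not one.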

Log (replayable stream)

Records are appended in order. Consumers read from their own offset; the log keeps records for a configurable retention period, whether or not anyone has read them. Multiple independent consumers each track their own progress.

producer ──▶ [log: r1, r2, r3, r4, ...]
                ▲   ▲
                │   └─ consumer-B at offset 2
                └─── consumer-A at offset 0 (replaying history)

Use cases: clickstreams, change-data-capture, multiple downstream materialised views.
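A minimal sketch of the log shape, assuming an in-memory list stands in for the stream. `Log` and `Consumer` are hypothetical names; the key difference from a queue is that reading never removes anything, and each consumer owns its offset.

```python
class Log:
    """Toy append-only log: reads never remove records."""

    def __init__(self):
        self._records = []

    def append(self, record):
        self._records.append(record)
        return len(self._records) - 1  # offset of the new record

    def read(self, offset, limit=10):
        return self._records[offset:offset + limit]


class Consumer:
    """Each consumer tracks its own position in the log."""

    def __init__(self, log, offset=0):
        self.log = log
        self.offset = offset

    def poll(self, limit=10):
        batch = self.log.read(self.offset, limit)
        self.offset += len(batch)
        return batch


log = Log()
for record in ["r1", "r2", "r3", "r4"]:
    log.append(record)

consumer_a = Consumer(log, offset=0)  # replaying history from the start
consumer_b = Consumer(log, offset=2)  # already caught up to offset 2

batch_a = consumer_a.poll()
batch_b = consumer_b.poll()
```

Because the records stay in place, a third consumer added next week could still start at offset 0, retention permitting.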

The AWS messaging zoo

Service        Shape     Ordering             Throughput                           Retention
SQS standard   queue     best-effort          nearly unlimited                     up to 14 days
SQS FIFO       queue     per-MessageGroupId   300 calls/s (3,000 msg/s batched)    up to 14 days
SNS            pub/sub   none                 high (regional quota)                none (push-then-forget)
EventBridge    router    best-effort          regional PutEvents quota             none by default; optional archive
Kinesis        log       per-shard            per-shard quota                      1–365 days
MSK (Kafka)    log       per-partition        configurable                         configurable

SQS — the queue you reach for first

The right answer for almost any "background job" pattern. Producer pushes; consumers poll; one consumer wins each message; ack-then-delete.

Two flavours:

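  • Standard SQS: best-effort ordering, at-least-once delivery, nearly unlimited throughput.
  • FIFO SQS: strict ordering within each MessageGroupId, exactly-once processing (duplicates sent within the 5-minute deduplication window are dropped), capped by default at 300 API calls/s per action (3,000 msg/s with batches of 10). High-throughput mode raises the cap.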
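The two FIFO guarantees, per-group ordering and deduplication, can be sketched with a toy model. `FifoQueue` is hypothetical and simplified: real SQS does not let the consumer choose a group on receive, and the clock is passed in explicitly here to keep the example deterministic.

```python
from collections import defaultdict, deque

DEDUP_WINDOW_SECONDS = 300  # SQS FIFO deduplication interval: 5 minutes


class FifoQueue:
    """Toy FIFO queue: ordered per group, duplicates dropped in a window."""

    def __init__(self):
        self._groups = defaultdict(deque)
        self._first_seen = {}  # deduplication id -> time first accepted

    def send(self, group_id, dedup_id, body, now):
        first = self._first_seen.get(dedup_id)
        if first is not None and now - first < DEDUP_WINDOW_SECONDS:
            return False  # duplicate inside the window: not enqueued again
        self._first_seen[dedup_id] = now
        self._groups[group_id].append(body)
        return True

    def receive(self, group_id):
        group = self._groups[group_id]
        return group.popleft() if group else None


q = FifoQueue()
q.send("order-1", "evt-1", "created", now=0)
q.send("order-2", "evt-2", "created", now=1)
retry_accepted = q.send("order-1", "evt-1", "created", now=2)  # producer retry
q.send("order-1", "evt-3", "paid", now=3)

order_1 = [q.receive("order-1"), q.receive("order-1"), q.receive("order-1")]
```

The producer's retry is absorbed by the deduplication id, and messages for `order-1` come out in the order they went in regardless of what happens in `order-2`.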

Production patterns:

  • Visibility timeout: consumer pulls a message; SQS hides it for N seconds; consumer must DeleteMessage before the timeout or it reappears. Tune to "longest possible processing time × 2".
  • Long polling (WaitTimeSeconds: 20) — avoids tight loops; reduces empty ReceiveMessage charges.
  • Dead-letter queue (DLQ): poison-pill protection. Once a message has been received more than maxReceiveCount times without being deleted, SQS moves it to the DLQ for inspection.
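The interaction between visibility timeout, delete, and the DLQ is easier to see in a small simulation. This is an in-memory sketch, not the SQS API: `VisibilityQueue` and its methods are hypothetical, and time is passed in as `now` so the behaviour is deterministic.

```python
class VisibilityQueue:
    """Toy queue with a visibility timeout and a dead-letter list."""

    def __init__(self, visibility_timeout, max_receive_count):
        self.visibility_timeout = visibility_timeout
        self.max_receive_count = max_receive_count
        self._messages = {}  # id -> {"body", "visible_at", "receives"}
        self.dead_letters = []
        self._next_id = 0

    def send(self, body):
        self._messages[self._next_id] = {
            "body": body, "visible_at": 0.0, "receives": 0,
        }
        self._next_id += 1

    def receive(self, now):
        for msg_id, msg in list(self._messages.items()):
            if msg["visible_at"] > now:
                continue  # still in flight with another consumer
            if msg["receives"] >= self.max_receive_count:
                # Too many failed attempts: park it for inspection.
                self.dead_letters.append(msg["body"])
                del self._messages[msg_id]
                continue
            msg["receives"] += 1
            msg["visible_at"] = now + self.visibility_timeout
            return msg_id, msg["body"]
        return None

    def delete(self, msg_id):
        # The consumer's acknowledgement: only call after success.
        self._messages.pop(msg_id, None)


q = VisibilityQueue(visibility_timeout=30, max_receive_count=2)
q.send("poison")

first = q.receive(now=0)     # delivered, then hidden until t=30
hidden = q.receive(now=10)   # nothing visible yet
second = q.receive(now=30)   # never deleted, so it reappears
third = q.receive(now=60)    # receive count exhausted: moved to the DLQ
```

A consumer that calls `delete` after successful processing never triggers the reappearance; the DLQ only collects messages that keep failing.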

SNS — pub/sub the AWS way

The canonical pattern: SNS topic ⇒ many SQS queues. Each subscriber gets its own queue with its own DLQ, retry policy, and processing speed.

                    ┌──▶ [SQS welcome-emails] ──▶ Lambda
SNS user-signups ──▶│
                    └──▶ [SQS billing-records] ──▶ ECS worker

This is the default fan-out pattern in production AWS. Don't deliver SNS directly to Lambda for high-fan-out cases — you lose retry control. Always go SNS → SQS → consumer.

EventBridge — the routing bus

EventBridge is a message router with rules and schemas. The shape:

  1. Sources publish events: AWS services (S3, EC2 state changes, GuardDuty findings), SaaS partners (Stripe, Datadog, GitHub), or your own apps via PutEvents.
  2. Rules match events by JSON-pattern, optionally transform them.
  3. Targets receive matched events: Lambda, SQS, SNS, Step Functions, API destinations.
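The three steps above can be sketched as a tiny router. The matcher below is a deliberately simplified stand-in for EventBridge event patterns: it handles only exact-value matching (a leaf in the pattern is a list of allowed values, a nested dict matches a nested object) and omits the prefix, numeric, and other operators. The rule names, target names, and instance id are hypothetical.

```python
def matches(pattern, event):
    """Return True if every field in the pattern matches the event."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            # Nested pattern: recurse into the nested object.
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            # Leaf: the event value must be one of the allowed values.
            return False
    return True


def route(event, rules):
    """Return the targets of every rule whose pattern matches the event."""
    return [target for pattern, target in rules if matches(pattern, event)]


rules = [
    ({"source": ["aws.ec2"],
      "detail": {"state": ["stopped", "terminated"]}}, "ops-alerts-queue"),
    ({"source": ["aws.s3"]}, "thumbnail-lambda"),
]

event = {
    "source": "aws.ec2",
    "detail-type": "EC2 Instance State-change Notification",
    "detail": {"instance-id": "i-0abc", "state": "terminated"},
}

targets = route(event, rules)
```

Fields the pattern does not mention are ignored, which is why the rule matches even though it says nothing about `detail-type` or `instance-id`.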

Why EventBridge over SNS:

  • Built-in source integrations. S3 → EventBridge is a single setting; S3 → SNS is more setup.
  • Pattern matching. Filter events by content; deliver only matches. SNS now has filter policies but EventBridge is more expressive.
  • Schema registry, archive + replay, API destinations for SaaS targets.

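Why SNS over EventBridge: throughput and latency. SNS supports higher publish rates with lower typical delivery latency; EventBridge's default PutEvents quotas are lower and vary by region. Check the current service quotas for your region before committing.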

Kinesis vs MSK — the log shape

Both are ordered, replayable logs. The differences:

  • Kinesis Data Streams is AWS-native, managed-shards, AWS SDK clients. Lower ops cost, but locks you into AWS.
  • MSK is managed Apache Kafka. Your existing Kafka producers and consumers (Java, Go, Python) keep working. AWS runs the brokers and the ZooKeeper / KRaft metadata layer; partition reassignment after scaling is still largely your job.
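Per-shard (or per-partition) ordering follows from how a record's key picks its shard. Kinesis Data Streams hashes the partition key with MD5 into a 128-bit integer and assigns the record to the shard whose hash key range contains it. The sketch below assumes the shards split the hash space evenly, which is not guaranteed after resharding; `shard_for` is a hypothetical helper, not an SDK call.

```python
import hashlib

HASH_SPACE = 2 ** 128  # MD5 yields a 128-bit value


def shard_for(partition_key, shard_count):
    """Map a partition key to a shard index, assuming equal hash ranges."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hash_value = int.from_bytes(digest, "big")
    # Scale the 128-bit hash into the range [0, shard_count).
    return hash_value * shard_count // HASH_SPACE


# Every record with the same key lands on the same shard, so records for
# one user stay in order relative to each other.
first = shard_for("user-42", shard_count=4)
again = shard_for("user-42", shard_count=4)
```

Kafka's default partitioner uses a different hash, but the consequence is the same: ordering is only guaranteed among records that share a key.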

When to pick MSK:

  • You already have Kafka clients and don't want to rewrite.
  • You need per-partition throughput beyond Kinesis's per-shard limits.
  • You need Kafka-only features: tiered storage (3.6+), Kafka Connect, exactly-once across topics.

When to pick Kinesis:

  • AWS-only stack, fewer ops, simpler IAM.
  • Lower volume, where the per-shard pricing is simpler than per-broker.

There's also Amazon Data Firehose (formerly Kinesis Data Firehose), which is a different thing: a managed delivery service from sources (Kinesis, direct PUT) into sinks (S3, Redshift, OpenSearch). There is no consumer code to write; records land in the destination with batching and compression handled for you. Use it for 'send my logs to S3'.

Picking the right service — a flowchart

need to fan out to N consumers, each gets a copy?
├── yes → SNS (or EventBridge if you need filtering / SaaS sources)
└── no
    └── one consumer pool, work distribution?
        ├── yes → SQS (FIFO if per-group ordering matters)
        └── no
            └── replayable log of historical records?
                ├── yes
                │   ├── already on Kafka → MSK
                │   └── AWS-native, lighter ops → Kinesis Data Streams
                └── no
                    └── direct delivery to S3/Redshift/OpenSearch → Firehose
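The same decision tree, transcribed as a function for readers who prefer code to diagrams. The function and parameter names are hypothetical; the branches mirror the flowchart one for one and add no extra criteria.

```python
def pick_service(fan_out=False, needs_filtering_or_saas=False,
                 work_distribution=False, per_group_ordering=False,
                 replayable_log=False, already_on_kafka=False):
    """Walk the flowchart top to bottom and return the suggested service."""
    if fan_out:
        return "EventBridge" if needs_filtering_or_saas else "SNS"
    if work_distribution:
        return "SQS FIFO" if per_group_ordering else "SQS"
    if replayable_log:
        return "MSK" if already_on_kafka else "Kinesis Data Streams"
    # Last branch: direct delivery to S3/Redshift/OpenSearch.
    return "Firehose"


background_jobs = pick_service(work_distribution=True)
clickstream = pick_service(replayable_log=True)
```

Like the flowchart, this checks fan-out first, so a workload that needs both fan-out and replay gets the fan-out answer; treat it as a starting point rather than a verdict.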

Cross-cutting patterns

At-least-once vs exactly-once

Most message systems are at-least-once (SQS standard, Kinesis, Kafka by default). The consumer must be idempotent — processing the same message twice produces the same result. Idempotency is your problem; the message system is honest about its retries.

SQS FIFO and Kafka with transactions can give you exactly-once within their boundary. Across boundaries (Kafka → Postgres write), you still need idempotency keys.
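A minimal sketch of an idempotent consumer using idempotency keys. `IdempotentConsumer`, the `idempotency_key` field, and the in-memory set are all assumptions for illustration; in a real system the key would typically be recorded in the same database transaction as the side effect.

```python
class IdempotentConsumer:
    """Applies each message at most once, keyed by an idempotency key."""

    def __init__(self):
        self.processed_keys = set()
        self.balance = 0

    def handle(self, message):
        key = message["idempotency_key"]
        if key in self.processed_keys:
            return False  # redelivery: already applied, skip the side effect
        self.balance += message["amount"]
        self.processed_keys.add(key)
        return True


consumer = IdempotentConsumer()
payment = {"idempotency_key": "pay-001", "amount": 25}

applied_first = consumer.handle(payment)
applied_again = consumer.handle(payment)  # at-least-once redelivery
```

The second delivery is acknowledged but changes nothing, which is exactly what at-least-once delivery requires of the consumer.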

Dead-letter queues

Every queue, every asynchronous Lambda invocation, every consumer should have a DLQ. After N retries, the poison pill goes to the DLQ, where you can inspect it, reproduce the issue, fix the bug, and replay if needed. DLQs have prevented more outages than they've caused; they're the cheapest insurance in messaging.

Retries and backoff

The standard shape: exponential backoff with jitter, capped at a sensible max delay (often 60s), max retries 5–10. SQS, Lambda, EventBridge all have configurable retry policies. Use them; don't reimplement in user code.
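For the cases where a managed retry policy is not available, the delay schedule itself is short. This sketch uses the "full jitter" variant (a uniform random delay between zero and the capped exponential ceiling), which is one common choice rather than the only one; the function name and defaults are hypothetical.

```python
import random


def backoff_delays(max_retries=5, base=1.0, cap=60.0, rng=None):
    """Return one jittered delay (in seconds) per retry attempt."""
    rng = rng or random.Random()
    delays = []
    for attempt in range(max_retries):
        # The exponential ceiling doubles each attempt, up to the cap.
        ceiling = min(cap, base * 2 ** attempt)
        # Full jitter: pick anywhere between 0 and the ceiling so that
        # many clients retrying at once do not synchronise.
        delays.append(rng.uniform(0, ceiling))
    return delays


delays = backoff_delays(max_retries=8, rng=random.Random(7))
```

Passing a seeded `random.Random` makes the schedule reproducible in tests; production callers would leave `rng` unset.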

Common bugs

Using SQS standard when you need ordering. Standard is best-effort; you can get duplicates and out-of-order delivery. Use FIFO with MessageGroupId if you genuinely need per-key ordering.

Forgetting to delete the message. SQS doesn't auto-ack. If your consumer hangs, the message reappears. Your code must DeleteMessage after successful processing — and only after.

Visibility timeout too short. If processing takes 60s and visibility is 30s, the message reappears mid-processing and gets picked up by another worker. Set visibility >> processing time, or extend it dynamically with ChangeMessageVisibility.

Fanning out via N SNS topics instead of one + N filter rules. N topics = N permission policies = pain. One topic with filter policies (or EventBridge with rules) scales much better.
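A sketch of the one-topic approach, assuming a simplified model of SNS subscription filter policies: each policy maps a message attribute name to a list of allowed values, and a subscription with no policy receives everything. Only exact string matching is modelled, and the subscription and attribute names are hypothetical.

```python
def policy_matches(filter_policy, attributes):
    """True if every attribute named in the policy has an allowed value."""
    return all(
        attributes.get(name) in allowed
        for name, allowed in filter_policy.items()
    )


# One topic, three subscriptions, each with its own filter policy.
subscriptions = {
    "welcome-emails": {"event_type": ["user_signed_up"]},
    "billing-records": {"event_type": ["user_signed_up", "plan_changed"]},
    "audit-log": {},  # no policy: receives every message
}


def deliver(attributes):
    """Return the subscriptions that would receive this message."""
    return sorted(
        name for name, policy in subscriptions.items()
        if policy_matches(policy, attributes)
    )


signup_targets = deliver({"event_type": "user_signed_up"})
plan_targets = deliver({"event_type": "plan_changed"})
```

Adding a new consumer means adding one subscription with its own policy, not a new topic with its own permissions.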

Choosing Kafka because it's cool. Kafka is operationally heavy: brokers, ZooKeeper / KRaft, partition rebalancing. If your team isn't ready to operate those, MSK helps but doesn't eliminate the work. SQS + SNS handles the large majority of "we need messaging" cases at a fraction of the operational cost.

Tools in the wild

  • Amazon SQS: managed message queue (standard + FIFO). Default for work distribution.
  • Amazon SNS: pub/sub topics; fan out to SQS, Lambda, HTTPS, email, SMS.
  • Amazon EventBridge: event bus with rules, schemas, and SaaS-partner integrations.
  • Amazon Kinesis Data Streams: ordered, replayable log; per-shard ordering; consumer-managed offsets.
  • Amazon MSK: managed Kafka; your existing Kafka producers/consumers keep working, AWS runs the brokers.
  • AWS Step Functions: coordinator that orchestrates events into workflows with retry/timeout per state.