Event-Driven Architectures
SQS vs SNS vs EventBridge vs Kinesis vs MSK — picking from the messaging zoo.
The messaging zoo. Five AWS services that look superficially similar — they all involve "send a message, somebody receives it" — but model fundamentally different shapes. Pick wrong and you'll spend a quarter rebuilding when traffic shifts.
Analogy
Imagine the post office. SQS is registered mail to a single PO box: someone collects each parcel exactly once, signs for it, and the post office removes it from the queue. SNS is a mailing-list broadcast: send one bulletin and every subscriber gets their own copy. EventBridge is the central sorting office: events arrive from many sources, get routed by stamps and labels to the right destinations. Kinesis / MSK is the newspaper morgue: every issue is filed in chronological order, and any researcher can come back later and read from any past date — the paper itself is never destroyed.
The three shapes
Most "messaging" reduces to one of these:
Queue (point-to-point)
One sender. A pool of consumers. Each message goes to exactly one consumer, gets ack'd, and is removed.
producer ──▶ [queue] ──▶ consumer
Use cases: background jobs, work distribution, "process this once and only once".
Pub/Sub (fan-out)
One sender. Many subscribers. Each subscriber gets its own copy of every message.
                     ┌──▶ subscriber-A
producer ──▶ [topic] ──▶ subscriber-B
                     └──▶ subscriber-C
Use cases: "user signed up — email, billing, analytics, audit-log all need to know."
Log (replayable stream)
Records appended in order. Consumers read at their own offset; the log keeps the records for a configurable retention. Multiple independent consumers, each with their own progress.
producer ──▶ [log: r1, r2, r3, r4, ...]
                   ▲       ▲
                   │       └─ consumer-B at offset 2
                   └─── consumer-A at offset 0 (replaying history)
Use cases: clickstreams, change-data-capture, multiple downstream materialised views.
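To make the difference between the three shapes concrete, here is a minimal in-memory sketch. The class names `Queue`, `Topic`, and `Log` are illustrative only and do not correspond to any AWS API; the point is what happens to a message after it is read.

```python
from collections import deque


class Queue:
    """Point-to-point: each message is handed to exactly one consumer."""

    def __init__(self):
        self._messages = deque()

    def send(self, msg):
        self._messages.append(msg)

    def receive(self):
        # Removing on receive stands in for the receive + ack + delete cycle.
        return self._messages.popleft() if self._messages else None


class Topic:
    """Pub/sub: every subscriber gets its own copy of every message."""

    def __init__(self):
        self._subscribers = []

    def subscribe(self, callback):
        self._subscribers.append(callback)

    def publish(self, msg):
        for deliver in self._subscribers:
            deliver(msg)


class Log:
    """Replayable log: records stay put; each consumer tracks its own offset."""

    def __init__(self):
        self._records = []

    def append(self, record):
        self._records.append(record)

    def read(self, offset, limit=10):
        # Reading never removes anything, so any offset can be re-read later.
        return self._records[offset:offset + limit]
```

The queue forgets a message once it is consumed, the topic never stores it at all, and the log keeps it so a second consumer can start from offset 0 long after the first one has moved on.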
The AWS messaging zoo
| Service | Shape | Ordering | Throughput | Retention |
|---|---|---|---|---|
| SQS (standard) | queue | best-effort | nearly unlimited | up to 14 days |
| SQS FIFO | queue | per-MessageGroupId | 300/s default | up to 14 days |
| SNS | pub/sub | none | unbounded | none — push-then-forget |
| EventBridge | router | best-effort | quota-limited per bus | none by default; optional archive + replay |
| Kinesis | log | per-shard | per-shard quota | 1–365 days |
| MSK (Kafka) | log | per-partition | configurable | configurable |
SQS — the queue you reach for first
The right answer for almost any "background job" pattern. Producer pushes; consumers poll; one consumer wins each message; ack-then-delete.
Two flavours:
- Standard SQS: best-effort ordering, at-least-once delivery, unbounded throughput.
- FIFO SQS: strict per-MessageGroupId ordering, exactly-once, capped at 300 msg/s default (3000 with batching).
Production patterns:
- Visibility timeout: consumer pulls a message; SQS hides it for N seconds; consumer must DeleteMessage before the timeout or it reappears. Tune to "longest possible processing time × 2".
- Long polling (WaitTimeSeconds: 20): avoids tight loops; reduces empty ReceiveMessage charges.
- Dead-letter queue (DLQ): poison-pill protection. After N failed receives, move to a DLQ for inspection.
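A consumer loop that follows these patterns might look roughly like the sketch below. It assumes `sqs` is a boto3 SQS client (or any object exposing the same `receive_message` and `delete_message` methods); the function name `poll_once`, the `handle` callback, and the 60-second default are assumptions made for illustration.

```python
def poll_once(sqs, queue_url, handle, visibility_timeout=60):
    """Receive one batch, process it, and delete only what succeeded.

    Returns the number of messages deleted.
    """
    response = sqs.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=10,   # SQS returns at most 10 messages per call
        WaitTimeSeconds=20,       # long polling: wait up to 20s for work
        VisibilityTimeout=visibility_timeout,
    )
    deleted = 0
    # An empty poll comes back without a "Messages" key.
    for message in response.get("Messages", []):
        try:
            handle(message["Body"])
        except Exception:
            # Do not delete: the message reappears after the visibility
            # timeout and moves to the DLQ once maxReceiveCount is exceeded
            # (if a redrive policy is configured on the queue).
            continue
        sqs.delete_message(
            QueueUrl=queue_url, ReceiptHandle=message["ReceiptHandle"]
        )
        deleted += 1
    return deleted
```

Passing the client in as an argument keeps the loop easy to exercise against a stub before pointing it at a real queue.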
SNS — pub/sub the AWS way
The canonical pattern: SNS topic ⇒ many SQS queues. Each subscriber gets its own queue with its own DLQ, retry policy, and processing speed.
                    ┌──▶ [SQS welcome-emails] ──▶ Lambda
SNS user-signups ──▶│
                    └──▶ [SQS billing-records] ──▶ ECS worker
This is the default fan-out pattern in production AWS. Don't deliver SNS directly to Lambda for high-fan-out cases — you lose retry control. Always go SNS → SQS → consumer.
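One practical detail of SNS → SQS: unless raw message delivery is enabled on the subscription, the SQS body is a JSON envelope written by SNS, and the published payload sits in its `Message` field as a string. A consumer-side sketch, assuming the publisher sent JSON (the helper name `unwrap_sns_envelope` is hypothetical):

```python
import json


def unwrap_sns_envelope(sqs_body):
    """Return the published payload from an SQS body written by SNS.

    Assumes raw message delivery is off and the payload itself is JSON.
    """
    envelope = json.loads(sqs_body)
    if envelope.get("Type") != "Notification":
        raise ValueError("not an SNS notification envelope")
    # "Message" holds the published string exactly as the producer sent it.
    return json.loads(envelope["Message"])
```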
EventBridge — the routing bus
EventBridge is a message router with rules and schemas. The shape:
- Sources publish events: AWS services (S3, EC2 state changes, GuardDuty findings), SaaS partners (Stripe, Datadog, GitHub), or your own apps via PutEvents.
- Rules match events by JSON-pattern, optionally transform them.
- Targets receive matched events: Lambda, SQS, SNS, Step Functions, API destinations.
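To give a feel for how rule matching works, here is a deliberately simplified matcher. It covers only exact-value matching on nested fields; real EventBridge patterns also support operators such as prefix and numeric comparisons, which this sketch ignores. The function name `matches` is an illustration, not part of any AWS SDK.

```python
def matches(pattern, event):
    """Simplified exact-value matcher in the style of EventBridge patterns.

    A pattern maps field names to either a list of accepted values or a
    nested pattern. Every field in the pattern must match; fields the
    pattern does not mention are ignored.
    """
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            # Nested pattern: recurse into the corresponding sub-object.
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            # Leaf: the event value must be one of the listed values.
            return False
    return True
```

A rule for "objects created in the uploads bucket" would then be `{"source": ["aws.s3"], "detail-type": ["Object Created"], "detail": {"bucket": {"name": ["uploads"]}}}`, and any event carrying extra fields still matches as long as those three agree.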
Why EventBridge over SNS:
- Built-in source integrations. S3 → EventBridge is a single setting; S3 → SNS is more setup.
- Pattern matching. Filter events by content; deliver only matches. SNS now has filter policies but EventBridge is more expressive.
- Schema registry, archive + replay, API destinations for SaaS targets.
Why SNS over EventBridge: throughput, latency. SNS is in the millions/sec; EventBridge has lower per-bus quotas.
Kinesis vs MSK — the log shape
Both are ordered, replayable logs. The differences:
- Kinesis Data Streams is AWS-native, managed-shards, AWS SDK clients. Lower ops cost, but locks you into AWS.
- MSK is managed Apache Kafka. Your existing Kafka producers and consumers (Java, Go, Python) keep working. AWS runs the brokers, ZooKeeper / KRaft, the rebalancing.
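Ordering in both systems is per shard or per partition, and the record's key decides where it lands. Kinesis maps the partition key to a 128-bit integer with MD5 and routes it to the shard whose hash-key range contains that value. The sketch below assumes the shards split the hash space evenly, which is not guaranteed after resharding; `shard_for` is a hypothetical helper, not an SDK call.

```python
import hashlib

HASH_SPACE = 2 ** 128  # MD5 yields a 128-bit integer


def shard_for(partition_key, shard_count):
    """Map a partition key to a shard index, assuming evenly split ranges."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hash_key = int.from_bytes(digest, "big")
    # Scale the hash into [0, shard_count) by its position in the hash space.
    return hash_key * shard_count // HASH_SPACE
```

Because the mapping is deterministic, every record for the same key goes to the same shard and is read back in order. Kafka clients apply the same idea to partitions with their own hash function.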
When to pick MSK:
- You already have Kafka clients and don't want to rewrite.
- You need per-partition throughput beyond Kinesis's per-shard limits.
- You need Kafka-only features: tiered storage (3.6+), Kafka Connect, exactly-once across topics.
When to pick Kinesis:
- AWS-only stack, fewer ops, simpler IAM.
- Lower volume, where the per-shard pricing is simpler than per-broker.
There's also Kinesis Firehose (since renamed Amazon Data Firehose), which is a different thing: a managed delivery service from sources (Kinesis, direct PUT) into sinks (S3, Redshift, OpenSearch). No consumer code; you just deliver into the warehouse. Use it for "send my logs to S3" with batching and compression.
Picking the right service — a flowchart
need to fan out to N consumers, each gets a copy?
├── yes → SNS (or EventBridge if you need filtering / SaaS sources)
└── no
    └── one consumer pool, work distribution?
        ├── yes → SQS (FIFO if per-group ordering matters)
        └── no
            └── replayable log of historical records?
                ├── yes
                │   ├── already on Kafka → MSK
                │   └── AWS-native, lighter ops → Kinesis Data Streams
                └── no
                    └── direct delivery to S3/Redshift/OpenSearch → Firehose
Cross-cutting patterns
At-least-once vs exactly-once
Most message systems are at-least-once (SQS standard, Kinesis, Kafka by default). The consumer must be idempotent — processing the same message twice produces the same result. Idempotency is your problem; the message system is honest about its retries.
SQS FIFO and Kafka with transactions can give you exactly-once within their boundary. Across boundaries (Kafka → Postgres write), you still need idempotency keys.
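A minimal sketch of the idempotency-key idea, assuming each message carries a stable key chosen by the producer. The class name `IdempotentHandler` is hypothetical, and the in-memory set stands in for a durable store; in a real system the key and the side effect would be written in the same transaction so a crash cannot separate them.

```python
class IdempotentHandler:
    """Apply each message at most once per idempotency key."""

    def __init__(self, apply):
        self._apply = apply
        self._seen = set()  # stand-in for a durable table of processed keys

    def handle(self, idempotency_key, payload):
        """Return True if the payload was applied, False for a duplicate."""
        if idempotency_key in self._seen:
            return False
        self._apply(payload)
        # Record the key only after the side effect has succeeded.
        self._seen.add(idempotency_key)
        return True
```

A redelivered message then becomes a no-op: the second `handle("order-17", ...)` call returns `False` and the side effect does not run again.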
Dead-letter queues
Every queue, every Lambda invocation, every consumer should have a DLQ. After N retries, the poison pill goes to the DLQ — where you can inspect it, reproduce the issue, fix the bug, and replay if needed. DLQs have prevented more outages than they've caused; they're the cheapest insurance in messaging.
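For SQS, "after N retries" is expressed as a redrive policy on the source queue. The sketch below only builds the attribute payload; `redrive_attributes` is a hypothetical helper and the default of 5 receives is an arbitrary choice for the example.

```python
import json


def redrive_attributes(dlq_arn, max_receive_count=5):
    """Build the Attributes dict that attaches a DLQ to an SQS source queue."""
    return {
        # SQS expects the policy as a JSON-encoded string, not a nested dict.
        "RedrivePolicy": json.dumps({
            "deadLetterTargetArn": dlq_arn,
            "maxReceiveCount": max_receive_count,
        })
    }
```

With a boto3 SQS client this would typically be passed as `sqs.set_queue_attributes(QueueUrl=source_queue_url, Attributes=redrive_attributes(dlq_arn))`.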
Retries and backoff
The standard shape: exponential backoff with jitter, capped at a sensible max delay (often 60s), max retries 5–10. SQS, Lambda, EventBridge all have configurable retry policies. Use them; don't reimplement in user code.
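Purely to illustrate the shape those built-in policies follow, here is the "full jitter" variant of exponential backoff. The function name and defaults are assumptions for the example, with the 60-second cap taken from the text above.

```python
import random


def backoff_delay(attempt, base=1.0, cap=60.0, rng=random):
    """Full-jitter delay in seconds for a 0-based retry attempt."""
    # The ceiling doubles each attempt until it hits the cap.
    ceiling = min(cap, base * (2 ** attempt))
    # Picking uniformly below the ceiling spreads retries out so that
    # many failed clients do not all come back at the same instant.
    return rng.uniform(0, ceiling)
```

The jitter matters as much as the exponent: without it, every consumer that failed together retries together.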
Common bugs
Using SQS standard when you need ordering. Standard is best-effort; you can get duplicates and out-of-order delivery. Use FIFO with MessageGroupId if you genuinely need per-key ordering.
Forgetting to delete the message. SQS doesn't auto-ack. If your consumer hangs, the message reappears. Your code must DeleteMessage after successful processing — and only after.
Visibility timeout too short. If processing takes 60s and visibility is 30s, the message reappears mid-processing and gets picked up by another worker. Set visibility >> processing time, or extend it dynamically with ChangeMessageVisibility.
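One way to extend visibility dynamically is to reset the timeout before each unit of work. This sketch assumes the job can be split into steps and that `sqs` is a boto3 SQS client or a compatible stub; `process_in_steps` and the 120-second default are hypothetical.

```python
def process_in_steps(sqs, queue_url, message, steps, extend_by=120):
    """Run `steps` in order, pushing the visibility deadline out before each."""
    receipt = message["ReceiptHandle"]
    for step in steps:
        # Sets the timeout to `extend_by` seconds from now; it is a reset,
        # not an addition to whatever time was left.
        sqs.change_message_visibility(
            QueueUrl=queue_url,
            ReceiptHandle=receipt,
            VisibilityTimeout=extend_by,
        )
        step(message["Body"])
    # Delete only after every step has finished without raising.
    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=receipt)
```

If any step raises, the delete never happens and the message becomes visible again once the last extension runs out.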
Fanning out via N SNS topics instead of one + N filter rules. N topics = N permission policies = pain. One topic with filter policies (or EventBridge with rules) scales much better.
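The one-topic approach could be sketched as follows, assuming `sns` is a boto3 SNS client and that publishers set a message attribute named `event_type` (a hypothetical attribute chosen for this example).

```python
import json


def subscribe_with_filter(sns, topic_arn, queue_arn, event_types):
    """Subscribe an SQS queue to a topic, receiving only the given event types."""
    response = sns.subscribe(
        TopicArn=topic_arn,
        Protocol="sqs",
        Endpoint=queue_arn,
        Attributes={
            # The filter policy is passed as a JSON string and, by default,
            # is evaluated against message attributes rather than the body.
            "FilterPolicy": json.dumps({"event_type": list(event_types)}),
        },
        ReturnSubscriptionArn=True,
    )
    return response["SubscriptionArn"]
```

Each subscriber declares what it wants, and the topic's access policy stays in one place instead of being copied N times.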
Choosing Kafka because it's cool. Kafka is operationally heavy. If your team isn't ready to operate brokers, ZooKeeper / KRaft, and partition rebalancing, MSK helps but doesn't eliminate the work. SQS+SNS handles 90% of "we need messaging" cases at a fraction of the operational cost.
Tools in the wild
- Amazon SQS: Managed message queue (standard + FIFO). Default for work distribution.
- Amazon SNS: Pub/sub topics; fan out to SQS, Lambda, HTTPS, email, SMS.
- Amazon EventBridge: Event bus with rules, schemas, and SaaS-partner integrations.
- Amazon Kinesis Data Streams: Ordered, replayable log; per-shard ordering; consumer-managed offsets.
- Amazon MSK: Managed Kafka. Your existing Kafka producers/consumers keep working; AWS runs the brokers.
- AWS Step Functions: Coordinator that orchestrates events into workflows with retry/timeout per state.