Event-Driven Architectures

SQS vs SNS vs EventBridge vs Kinesis vs MSK — picking from the messaging zoo.

The messaging zoo. Five AWS services that look superficially similar — they all involve "send a message, somebody receives it" — but model fundamentally different shapes. Pick wrong and you'll spend a quarter rebuilding when traffic shifts.

Analogy

Imagine the post office. SQS is registered mail to a single PO box: someone collects each parcel, signs for it, and the post office removes it from the queue. SNS is a mailing-list broadcast: send one bulletin and every subscriber gets their own copy. EventBridge is the central sorting office: events arrive from many sources and get routed by stamps and labels to the right destinations. Kinesis and MSK are the newspaper morgue: every issue is filed in chronological order, and any researcher can come back later and read from any past date. Reading an issue never destroys it; issues only leave the archive when their retention period runs out.

The three shapes

Most "messaging" reduces to one of these:

Queue (point-to-point)

One or more producers. A pool of consumers. Each message is handed to one consumer, acknowledged, and then removed.

producer ──▶ [queue] ──▶ consumer

Use cases: background jobs, work distribution, "exactly one worker should handle this task".
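A minimal in-memory sketch of the queue shape, using Python's standard `queue.Queue` as a stand-in for a real broker. The function name, worker names, and job payloads are made up for illustration; the point is only that each job is taken by exactly one worker from the pool.

```python
import queue
import threading


def run_worker_pool(jobs, worker_count=3):
    """Distribute jobs across a pool; return {job: [workers that handled it]}."""
    q = queue.Queue()
    for job in jobs:
        q.put(job)

    handled = {}
    lock = threading.Lock()

    def worker(name):
        while True:
            try:
                job = q.get_nowait()  # only one worker receives each job
            except queue.Empty:
                return  # nothing left to do
            with lock:
                handled.setdefault(job, []).append(name)
            q.task_done()  # the in-memory analogue of an ack

    threads = [
        threading.Thread(target=worker, args=(f"worker-{i}",))
        for i in range(worker_count)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return handled


result = run_worker_pool([f"job-{n}" for n in range(10)])
```

Every key in `result` maps to a single worker name, regardless of how the threads happen to interleave.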

Pub/Sub (fan-out)

One sender. Many subscribers. Each subscriber gets its own copy of every message.

                 ┌──▶ subscriber-A
producer ──▶ [topic] ──▶ subscriber-B
                 └──▶ subscriber-C

Use cases: "user signed up — email, billing, analytics, audit-log all need to know."
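The fan-out shape can be sketched in a few lines. This is an in-memory toy, not an SNS client; the `Topic` class and the subscriber names are assumptions chosen to mirror the sign-up example.

```python
class Topic:
    """In-memory fan-out: every subscriber gets its own copy of each message."""

    def __init__(self):
        self.inboxes = {}  # subscriber name -> list of delivered messages

    def subscribe(self, name):
        self.inboxes.setdefault(name, [])

    def publish(self, message):
        for inbox in self.inboxes.values():
            # Copy the message so one subscriber cannot mutate another's view.
            inbox.append(dict(message))


topic = Topic()
for name in ("email", "billing", "analytics"):
    topic.subscribe(name)

topic.publish({"event": "user_signed_up", "user_id": 42})
```

After the single `publish` call, each of the three inboxes holds its own copy of the event.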

Log (replayable stream)

Records are appended in order. Consumers read from their own offset; the log keeps the records for a configurable retention period, whether or not anyone has read them. Multiple independent consumers each track their own progress.

producer ──▶ [log: r1, r2, r3, r4, ...]
                ▲   ▲
                │   └─ consumer-B at offset 2
                └─── consumer-A at offset 0 (replaying history)

Use cases: clickstreams, change-data-capture, multiple downstream materialised views.
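A small sketch of the log shape, assuming an in-memory list as the log. The `Log` class and its method names are hypothetical; real systems such as Kinesis and Kafka add shards or partitions and retention on top of this idea.

```python
class Log:
    """Append-only log; reading never removes records."""

    def __init__(self):
        self.records = []
        self.offsets = {}  # consumer name -> next offset to read

    def append(self, record):
        self.records.append(record)
        return len(self.records) - 1  # offset of the new record

    def read(self, consumer, max_records=10):
        start = self.offsets.get(consumer, 0)
        batch = self.records[start:start + max_records]
        self.offsets[consumer] = start + len(batch)
        return batch

    def seek(self, consumer, offset):
        """Move a consumer's position, for example back to 0 to replay."""
        self.offsets[consumer] = offset


log = Log()
for record in ("r1", "r2", "r3", "r4"):
    log.append(record)

first = log.read("consumer-A", max_records=2)
log.seek("consumer-B", 2)
tail = log.read("consumer-B")
log.seek("consumer-A", 0)  # replay history from the beginning
replay = log.read("consumer-A")
```

Consumer A and consumer B advance independently, and rewinding A does not affect B or the stored records.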

The AWS messaging zoo

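Service	Shape	Ordering	Throughput	Retention
SQS standard	queue	best-effort	nearly unlimited	up to 14 days (default 4)
SQS FIFO	queue	strict per MessageGroupId	300 calls/s per API action by default (3,000 msg/s with batching)	up to 14 days (default 4)
SNS	pub/sub	none on standard topics	very high; quota varies by Region	none (push-then-forget)
EventBridge	router	best-effort	per-Region PutEvents quota (adjustable)	none by default; optional archive + replay
Kinesis Data Streams	log	per-shard	per-shard quota (1 MB/s or 1,000 records/s written)	24 hours by default, up to 365 days
MSK (Kafka)	log	per-partition	scales with brokers and partitions	configurable per topic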

SQS — the queue you reach for first

The right answer for almost any "background job" pattern. Producer pushes; consumers poll; one consumer wins each message; deleting the message is the ack.

Two flavours:

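Standard SQS: best-effort ordering, at-least-once delivery, nearly unlimited throughput.
FIFO SQS: strict per-MessageGroupId ordering, exactly-once processing (duplicates sent within the 5-minute deduplication window are dropped), capped by default at 300 API calls/s per action (3,000 msg/s with batches of 10). High-throughput mode raises the cap.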

Production patterns:

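Visibility timeout: consumer pulls a message; SQS hides it for N seconds (30 by default); consumer must DeleteMessage before the timeout or it reappears. Tune to "longest possible processing time × 2".
Long polling (WaitTimeSeconds: 20): the receive call waits up to 20 seconds for a message instead of returning empty straight away. This avoids tight polling loops and reduces the number of billed empty ReceiveMessage calls.
Dead-letter queue (DLQ): poison-pill protection. After N failed receives (maxReceiveCount in the redrive policy), SQS moves the message to a DLQ for inspection.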
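To make the interaction between visibility timeout and the DLQ concrete, here is a toy model. It is not the SQS API: `SimQueue` and its methods are hypothetical, and time is passed in explicitly as `now` so the behaviour is deterministic.

```python
class SimQueue:
    """Toy model of visibility timeout and redrive to a dead-letter queue."""

    def __init__(self, visibility_timeout, max_receive_count):
        self.visibility_timeout = visibility_timeout
        self.max_receive_count = max_receive_count
        self.messages = {}  # id -> {"body", "visible_at", "receives"}
        self.dlq = []
        self._next_id = 0

    def send(self, body):
        self.messages[self._next_id] = {"body": body, "visible_at": 0, "receives": 0}
        self._next_id += 1

    def receive(self, now):
        for msg_id, m in list(self.messages.items()):
            if m["visible_at"] > now:
                continue  # still hidden: another consumer has it in flight
            if m["receives"] >= self.max_receive_count:
                self.dlq.append(m["body"])  # too many failed receives
                del self.messages[msg_id]
                continue
            m["receives"] += 1
            m["visible_at"] = now + self.visibility_timeout
            return msg_id, m["body"]
        return None

    def delete(self, msg_id):
        """The consumer's ack: without this call the message comes back."""
        self.messages.pop(msg_id, None)


q = SimQueue(visibility_timeout=30, max_receive_count=2)
q.send("poison")
first = q.receive(now=0)     # delivered, then hidden until t=30
hidden = q.receive(now=10)   # nothing visible yet
second = q.receive(now=30)   # never deleted, so it reappears
third = q.receive(now=60)    # receive count exhausted: moved to the DLQ
```

Because the consumer never calls `delete`, the message is delivered twice and then lands in `q.dlq` instead of being delivered a third time.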

SNS — pub/sub the AWS way

The canonical pattern: SNS topic ⇒ many SQS queues. Each subscriber gets its own queue with its own DLQ, retry policy, and processing speed.

                    ┌──▶ [SQS welcome-emails] ──▶ Lambda
SNS user-signups ──▶│
                    └──▶ [SQS billing-records] ──▶ ECS worker

This is the default fan-out pattern in production AWS. Avoid delivering SNS directly to Lambda for high-fan-out cases: you give up per-subscriber buffering and fine-grained retry control. Prefer SNS → SQS → consumer.
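One practical detail of SNS → SQS: unless raw message delivery is enabled on the subscription, the SQS message body is a JSON envelope whose `Message` field carries the published payload as a string. A hedged sketch of unwrapping it, assuming the publisher sent JSON (the `unwrap_sns` helper and the sample ARN are illustrative, not part of any SDK):

```python
import json


def unwrap_sns(sqs_body: str) -> dict:
    """Return the published payload from an SQS body that may be SNS-wrapped."""
    envelope = json.loads(sqs_body)
    if isinstance(envelope, dict) and envelope.get("Type") == "Notification":
        # Standard SNS envelope: the payload is a JSON string under "Message".
        return json.loads(envelope["Message"])
    # Raw message delivery: the body already is the payload.
    return envelope


wrapped = json.dumps({
    "Type": "Notification",
    "TopicArn": "arn:aws:sns:us-east-1:123456789012:user-signups",  # made-up ARN
    "Message": json.dumps({"event": "user_signed_up", "user_id": 42}),
})
payload = unwrap_sns(wrapped)
```

Handling both cases in one helper keeps consumers working if raw message delivery is switched on later.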

EventBridge — the routing bus

EventBridge is a message router with rules and schemas. The shape:

Sources publish events: AWS services (S3, EC2 state changes, GuardDuty findings), SaaS partners (Stripe, Datadog, GitHub), or your own apps via PutEvents.
Rules match events against a JSON event pattern and can optionally transform them before delivery.
Targets receive matched events: Lambda, SQS, SNS, Step Functions, API destinations.
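To illustrate what "match by JSON pattern" means, the sketch below implements a deliberately simplified subset of event-pattern matching: exact-value lists and nested objects only. The real EventBridge matcher supports more (prefix, numeric, anything-but, and so on); the `matches` function and the bucket name are assumptions for illustration.

```python
def matches(pattern: dict, event: dict) -> bool:
    """Simplified matcher: every pattern key must match the event."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            # Nested object: recurse into the same field of the event.
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            # Leaf: the pattern lists the allowed values for this field.
            return False
    return True


pattern = {
    "source": ["aws.s3"],
    "detail-type": ["Object Created"],
    "detail": {"bucket": {"name": ["uploads-bucket"]}},
}

event = {
    "source": "aws.s3",
    "detail-type": "Object Created",
    "detail": {"bucket": {"name": "uploads-bucket"}, "object": {"key": "a.png"}},
}

other = dict(event, detail={"bucket": {"name": "logs-bucket"}})

hit = matches(pattern, event)
miss = matches(pattern, other)
```

Fields that the pattern does not mention, such as `detail.object.key` here, are ignored, which is what lets one rule match a whole family of events.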

Why EventBridge over SNS:

Built-in source integrations. S3 → EventBridge is a single setting; S3 → SNS is more setup.
Pattern matching. Filter events by content; deliver only matches. SNS now has filter policies but EventBridge is more expressive.
Schema registry, archive + replay, API destinations for SaaS targets.

Why SNS over EventBridge: throughput and latency. SNS is built for very high publish rates with low delivery latency; EventBridge has lower default PutEvents quotas and typically adds more end-to-end latency.

Kinesis vs MSK — the log shape

Both are ordered, replayable logs. The differences:

Kinesis Data Streams is AWS-native: managed shards, accessed through the AWS SDK. Lower ops cost, but it locks you into AWS.
MSK is managed Apache Kafka. Your existing Kafka producers and consumers (Java, Go, Python) keep working. AWS runs the brokers and the ZooKeeper / KRaft metadata layer; partition placement and rebalancing after scaling largely remain your job.
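Both services get their ordering guarantee the same way: a key is hashed to pick a shard or partition, so all records with the same key land in one ordered sequence. The sketch below follows the Kinesis approach (MD5 of the partition key mapped onto a 128-bit hash-key space) and assumes the shards split that space evenly. Kafka's default partitioner uses a different hash but the same idea. `shard_for` is a hypothetical helper, not an SDK call.

```python
import hashlib


def shard_for(partition_key: str, shard_count: int) -> int:
    """Map a partition key to a shard index, assuming evenly split hash ranges."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hash_key = int.from_bytes(digest, "big")  # value in [0, 2**128)
    range_size = 2 ** 128 // shard_count
    # Clamp so the leftover values at the top of the range go to the last shard.
    return min(hash_key // range_size, shard_count - 1)


keys = [f"user-{n}" for n in range(100)]
assignments = {key: shard_for(key, shard_count=4) for key in keys}
```

Records for `user-42` always map to the same shard, so they stay in order relative to each other; records for different users may be spread across shards and have no ordering relationship.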

When to pick MSK:

You already have Kafka clients and don't want to rewrite.
You need per-partition throughput beyond Kinesis's per-shard limits.
You need Kafka-only features: tiered storage (3.6+), Kafka Connect, exactly-once across topics.

When to pick Kinesis:

AWS-only stack, fewer ops, simpler IAM.
Lower volume, where the per-shard pricing is simpler than per-broker.

There's also Kinesis Firehose (since renamed Amazon Data Firehose), which is a different thing: a managed delivery service from sources (Kinesis, direct PUT) into sinks (S3, Redshift, OpenSearch). No consumer code; you configure the destination and it delivers. Use it for "send my logs to S3" with batching and compression.

Picking the right service — a flowchart

need to fan out to N consumers, each gets a copy?
├── yes → SNS (or EventBridge if you need filtering / SaaS sources)
└── no
    └── one consumer pool, work distribution?
        ├── yes → SQS (FIFO if per-group ordering matters)
        └── no
            └── replayable log of historical records?
                ├── yes
                │   ├── already on Kafka → MSK
                │   └── AWS-native, lighter ops → Kinesis Data Streams
                └── no
                    └── direct delivery to S3/Redshift/OpenSearch → Firehose
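The same decision tree, written as a function so the branch order is explicit. The function and parameter names are invented for this sketch; the logic is just a transcription of the flowchart above.

```python
def pick_service(
    fan_out=False,
    needs_filtering_or_saas=False,
    work_queue=False,
    per_group_ordering=False,
    replayable_log=False,
    already_on_kafka=False,
    deliver_to_store=False,
):
    """Walk the flowchart top to bottom and return the suggested service."""
    if fan_out:
        return "EventBridge" if needs_filtering_or_saas else "SNS"
    if work_queue:
        return "SQS FIFO" if per_group_ordering else "SQS"
    if replayable_log:
        return "MSK" if already_on_kafka else "Kinesis Data Streams"
    if deliver_to_store:
        return "Firehose"
    return None  # the flowchart has no branch for this combination


choice = pick_service(work_queue=True, per_group_ordering=True)
```

Because the checks run in flowchart order, a requirement higher in the tree wins when several flags are set at once.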

Cross-cutting patterns

At-least-once vs exactly-once

Most message systems are at-least-once (SQS standard, Kinesis, Kafka by default). The consumer must be idempotent: processing the same message twice must leave the system in the same state as processing it once. Idempotency is your problem; the message system is honest about its retries.

SQS FIFO and Kafka with transactions can give you exactly-once within their boundary (for SQS FIFO, within its 5-minute deduplication window). Across boundaries (Kafka → Postgres write), you still need idempotency keys.
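A minimal sketch of an idempotent consumer built on idempotency keys. The `IdempotentHandler` class and the `idempotency_key` field name are assumptions; in production the set of seen keys would live in a durable store, ideally updated in the same transaction as the side effect.

```python
class IdempotentHandler:
    """Apply each message at most once, keyed by its idempotency key."""

    def __init__(self, apply):
        self.apply = apply
        self.seen = set()  # stand-in for a durable table of processed keys

    def handle(self, message) -> bool:
        key = message["idempotency_key"]
        if key in self.seen:
            return False  # duplicate delivery: skip the side effect
        self.apply(message)
        self.seen.add(key)  # record the key only after the side effect succeeds
        return True


balance = {"total": 0}


def credit(message):
    balance["total"] += message["amount"]


handler = IdempotentHandler(credit)
message = {"idempotency_key": "payment-123", "amount": 25}

first = handler.handle(message)
second = handler.handle(message)  # simulated redelivery of the same message
```

The redelivered message is acknowledged without crediting the account a second time.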

Dead-letter queues

Every queue, every asynchronous Lambda invocation, every consumer should have a DLQ. After N retries, the poison pill goes to the DLQ, where you can inspect it, reproduce the issue, fix the bug, and replay if needed. DLQs have prevented more outages than they've caused; they're the cheapest insurance in messaging.

Retries and backoff

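The standard shape: exponential backoff with jitter, capped at a sensible max delay (often 60s), max retries 5–10. SQS (redrive policy with maxReceiveCount), Lambda (retry attempts and maximum event age for asynchronous invocations), and EventBridge (per-target retry policy) all expose retry settings. Use them; don't reimplement in user code.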
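For intuition about what such a policy produces, here is a sketch of the delay schedule for exponential backoff with "full jitter" (a uniformly random delay between zero and the capped exponential ceiling). It is meant to show the shape, not to replace the built-in policies; `backoff_delays` and its defaults are assumptions.

```python
import random


def backoff_delays(max_retries=5, base=1.0, cap=60.0, rng=None):
    """Return one randomized delay (in seconds) per retry attempt."""
    rng = rng or random.Random()
    delays = []
    for attempt in range(max_retries):
        ceiling = min(cap, base * 2 ** attempt)  # 1, 2, 4, ... capped at `cap`
        delays.append(rng.uniform(0, ceiling))   # full jitter: anywhere below it
    return delays


delays = backoff_delays(max_retries=8, rng=random.Random(0))
```

The jitter matters because it spreads retries from many failing clients over time instead of having them all retry at the same instants.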

Common bugs

Using SQS standard when you need ordering. Standard is best-effort; you can get duplicates and out-of-order delivery. Use FIFO with MessageGroupId if you genuinely need per-key ordering.

Forgetting to delete the message. SQS doesn't auto-ack. If your consumer hangs, the message reappears. Your code must DeleteMessage after successful processing — and only after.

Visibility timeout too short. If processing takes 60s and visibility is 30s, the message reappears mid-processing and gets picked up by another worker. Set visibility >> processing time, or extend it dynamically with ChangeMessageVisibility.

Fanning out via N SNS topics instead of one topic with N filter policies. N topics = N permission policies = pain. One topic with filter policies (or EventBridge with rules) scales much better.

Choosing Kafka because it's cool. Kafka is operationally heavy. If your team isn't ready to operate brokers, ZooKeeper / KRaft, and partition rebalancing, MSK helps but doesn't eliminate the work. SQS+SNS handles the large majority of "we need messaging" cases at a fraction of the operational cost.
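The "delete only after successful processing" rule from the bugs above can be sketched as a consumer loop body. The `receive_message` and `delete_message` calls use the same keyword arguments as the boto3 SQS client, but `FakeSQS` is a stand-in written for this example so the sketch runs without AWS; `process_batch` and the queue URL are likewise illustrative.

```python
def process_batch(sqs, queue_url, handler):
    """Receive one batch; delete each message only after its handler succeeds."""
    response = sqs.receive_message(
        QueueUrl=queue_url, MaxNumberOfMessages=10, WaitTimeSeconds=20
    )
    deleted = 0
    for message in response.get("Messages", []):
        try:
            handler(message["Body"])
        except Exception:
            continue  # no delete: the message reappears after the visibility timeout
        sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=message["ReceiptHandle"])
        deleted += 1
    return deleted


class FakeSQS:
    """Stand-in for an SQS client, for illustration only."""

    def __init__(self, bodies):
        self.pending = {f"rh-{i}": body for i, body in enumerate(bodies)}

    def receive_message(self, QueueUrl, MaxNumberOfMessages=1, WaitTimeSeconds=0):
        items = list(self.pending.items())[:MaxNumberOfMessages]
        if not items:
            return {}  # an empty receive has no "Messages" key
        return {"Messages": [{"ReceiptHandle": rh, "Body": b} for rh, b in items]}

    def delete_message(self, QueueUrl, ReceiptHandle):
        self.pending.pop(ReceiptHandle, None)


def handler(body):
    if body == "bad":
        raise ValueError("cannot process")


fake = FakeSQS(["ok-1", "bad", "ok-2"])
deleted = process_batch(fake, "https://example.invalid/queue", handler)
```

The failing message stays in the queue for a later retry (and eventually the DLQ), while the two successful ones are removed.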

Tools in the wild

Amazon SQS: Managed message queue (standard + FIFO). Default for work distribution.
Amazon SNS: Pub/sub topics; fan out to SQS, Lambda, HTTPS, email, SMS.
Amazon EventBridge: Event bus with rules, schemas, and SaaS-partner integrations.
Amazon Kinesis Data Streams: Ordered, replayable log; per-shard ordering; consumer-managed offsets.
Amazon MSK: Managed Kafka. Your existing Kafka producers/consumers keep working; AWS runs the brokers.
AWS Step Functions: Coordinator that orchestrates events into workflows with retry/timeout per state.