system-design · level 4

Consistency Models

Strong, eventual, causal, read-your-writes — and what CAP/PACELC actually say.

150 XP

Consistency Models

The moment you replicate data across more than one machine, "what does this read return?" stops having an obvious answer. Consistency models are the specifications that turn that question into something you can reason about — and pay for.

The CAP theorem (and what it does not say)

CAP states: in a distributed system, when a network partition happens, you must choose between consistency and availability. You cannot have both during the partition.

Things people commonly get wrong about CAP:

The C in CAP is linearizability, not the C in ACID. ACID's C is "the database enforces invariants" (foreign keys, check constraints). CAP's C is "every replica looks like a single up-to-date copy of the data". Different concepts.
CAP is only about partitions. The "trilemma" framing is misleading — partitions are real but rare. What you actually face every day is the latency-vs-consistency trade-off.
CP doesn't mean "always consistent". It means "during a partition, prefer consistency over availability". The system is still highly available outside partitions.

PACELC — the more useful framing

Daniel Abadi pointed out that CAP only describes the partition case. Most of the time there is no partition, and you still face a choice:

If a Partition occurs, choose between Availability and Consistency. Else (no partition), choose between Latency and Consistency.

Strong consistency requires multiple replicas to agree before returning. That's slow even when nothing is broken — typical strong-read latency on a 3-replica system is 5–20× a local-read latency. PACELC says: even on a perfectly healthy day, you pay for consistency in latency.

Model	P case	non-P case
DynamoDB (default)	A	L
DynamoDB (strong reads)	C	C
Cassandra (QUORUM)	C	C
Spanner	C	C
MongoDB (primary read)	C	C

The four consistency models you must know

Strong consistency (linearizability)

Every read sees the most-recent committed write, period. The system behaves as if there's a single copy of the data.

t0: client A writes x=5
t1: client B reads x → returns 5 (always)

How it's implemented: synchronous replication (write to majority before ack), or single-leader writes with read-from-leader. Either way, you're paying for cross-replica round trips on every operation.

When to use: bank balances during a transaction, leader-election state, distributed locks, anything where stale = catastrophic.

Eventual consistency

If no new writes happen, all replicas eventually converge to the same value. "Eventually" is unbounded — could be milliseconds, could be minutes during a partition.

t0: leader writes x=5; replicates to followers async
t1: read from follower-1 → still returns x=4 (stale)
t2: replication catches up
t3: read from follower-1 → now returns x=5

How it's implemented: async replication, gossip, anti-entropy. The fastest possible model.

When to use: feed counts, like counts, last-active timestamps, search indexes, anything where 'stale by 1 second' is invisible.

Read-your-writes (a.k.a. read-after-write, session consistency)

Within a single session, you always see your own writes. Other sessions may see stale data.

client A writes x=5
client A reads x → returns 5 (guaranteed)
client B reads x → may still see x=4

How it's implemented:

Sticky reads — route a session to whichever replica it wrote to.
Versioned reads — tag every write with a logical clock; reads from the same session must catch up to that clock before returning.
Read-from-leader for the writer — your reads bypass the read replicas only if you've written recently.

When to use: user-facing apps where 'I changed my profile and it didn't update' is the worst possible bug. Display names, comment threads, settings pages.

Causal consistency

If write A happens-before write B (B was caused by A), then any reader sees them in that order. Concurrent writes can be observed in any order.

A writes x=5  →  B reads x=5, then writes y=10
A reader who sees y=10 must also see x=5.
A reader who sees x=4 may or may not see y=10 (but if they do, it's a bug).

How it's implemented: vector clocks, version vectors. More complex than eventual, less expensive than strong. Used in chat apps where "Bob can't see Alice's reply before Alice's question."

Quorum reads & writes

In an N-replica system, a quorum is a subset of replicas that any operation must contact.

W = writes must hit W replicas before being ack'd
R = reads must consult R replicas and pick the latest

The magic invariant: R + W > N. If reads and writes always overlap on at least one replica, that replica has the latest write — strict consistency without strong-leader bottlenecks.

N	W	R	Behaviour
3	1	1	Eventual; fast everywhere
3	2	2	Quorum (R+W=4>3); strict reads
3	3	1	Strict reads, slow writes
3	1	3	Strict reads, slow reads, fast writes

Cassandra and DynamoDB both expose R and W as knobs per query. You spend latency to buy consistency, on a per-operation basis.

Linearizability vs serializability

Often confused.

Linearizability is about single-object operations. Every read sees the result of the most-recent write that returned. Single-object atomic broadcast.
Serializability is about transactions. Concurrent transactions produce a result equivalent to some serial order.

You can have one without the other:

A DB that's serializable but not linearizable: read-only transactions might see a slightly-old snapshot, but transaction isolation is still strict (Postgres in some configurations).
A K/V that's linearizable but not serializable: each GET/PUT is atomic, but you can't span a multi-key transaction.

Strict serializability = both. Spanner gives you this globally. Most other systems give you one or the other.

The stale-read problem

Every async-replication architecture has it. Leader writes x=5; follower-1 hasn't applied it yet; client reads from follower-1 and sees x=4. Bug? Or feature?

Tolerable when:

The data is read-mostly and a few seconds of staleness is invisible (display names, like counts).
The user isn't the writer (someone else's profile update).

Catastrophic when:

Read-after-write: user changes setting, page reloads from stale follower, change appears reverted.
Cross-key invariants: transactions table updated, account_balance row read from stale replica, balance looks wrong.
Auth flows: token rotated on leader, follower still serves the old token.

Mitigations: read-from-leader for sensitive paths, sticky-session routing, version-aware reads, or just use a consistent model for that specific operation (DynamoDB's ConsistentRead: true).

Picking a model

A reasonable default progression:

Eventual + read-from-leader for the writer's session. This is read-your-writes, the right default for ~90% of user-facing reads.
Strong (linearizable) reads for write-after-read paths that span sessions: leader-election, distributed locks, balance checks.
Quorum reads when you need strictness without leader-only bottleneck (Cassandra, DynamoDB).
Causal when you have explicit happens-before relationships (chat apps, collaborative docs).

Everything else (ZooKeeper-style sequential consistency, snapshot isolation, monotonic reads) is a refinement of these.

Real-world picks

Spanner / CockroachDB — external consistency / serializable. Default to strict; you pay for it in latency on writes.
DynamoDB — eventual by default; flip a per-operation flag for strict reads (2× cost).
Cassandra / ScyllaDB — tunable consistency per query (ONE, QUORUM, ALL).
MongoDB — primary reads strict; secondary reads eventual.
Postgres + read replicas — leader is strict; replicas are eventually consistent. Many teams forget this; many bugs follow.

The interview answer

When someone asks "should we use eventual or strong consistency for X?", the framework that almost never fails:

What breaks if a read returns stale data? Define "breaks" concretely.
How long can stale persist? Milliseconds? Seconds? Minutes?
What's the read-volume vs write-volume? Strong consistency taxes reads; eventual taxes you when stale data leaks out.
What's the latency budget? Strong consistency costs 5–20ms minimum, often more.

Then: pick the loosest model that survives those answers. Stronger is always available; you upgrade individual operations rather than the whole system.

The most expensive mistake is the inverse — defaulting to strong, suffering the latency, then weakening it under load when the system is already shipping.

Tools in the wild

4 tools

DynamoDB
AWS managed K/V. Eventual reads default; strong reads on request (2× cost).
service
Spanner
Google's globally-distributed SQL DB. External consistency via TrueTime.
service
CockroachDBfree tier
Open-source distributed SQL. Serializable isolation by default.
service
Cassandrafree tier
Tunable consistency per query (ONE / QUORUM / ALL). The original AP-with-knobs DB.
service