system-design · level 8

Replication & Consensus

Single-leader, multi-leader, leaderless — and Raft holding it all together.

150 XP

Replication & Consensus

Replication is how a system survives node death without losing data, and how it serves more reads than a single machine can handle. Consensus is the math you bring in when multiple nodes have to agree on a single answer despite the network lying to them.

Together these are the foundation that lets you say "the database can't go down" with a straight face. The trade-offs between latency, consistency, and availability are real and ugly, and every production system has chosen — explicitly or by accident — where on that triangle it sits.

Why replicate

Three reasons, in order of importance:

  1. Durability — a single disk dies. With replicas, a node death is a non-event.
  2. Availability — read traffic can be served from any replica, so total read throughput scales horizontally.
  3. Latency — a replica close to the client serves reads faster than a round trip to a distant primary.

What replication does NOT give you for free: write scaling. Almost all replication topologies serialise writes through a single point (the leader). To scale writes you need sharding, which is a separate axis.

Three topologies

Single-leader (a.k.a. leader-follower, primary-replica)

One node is the leader. All writes go to the leader; the leader streams its write-ahead log to followers. Reads can be served from the leader (always fresh) or from followers (eventually consistent).

        ┌── follower 1 (read-only)
write → leader ──┤
        │   └── follower 2 (read-only)
        └── follower 3 (read-only)

Pros: dead simple to reason about. There is exactly one source of truth at any moment. Every mainstream relational DB ships this out of the box: Postgres streaming replication, MySQL primary-replica, SQL Server AlwaysOn.

Cons: the leader is a single point of failure for writes. Failover (promoting a follower) takes time and risks data loss if replication was async.

Sync vs async: under sync replication the leader waits for at least one follower to acknowledge before returning success — no write-loss on leader crash, but every write pays the network round trip. Under async, the leader returns immediately — fast, but a leader crash loses unflushed writes.

Multi-leader (a.k.a. active-active, multi-master)

Two or more nodes can each accept writes. Writes propagate between the leaders bi-directionally.

       ┌──────write──────┐
       │                 │
DC-A ◀═┴═══════════════════════ DC-B
       │                 │
       └──────write──────┘

Pros: writes succeed even when a region is offline. Lower write latency for users near each leader.

Cons: conflicts. The same row updated in DC-A and DC-B at the same instant has two competing values. You need a resolution strategy:

  • Last-write-wins (LWW) — simplest, loses data silently when the wall clock disagrees.
  • Vector clocks / version vectors — preserves both values; the application resolves.
  • CRDTs — special data types (counters, sets, maps) that converge automatically. Writes are commutative.

Multi-leader is the right answer for genuinely multi-region writes (collaboration tools, offline-first mobile sync). It is the wrong answer when you can get away with single-leader, because conflict resolution is always a source of bugs.

Leaderless (Dynamo-style)

There is no leader. Every node accepts every operation. Each write is sent to W of N replicas; each read is sent to R of N. As long as W + R > N, every read overlaps at least one replica that saw the latest write.

                ┌── node 1 (write to 2, read from 2)
client ─writes──┼── node 2
                └── node 3

Pros: no failover dance — the loss of one node is invisible. Tunable consistency per query.

Cons: every read may need to do read repair (notice that one replica is behind, send it the missing write). Conflicts can still happen; same resolution strategies as multi-leader. Application code is more complex (vector clocks bubble up).

Quorum math: with N=3, common configurations are W=2/R=2 (one node down still works), W=3/R=1 (all writes durable, fast reads), W=1/R=3 (fast writes, slow reads). The choice encodes your write-vs-read latency preference.

Cassandra, Riak, ScyllaDB, DynamoDB all use this pattern.

Split-brain

The classic failure mode of any multi-leader (or leader-with-failover) topology: a network partition isolates two leaders from each other, both believing they are the legitimate primary. Both accept writes. When the partition heals, you have two divergent histories that need merging.

The defence is never let two leaders coexist — which means failover requires a quorum-aware decision (a witness, an external arbiter, or a consensus protocol like Raft). Manual failover by a human operator is the slowest defence; consensus-driven failover is the strongest.

Raft (and Paxos, and ZAB)

When you want automatic, safe leader election and replicated log, you use a consensus algorithm. Raft is the de-facto standard because it was designed to be understandable; Paxos is older and more general; ZAB is what ZooKeeper uses.

Raft in 60 seconds:

  1. Each node has a role (leader, follower, candidate) and a term (a monotonically increasing election number).
  2. A follower that hears no heartbeat from a leader for a randomised timeout becomes a candidate, increments its term, and asks all peers for votes.
  3. A peer votes yes if it hasn't already voted in this term and the candidate's log is at least as up-to-date as its own.
  4. A candidate that wins a majority becomes leader for that term. It immediately sends heartbeats to suppress further elections.
  5. The leader appends entries to its log, replicates to followers, and commits an entry only when a majority has acknowledged it.

Three safety guarantees fall out:

  • Election safety — at most one leader per term.
  • Log matching — if two logs contain an entry at the same index with the same term, all preceding entries are identical.
  • Leader completeness — a committed entry survives all future terms.

In practice you run etcd, Consul, hashicorp/raft, or rqlite — never write your own. Read the Raft paper if you want to understand; use a battle-tested implementation if you want to ship.

Failover strategies

When the leader dies, who decides who's next?

  • Manual — a human operator runs the promote command. Slow (minutes), safe (no split-brain).
  • Witness-based — a third node arbitrates between two replicas. Fast (seconds), risk of witness flapping.
  • Consensus-driven (Raft / Paxos) — the cluster elects a new leader by majority vote. Fast, safest, requires odd number of voters.

The trade-off is how quickly do you need to recover vs how strict is your safety budget. Banks pick manual; most modern infra picks Raft.

Sync vs async — the data-loss window

This is the under-appreciated detail. Async replication means on leader crash, you lose every write that hadn't replicated yet. The window depends on your replication lag — usually milliseconds, sometimes seconds, occasionally minutes when network gets weird.

For a payments system, milliseconds of lost writes is unacceptable: use sync replication or a consensus protocol that requires majority ack.

For a social-feed service, seconds of lost likes is fine: use async, take the latency win.

Be deliberate. Most teams pick async without realising they've picked it (it's the default), then are surprised when failover loses the last 30s of orders.

CAP and the real question

Brewer's CAP theorem says: in a partition, you must choose between consistency and availability. In practice every real database has chosen one already:

  • CP — refuses writes during partition (etcd, Spanner, ZooKeeper). Strong reads, occasional unavailability.
  • AP — accepts writes on both sides (Cassandra, Dynamo). Always available, eventual consistency.

The interesting question is not "CP or AP" but "how do you handle the partition?" PACELC (Daniel Abadi) extends CAP: when there's no partition, do you favour latency (L) or consistency (C)? That's where the real engineering happens.

What can go wrong

Replica lag spikes. A follower falls behind. Reads from it return stale data. Monitor pg_stat_replication.lag, alert above threshold.

Asymmetric failover. The leader dies, a follower is promoted, the old leader comes back and accepts writes (split-brain). Use STONITH ("shoot the other node in the head") fencing to guarantee the old leader is dead before promoting.

Quorum loss. A 3-node Raft cluster loses 2 nodes. The remaining node refuses writes (no majority). To safely recover you need to manually reconfigure. 5-node clusters tolerate 2 losses.

Replication topology drift. A 'temporary' second leader becomes permanent. Now you've reinvented multi-leader, with none of the safeguards. Audit topology in production regularly.

Cross-region async lag. Your writes look fast in your local DC; cross-region reads are seconds behind. Surface this to clients (Cache-Control: max-age) or route reads to local followers.

Picking a topology

A reasonable progression:

  1. Single-leader async — the boring default. PostgreSQL, MySQL, SQL Server. Works for 95% of services.
  2. Single-leader sync when you cannot lose writes (payments, audit trails). Pay the latency cost.
  3. Raft-backed (CockroachDB, etcd, FoundationDB) when you need automatic failover with strict safety.
  4. Leaderless when you need write availability across regions and can swallow eventual consistency (Cassandra, DynamoDB).
  5. Multi-leader only when leaderless doesn't fit (e.g. two existing single-leader clusters that need bi-directional sync).

Choosing topology before you have evidence is premature optimisation. Start single-leader; measure; evolve.

Common tools in production

  • PostgreSQL streaming replication — the reference single-leader. Sync, async, or quorum.
  • MySQL Group Replication / Galera — multi-leader for MySQL with conflict detection.
  • etcd / Consul — Raft-backed KV. The trust anchors of cloud infra.
  • CockroachDB / TiDB / YugabyteDB — distributed SQL on per-range Raft.
  • Cassandra / ScyllaDB — Dynamo-style leaderless KV.

Diagram conventions

Three little diagrams worth memorising:

single-leader:        multi-leader:           leaderless:
client ─▶ L           A ◀══════▶ B            client ─▶ N1
         ├▶ F1                                       ├▶ N2  (W=2)
         └▶ F2                                       └▶ N3
                                              client ─◀ N1
                                                       N2  (R=2)
                                                       N3

In Mermaid, the single-leader version:

flowchart LR
    C[client] -->|writes + reads| L[Leader]
    C -.->|reads only| F1[Follower 1]
    C -.->|reads only| F2[Follower 2]
    L --> F1
    L --> F2

That's the diagram people draw on the whiteboard 80% of the time when "what database do you use?" comes up.

Tools in the wild

4 tools
  • Reference single-leader replication. WAL streamed to followers; sync or async.

    service
  • etcdfree tier

    Raft-backed distributed KV store. The brain of Kubernetes.

    service
  • Cassandrafree tier

    Dynamo-style leaderless replication. Tunable consistency per query (ONE/QUORUM/ALL).

    service
  • CockroachDBfree tier

    Per-range Raft groups; SQL semantics on top of distributed consensus.

    service