hashing · level 5

Merkle Trees

Prove one leaf in O(log N) without shipping the whole set.

200 XP

Merkle Trees

A Merkle tree (invented by Ralph Merkle, 1979) is a binary tree whose every internal node is the hash of its two children. The top of the tree is a single hash — the Merkle root — that fingerprints the entire set of leaves.

Change any single leaf and the root changes. That's the whole point.

Analogy

Imagine a tournament bracket for a million-player chess competition. Each match produces a winner who advances; pairs of winners become the next round's matches; the whole tournament funnels down to a single champion at the top. To prove to a friend that you played in the tournament, you don't need to hand over the names of all million players — you just need the short chain of opponents you faced on your way up the bracket, all the way to the champion. A Merkle tree is exactly that bracket: the champion is the root, the original data are the first-round players, and any one player's participation can be verified by a tiny sibling-chain proof instead of shipping the whole draw.

              ROOT
             /    \
           ab      cd
          /  \    /  \
        H(a) H(b) H(c) H(d)
         |    |    |    |
         a    b    c    d          ← leaves (original data, hashed)

The superpower: O(log N) membership proofs

To prove that leaf b is in a tree of a million items, you don't ship all million. You ship the sibling chain — the sibling at each level from b up to the root.

For b in the tree above you'd ship: H(a), then cd. The verifier hashes:

step 1:  H( H(a) || H(b) ) = ab
step 2:  H( ab || cd ) = root

If the computed root matches the one they trust, b is in the set. Proof size: log₂(N) hashes. For a million leaves that's 20 SHA-256 digests — 640 bytes.

Where you'll see it

Git. Every commit points to a tree object; trees point to blobs. The commit SHA covers everything. Two repos with the same commit SHA have byte-identical contents.

Bitcoin. Each block header commits to a Merkle root of its transactions. Light clients (SPV) can verify that a transaction is in a block without downloading the block — they ask a full node for the Merkle path.

Certificate Transparency. Every TLS certificate issued by a public CA is logged into a Merkle-tree log. Browsers demand SCTs (signed certificate timestamps) proving a cert was logged. RFC 9162 is the spec.

IPFS, Docker layers, ZFS snapshots, Cassandra anti-entropy, AWS DynamoDB streams. Anywhere you need to prove "this blob is part of this snapshot" cheaply.

Building one

function build(leaves, H) {
  let level = leaves.map(H);           // hash every leaf
  const tree = [level];
  while (level.length > 1) {
    if (level.length % 2) level.push(level.at(-1));  // pad odd count
    const next = [];
    for (let i = 0; i < level.length; i += 2) {
      next.push(H(level[i] + level[i + 1]));
    }
    tree.push(next);
    level = next;
  }
  return { root: level[0], tree };
}

A few subtleties in production trees:

Odd-count padding. Bitcoin duplicates the last hash. Others (like Certificate Transparency) carry it up unduplicated. The choice matters for the proof format but not for security.
Domain separation. Good implementations tag leaves and internal nodes differently (e.g. prefix a leaf with 0x00, an internal node with 0x01) to prevent a "second-preimage via leaf/internal collision" attack.
Hash function choice. SHA-256, SHA-512, BLAKE3. Anything collision-resistant works.

Why it works

The chain of hashes is an inductive argument: if the root is unchanged, the leaf I provided must be the one that produced the root. The adversary would need a collision at some internal node to forge a proof — which is exactly the property the underlying hash is supposed to provide.

So: a Merkle tree is how you stretch the security of a single hash over an arbitrary-sized set, with cheap verification.