hashing · level 5

Merkle Trees

Prove one leaf in O(log N) without shipping the whole set.

200 XP

Merkle Trees

A Merkle tree (invented by Ralph Merkle, 1979) is a binary tree whose every internal node is the hash of its two children. The top of the tree is a single hash — the Merkle root — that fingerprints the entire set of leaves.

Change any single leaf and the root changes. That's the whole point.

Analogy

Imagine a tournament bracket for a million-player chess competition. Each match produces a winner who advances; pairs of winners become the next round's matches; the whole tournament funnels down to a single champion at the top. To prove to a friend that you played in the tournament, you don't need to hand over the names of all million players — you just need the short chain of opponents you faced on your way up the bracket, all the way to the champion. A Merkle tree is exactly that bracket: the champion is the root, the original data are the first-round players, and any one player's participation can be verified by a tiny sibling-chain proof instead of shipping the whole draw.

              ROOT
             /    \
           ab      cd
          /  \    /  \
        H(a) H(b) H(c) H(d)
         |    |    |    |
         a    b    c    d          ← leaves (original data, hashed)

The superpower: O(log N) membership proofs

To prove that leaf b is in a tree of a million items, you don't ship all million. You ship the sibling chain — the sibling at each level from b up to the root.

For b in the tree above you'd ship: H(a), then cd. The verifier hashes:

step 1:  H( H(a) || H(b) ) = ab
step 2:  H( ab || cd ) = root

If the computed root matches the one they trust, b is in the set. Proof size: log₂(N) hashes. For a million leaves that's 20 SHA-256 digests — 640 bytes.

Where you'll see it

Git. Every commit points to a tree object; trees point to blobs. The commit SHA covers everything. Two repos with the same commit SHA have byte-identical contents.

Bitcoin. Each block header commits to a Merkle root of its transactions. Light clients (SPV) can verify that a transaction is in a block without downloading the block — they ask a full node for the Merkle path.

Certificate Transparency. Every TLS certificate issued by a public CA is logged into a Merkle-tree log. Browsers demand SCTs (signed certificate timestamps) proving a cert was logged. RFC 9162 is the spec.

IPFS, Docker layers, ZFS snapshots, Cassandra anti-entropy, AWS DynamoDB streams. Anywhere you need to prove "this blob is part of this snapshot" cheaply.

Building one

function build(leaves, H) {
  let level = leaves.map(H);           // hash every leaf
  const tree = [level];
  while (level.length > 1) {
    if (level.length % 2) level.push(level.at(-1));  // pad odd count
    const next = [];
    for (let i = 0; i < level.length; i += 2) {
      next.push(H(level[i] + level[i + 1]));
    }
    tree.push(next);
    level = next;
  }
  return { root: level[0], tree };
}

A few subtleties in production trees:

  • Odd-count padding. Bitcoin duplicates the last hash. Others (like Certificate Transparency) carry it up unduplicated. The choice matters for the proof format but not for security.
  • Domain separation. Good implementations tag leaves and internal nodes differently (e.g. prefix a leaf with 0x00, an internal node with 0x01) to prevent a "second-preimage via leaf/internal collision" attack.
  • Hash function choice. SHA-256, SHA-512, BLAKE3. Anything collision-resistant works.

Why it works

The chain of hashes is an inductive argument: if the root is unchanged, the leaf I provided must be the one that produced the root. The adversary would need a collision at some internal node to forge a proof — which is exactly the property the underlying hash is supposed to provide.

So: a Merkle tree is how you stretch the security of a single hash over an arbitrary-sized set, with cheap verification.