
Content Addressing

Git, IPFS, and the 'address by content not name' mental model.

Most storage systems answer the question "where is X?". Content-addressed storage flips it: the address IS the content. You don't choose where a file lives — you ask a hash function and the answer is its forever-name.

This is the mental model behind Git, IPFS, content-defined chunking, deduplicating backups, BitTorrent's info-hash, every modern container registry, and most peer-to-peer systems. Once you see it, it stops being surprising that systems applying the same hash function to the same bytes converge on the same identifier: that is a feature, not a coincidence.

The shift

Traditional storage:

"please give me the file at /home/sam/notes.txt"
              ↓
       filesystem lookup
              ↓
        bytes returned

The bytes can change without warning; the path is what's stable.

Content-addressed storage:

"please give me the bytes whose SHA-256 is 2cf24dba…3043…9824"
              ↓
       any node, any cache
              ↓
        bytes returned (and verifiable)

The address is derived from the bytes, not assigned. If anyone returns different bytes, you instantly notice — they won't hash to the address you asked for.
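The flow above can be sketched as a toy in-memory store. This is only an illustration: the ContentStore class and its put/get methods are hypothetical names, and a real system would persist to disk or a network.

```python
import hashlib


class ContentStore:
    """Toy in-memory content-addressed store (illustrative only)."""

    def __init__(self):
        self._blobs = {}  # hex SHA-256 digest -> bytes

    def put(self, data: bytes) -> str:
        # The address is computed from the bytes; the caller never picks it.
        address = hashlib.sha256(data).hexdigest()
        self._blobs[address] = data
        return address

    def get(self, address: str) -> bytes:
        data = self._blobs[address]  # raises KeyError if the address is unknown
        # Re-hash on the way out so corrupted or swapped bytes are rejected.
        if hashlib.sha256(data).hexdigest() != address:
            raise ValueError("stored bytes do not match their address")
        return data


store = ContentStore()
addr = store.put(b"hello")
print(addr)             # 64 hex characters derived from the bytes
print(store.get(addr))  # b'hello'
```

Note that put takes no name or path argument: there is nothing for the caller to choose.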

Three superpowers, one rule

Once your address is the hash of the content, three properties fall out for free:

  1. Deduplication. Two files with identical bytes have identical addresses; the storage layer can keep one copy. Restic and Borg backups exploit this within a single repo (and across machines if you set up a shared bucket).
  2. Tamper-evidence. Any byte change yields a different address. You can't silently mutate stored content; consumers fetching by the old address get the old bytes (or 404, but never silently-modified bytes).
  3. Inherent integrity. A consumer doesn't have to trust the server. They re-hash what they got and compare to the requested address. Untrusted CDNs become safe caches.
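A minimal sketch of the three properties, assuming SHA-256 as the hash function. The helper names address_of and verify are made up for this example.

```python
import hashlib


def address_of(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def verify(address: str, data: bytes) -> bool:
    """True if `data` really is the content named by `address`."""
    return address_of(data) == address


# 1. Deduplication: identical bytes collapse to one entry.
store = {}
for payload in (b"same bytes", b"same bytes", b"other bytes"):
    store[address_of(payload)] = payload
print(len(store))  # 2 entries for 3 writes

# 2. Tamper-evidence: a one-byte change produces a different address.
original = b"pay alice 10"
tampered = b"pay alice 90"
print(address_of(original) != address_of(tampered))  # True

# 3. Integrity: bytes from an untrusted cache are checked locally.
wanted = address_of(original)
print(verify(wanted, original))  # True
print(verify(wanted, tampered))  # False
```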

The single rule that makes all of this work: the hash function must be collision-resistant. If an attacker can craft A and B with H(A) = H(B), the whole edifice collapses. This is why Git is migrating from SHA-1 (which has known collisions, even if not yet exploitable in practice for real Git objects) to SHA-256.

Git — the canonical example

Every Git object — blob (file content), tree (directory listing), commit (snapshot + parent), tag (named pointer) — is named by the SHA-1 hash of its serialised form.

blob:    sha1("blob <length>\0<content>")
tree:    sha1("tree <length>\0<entries>")
commit:  sha1("commit <length>\0<header>\n<message>")

Try it yourself:

$ echo -n "hello" | git hash-object --stdin
b6fc4c620b67d95f953a5c1c1230aaab5db5a1b0

$ printf "blob 5\0hello" | sha1sum
b6fc4c620b67d95f953a5c1c1230aaab5db5a1b0  -

Same hash. Git's "object database" (.git/objects/) is a key-value store keyed on those hashes. A commit's parent is just another commit hash. The whole DAG of history is built from references between content-addresses.
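The same calculation can be reproduced outside Git. The sketch below covers only the blob case from the formulas above; git_blob_hash is a name chosen for this example, not a Git API.

```python
import hashlib


def git_blob_hash(content: bytes) -> str:
    # Header: the word "blob", a space, the decimal byte length, a NUL byte.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


print(git_blob_hash(b"hello"))
# b6fc4c620b67d95f953a5c1c1230aaab5db5a1b0
```

The length prefix means a blob's hash differs from a plain SHA-1 of the same bytes, which is why the sha1sum command above has to include the header.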

This means:

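  • Identical content yields identical blob and tree hashes in any repo. Commit hashes also cover the author, committer, timestamps, message, and parent hashes, so two commits match only when all of those match too. Forks share storage if you've configured them to.
  • Fetching is incremental: you only download objects you don't already have.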
  • Tampering with a single byte of a file in any past commit changes that file's hash, which changes the tree hash, which changes the commit hash, which changes every descendant commit hash. Detected immediately.
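The propagation in the last point can be sketched with a simplified hash chain. The tree and commit bodies here are stand-ins invented for the example, not Git's real serialisation, and the snapshot and history helpers are hypothetical.

```python
import hashlib


def h(kind: str, body: bytes) -> str:
    # Git-style "<kind> <length>\0<body>" framing over simplified bodies.
    header = f"{kind} {len(body)}".encode() + b"\0"
    return hashlib.sha1(header + body).hexdigest()


def snapshot(files: dict, parent: str, message: str) -> str:
    # Tree body: one "<name> <blob hash>" line per file, sorted by name.
    entries = [f"{name} {h('blob', data)}" for name, data in sorted(files.items())]
    tree = h("tree", "\n".join(entries).encode())
    # Commit body: refers to the tree and the parent commit by hash.
    return h("commit", f"tree {tree}\nparent {parent}\n\n{message}".encode())


def history(first_file: bytes) -> list:
    c1 = snapshot({"notes.txt": first_file}, "", "first")
    c2 = snapshot({"notes.txt": first_file, "todo.txt": b"ship it"}, c1, "second")
    # The third commit no longer contains first_file at all.
    c3 = snapshot({"notes.txt": b"rewritten", "todo.txt": b"ship it"}, c2, "third")
    return [c1, c2, c3]


honest = history(b"hello")
forged = history(b"hellp")  # one byte changed in the oldest commit
print(all(a != b for a, b in zip(honest, forged)))  # True
```

The third commit holds identical files in both histories, yet its hash still differs because it names its parent by hash.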

IPFS — content addressing, distributed

IPFS extends the idea to a distributed network. Every file has a CID (Content Identifier) — a self-describing label that includes the hash, the hash algorithm, the codec for the bytes, and a multibase prefix:

CIDv1 in base32:
bafy bei gd irzx5n bbcyx 4kqzcqyc 5buy yj7akrhubs7uquvxn7gzd7s i

| header | hash bytes (sha2-256, 32 bytes) |

Decoded:

Part                          Meaning
b (first character of bafy…)  multibase prefix: base32
0x01                          CID version 1
0x70                          codec: dag-pb (IPFS legacy structure)
0x12 0x20                     multihash: sha2-256, 32-byte digest
32 bytes                      the digest

The reason for the wrapper: algorithm agility. CIDv1 can carry any registered hash function (sha2-256, blake2b, blake3, ...), any codec (dag-pb, dag-cbor, raw bytes), and any base encoding. The same identifier shape covers a dag-pb object hashed with SHA-256 and a raw block hashed with BLAKE3.
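The byte layout in the table can be sketched as an encoder and decoder. This is a simplified illustration, not a full CID library: make_cid and parse_cid are hypothetical names, only base32 is handled, and the header fields are treated as single bytes (they are varints in general, but every value used here is below 0x80). The digest is an arbitrary stand-in, not the hash IPFS would compute for a real file.

```python
import base64
import hashlib

CIDV1 = 0x01
DAG_PB = 0x70
SHA2_256 = 0x12


def make_cid(digest: bytes, codec: int = DAG_PB) -> str:
    # version, codec, hash function code, digest length, then the digest
    raw = bytes([CIDV1, codec, SHA2_256, len(digest)]) + digest
    b32 = base64.b32encode(raw).decode("ascii").lower().rstrip("=")
    return "b" + b32  # "b" is the multibase prefix for lowercase base32


def parse_cid(cid: str) -> dict:
    if not cid.startswith("b"):
        raise ValueError("this sketch only handles base32 CIDs")
    body = cid[1:].upper()
    body += "=" * (-len(body) % 8)  # restore the padding b32decode expects
    raw = base64.b32decode(body)
    length = raw[3]
    digest = raw[4:4 + length]
    if len(digest) != length:
        raise ValueError("truncated digest")
    return {"version": raw[0], "codec": raw[1], "hash": raw[2], "digest": digest}


digest = hashlib.sha256(b"stand-in for an encoded dag-pb node").digest()
cid = make_cid(digest)
print(cid[:7])  # bafybei
print(parse_cid(cid)["digest"] == digest)  # True
```

The familiar bafybei prefix is just the base32 rendering of the fixed header bytes 0x01 0x70 0x12 0x20.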

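Practically: ipfs add file.pdf returns a CID. Anyone, anywhere on the IPFS network, can retrieve those bytes by CID and verify them locally, because the CID embeds the hash. (For files large enough to be chunked, that hash covers the root node of a Merkle DAG of chunks rather than the raw file bytes.)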

Where else you'll see it

  • Container registries. Image manifests and layers are stored by their SHA-256 digest. docker pull foo@sha256:abc... retrieves the exact image whose manifest has that digest; the manifest lists its layers by digest, and any layer already in the local cache is not downloaded again.
  • Magnet links / BitTorrent. A magnet URI's xt=urn:btih: field is the SHA-1 hash of the torrent's info dict — peers find each other by content address.
  • Deduplicating backups. Restic, Borg, ZFS deduplication. Same bytes → same chunk hash → stored once.
  • Build and CDN cache keys. When you cache a build artefact, keying by hash of inputs (instead of by version string) means a no-op build is a free cache hit.
  • Subresource Integrity. <script src="..." integrity="sha384-..."> — the browser verifies the script matches the hash before executing.
  • Nix / Guix. Build artefacts named by hash of all inputs. Same inputs → same hash → no re-build.
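  • macOS Time Machine (a contrast, not an example). It avoids re-storing unchanged files across snapshots, but it does so with hard links on HFS+ and filesystem snapshots on APFS rather than per-file content hashes.

In each content-addressed case, the trick is the same: derive the identifier from a hash of the content (or, for Nix and build caches, of the inputs that determine it), not from a name.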
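As one concrete instance from the list, a Subresource Integrity value is the algorithm name, a hyphen, and the base64 of the raw digest. The sketch below assumes a single sha384 hash; sri_sha384 and sri_matches are hypothetical helper names, and the real browser check also accepts other algorithms and multiple hashes.

```python
import base64
import hashlib


def sri_sha384(script_bytes: bytes) -> str:
    digest = hashlib.sha384(script_bytes).digest()
    return "sha384-" + base64.b64encode(digest).decode("ascii")


def sri_matches(script_bytes: bytes, integrity: str) -> bool:
    # Mirrors the comparison a browser performs before executing the script.
    return sri_sha384(script_bytes) == integrity


script = b"console.log('hi');\n"
integrity = sri_sha384(script)
print(f'<script src="..." integrity="{integrity}"></script>')
print(sri_matches(script, integrity))                  # True
print(sri_matches(script + b"evil();\n", integrity))   # False
```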

Caveats

  • Renames are noise. A file renamed from a.txt to b.txt has the same content hash. Git does not record renames; it detects them heuristically at diff time by comparing content. Most other content-addressed systems just don't think about names.
  • Mutable references need an extra layer. "The latest version of foo" can't be a hash, because hashes don't change. Git uses refs (branches, tags) which are mutable pointers to immutable hashes. IPFS uses IPNS or DNSLink for the same purpose. The hash store is immutable; mutable names live above it.
  • Garbage collection. When nothing references a hash, you can delete it. But "nothing references" requires walking the whole graph. Both Git and IPFS have GC steps that walk live refs.
  • Hash transitions are painful. Git's SHA-1 → SHA-256 migration has been ongoing for years. Once your identifiers are baked into history, everywhere, switching algorithms means dual-stacking for the foreseeable future.
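The mutable-reference and garbage-collection caveats can be sketched together: an immutable object table, a mutable name table above it, and a mark-and-sweep pass that starts from the names. Everything here (the JSON object format, the "parent" field, the function names) is an assumption made for the example, not how Git or IPFS store data.

```python
import hashlib
import json

objects = {}  # address -> bytes, never modified once written
refs = {}     # mutable name -> address


def put(obj: dict) -> str:
    data = json.dumps(obj, sort_keys=True).encode()
    address = hashlib.sha256(data).hexdigest()
    objects[address] = data
    return address


def collect_garbage() -> int:
    # Mark: walk from every ref, following "parent" links.
    live, stack = set(), list(refs.values())
    while stack:
        address = stack.pop()
        if address is None or address in live:
            continue
        live.add(address)
        stack.append(json.loads(objects[address])["parent"])
    # Sweep: drop everything the walk did not reach.
    dead = [a for a in objects if a not in live]
    for a in dead:
        del objects[a]
    return len(dead)


v1 = put({"parent": None, "text": "draft"})
v2 = put({"parent": v1, "text": "final"})
orphan = put({"parent": v1, "text": "abandoned experiment"})

refs["latest"] = v1
refs["latest"] = v2        # moving the name changes no address
print(collect_garbage())   # 1: only the orphan is unreachable
print(sorted(objects) == sorted([v1, v2]))  # True
```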

The bigger idea

The universal address pattern is "compose a stable identity from inputs you trust". Content addressing is one example: identity = hash of bytes. Public keys (asymmetric cryptography) are another: identity = hash of a public key. Self-sovereign IDs (DIDs), Kademlia DHT keys, the IPFS CID, all use the pattern.

Once you internalise it, you start seeing everywhere a system could be content-addressed but isn't — and the friction those systems carry as a result. (Looking at you, every API that returns different bytes for the same URL.)

Tools in the wild

  • git (free tier, CLI). Origin story for content addressing in mainstream tooling. `git hash-object` shows the math.
  • IPFS / Kubo (free tier, CLI). Distributed content-addressed file system. `ipfs add` returns a CID; `ipfs cat <cid>` retrieves anywhere.
  • Restic (free tier, CLI). Backup tool built on content addressing. Identical bytes across machines = stored once.
  • borg (free tier, CLI). Deduplicating archiver. Chunks files, hashes chunks, stores by content. Same idea, smaller scope than IPFS.