Content Addressing
Git, IPFS, and the 'address by content not name' mental model.
Most storage systems answer the question "where is X?". Content-addressed storage flips it: the address IS the content. You don't choose where a file lives — you ask a hash function and the answer is its forever-name.
This is the mental model behind Git, IPFS, content-defined chunking, deduplicating backups, BitTorrent's info-hash, OCI container registries, and many peer-to-peer systems. Once you see it, "different systems converge on the same identifier for identical bytes" stops looking like a coincidence and reads as a feature (provided they agree on the hash function and on how the bytes are framed before hashing).
The shift
Traditional storage:
"please give me the file at /home/sam/notes.txt"
↓
filesystem lookup
↓
bytes returned
The bytes can change without warning; the path is what's stable.
Content-addressed storage:
"please give me the bytes whose SHA-256 is 2cf24dba…3043…9824"
↓
any node, any cache
↓
bytes returned (and verifiable)
The address is derived from the bytes, not assigned. If anyone returns different bytes, you instantly notice — they won't hash to the address you asked for.
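The verification step can be sketched in a few lines of Python. This is a minimal illustration, not any particular system's client: `fetch_verified`, `honest_cache`, and `tampered_cache` are hypothetical names, and the "untrusted source" is just a dictionary lookup.

```python
import hashlib
from typing import Callable


def fetch_verified(address: str, fetch: Callable[[str], bytes]) -> bytes:
    """Fetch bytes by SHA-256 address from an untrusted source, then verify them."""
    data = fetch(address)
    actual = hashlib.sha256(data).hexdigest()
    if actual != address:
        raise ValueError(f"source returned bytes hashing to {actual}, not {address}")
    return data


# Address of b"hello"; it begins 2cf24dba and ends 9824, like the request above.
HELLO = hashlib.sha256(b"hello").hexdigest()

honest_cache = {HELLO: b"hello"}
tampered_cache = {HELLO: b"hellp"}  # one byte changed by a hostile cache

print(fetch_verified(HELLO, honest_cache.__getitem__))  # b'hello'
```

Passing `tampered_cache.__getitem__` instead raises `ValueError`, which is the "you instantly notice" behaviour: the caller never has to trust whoever served the bytes.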
Three superpowers, one rule
Once your address is the hash of the content, three properties fall out for free:
- Deduplication. Two files with identical bytes have identical addresses; the storage layer can keep one copy. Restic and Borg backups exploit this within a single repo (and across machines if you set up a shared bucket).
- Tamper-evidence. Any byte change yields a different address. You can't silently mutate stored content; consumers fetching by the old address get the old bytes (or 404, but never silently-modified bytes).
- Inherent integrity. A consumer doesn't have to trust the server. They re-hash what they got and compare to the requested address. Untrusted CDNs become safe caches.
The single rule that makes all of this work: the hash function must be collision-resistant. If an attacker can craft A and B with H(A) = H(B), the whole edifice collapses. This is why Git is migrating from SHA-1 (which has known collisions, even if not yet exploitable in practice for real Git objects) to SHA-256.
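A toy in-memory store shows how deduplication and tamper-evidence come from the same design decision. `ContentStore` is a hypothetical class written for this illustration; real systems add chunking, persistence, and concurrency on top.

```python
import hashlib


class ContentStore:
    """Toy in-memory content-addressed store keyed by SHA-256 hex digest."""

    def __init__(self) -> None:
        self._blobs: dict[str, bytes] = {}

    def put(self, data: bytes) -> str:
        address = hashlib.sha256(data).hexdigest()
        # Identical bytes map to the same key, so a second put stores nothing new.
        self._blobs.setdefault(address, data)
        return address

    def get(self, address: str) -> bytes:
        data = self._blobs[address]  # a KeyError plays the role of a 404
        # Re-hash on read so corruption or tampering in the backend is caught.
        if hashlib.sha256(data).hexdigest() != address:
            raise ValueError("stored bytes no longer match their address")
        return data

    def __len__(self) -> int:
        return len(self._blobs)


store = ContentStore()
a = store.put(b"quarterly report")
b = store.put(b"quarterly report")   # duplicate upload
c = store.put(b"quarterly report!")  # one byte added
print(a == b, a == c, len(store))    # True False 2
```

The store never decides where anything lives. All three properties depend on `sha256` being collision-resistant: two different inputs with the same digest would silently overwrite each other in `put`.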
Git — the canonical example
Every Git object — blob (file content), tree (directory listing), commit (snapshot + parent), tag (named pointer) — is named by the SHA-1 hash of its serialised form.
blob: sha1("blob <length>\0<content>")
tree: sha1("tree <length>\0<entries>")
commit: sha1("commit <length>\0<header>\n<message>")
Try it yourself:
$ echo -n "hello" | git hash-object --stdin
b6fc4c620b67d95f953a5c1c1230aaab5db5a1b0
$ printf "blob 5\0hello" | sha1sum
b6fc4c620b67d95f953a5c1c1230aaab5db5a1b0 -
Same hash. Git's "object database" (.git/objects/) is a key-value store keyed on those hashes. A commit's parent is just another commit hash. The whole DAG of history is built from references between content-addresses.
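The same computation in Python, as a small sketch using only the standard library. `git_blob_hash` is a name chosen for this example, not part of any Git binding.

```python
import hashlib


def git_blob_hash(content: bytes) -> str:
    """Return the SHA-1 object ID Git assigns to a blob with this content."""
    # Git hashes "blob <length>\0" followed by the raw bytes, not the bytes alone.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


print(git_blob_hash(b"hello"))  # b6fc4c620b67d95f953a5c1c1230aaab5db5a1b0
```

Because of the header, a Git blob ID never equals a plain `sha1sum` of the file. Two content-addressed systems only agree on an identifier when they agree on both the hash function and the framing.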
This means:
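- Two repos with identical file content have identical blob and tree hashes. Commit hashes match only when the metadata (author, committer, timestamps, message, parents) is identical too, which is why a fork shares its upstream's commit IDs. Forks share storage if you've configured them to.
- Fetching is incremental: you only download the objects you don't already have.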
- Tampering with a single byte of a file in any past commit changes that file's hash, which changes the tree hash, which changes the commit hash, which changes every descendant commit hash. Detected immediately.
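The cascade in the last point can be reproduced with a toy object model. This sketch imitates Git's "type, length, NUL" framing but uses a deliberately simplified tree and commit serialisation (no file modes, authors, or timestamps), so its hashes will not match real Git objects. All function names are hypothetical.

```python
import hashlib
from typing import Optional


def h(kind: str, body: bytes) -> str:
    """Hash a typed object using a Git-like '<type> <length>' + NUL header."""
    header = f"{kind} {len(body)}".encode() + b"\0"
    return hashlib.sha1(header + body).hexdigest()


def tree(files: dict[str, bytes]) -> str:
    # Simplified: one "<blob-hash> <name>" line per file, sorted by name.
    lines = [f"{h('blob', data)} {name}" for name, data in sorted(files.items())]
    return h("tree", "\n".join(lines).encode())


def commit(tree_hash: str, parent: Optional[str], message: str) -> str:
    # Simplified: no author, committer, or timestamp fields.
    parent_line = f"parent {parent}\n" if parent else ""
    return h("commit", f"tree {tree_hash}\n{parent_line}\n{message}".encode())


def history(first_file: bytes) -> list[str]:
    c1 = commit(tree({"notes.txt": first_file}), None, "first")
    c2 = commit(tree({"notes.txt": first_file, "todo.txt": b"ship"}), c1, "second")
    c3 = commit(tree({"notes.txt": b"rewritten", "todo.txt": b"ship"}), c2, "third")
    return [c1, c2, c3]


original = history(b"hello")
tampered = history(b"hellp")  # one byte changed in the oldest commit
print([a == b for a, b in zip(original, tampered)])  # [False, False, False]
```

The third commit's tree is identical in both histories, yet its hash still differs because it names its parent by hash. That chaining is what makes a single tip hash vouch for the entire history behind it.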
IPFS — content addressing, distributed
IPFS extends the idea to a distributed network. Every file has a CID (Content Identifier) — a self-describing label that includes the hash, the hash algorithm, the codec for the bytes, and a multibase prefix:
CIDv1 in base32:
bafy bei gd irzx5n bbcyx 4kqzcqyc 5buy yj7akrhubs7uquvxn7gzd7s i
| header | hash bytes (sha2-256, 32 bytes) |
Decoded:
| Part | Meaning |
|---|---|
| b (multibase prefix, first character of bafy...) | base32 |
| 0x01 | CID version 1 |
| 0x70 | codec: dag-pb (IPFS legacy structure) |
| 0x12 0x20 | multihash: sha2-256, 32-byte digest |
| 32 bytes | the digest |
The reason for the wrapper: algorithm agility. CIDv1 can carry any hash function registered in the multihash table (sha2-256, blake2b, blake3, ...), any registered codec (dag-pb, dag-cbor, raw bytes), and any multibase encoding. The same identifier shape covers a sha2-256 dag-pb file and a BLAKE3-hashed raw block.
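Practically: ipfs add file.pdf returns a CID. Any node on the IPFS network can retrieve those bytes by CID, as long as some peer still provides them, and verify them locally because the CID embeds the hash. For a file larger than one chunk, that hash covers the root of a Merkle DAG of chunks rather than the raw file bytes, so the CID will not equal a plain sha256sum of the file.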
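The byte layout in the table above can be assembled and taken apart by hand. The sketch below is not the IPFS implementation: `make_cidv1` and `parse_cidv1` are hypothetical helpers, they assume every field fits in a single varint byte, and the digest is taken over a placeholder payload, so the resulting string is not what `ipfs add` would print for any real file.

```python
import base64
import hashlib


def make_cidv1(codec: int, payload: bytes) -> str:
    """Build a base32 CIDv1 string: version, codec, multihash header, digest."""
    digest = hashlib.sha256(payload).digest()
    # 0x01 = CID version 1; 0x12 = sha2-256; 0x20 = digest length (32).
    # Assumes each field fits in one varint byte (value < 0x80).
    raw = bytes([0x01, codec, 0x12, 0x20]) + digest
    # Multibase prefix "b" announces lowercase, unpadded base32.
    return "b" + base64.b32encode(raw).decode("ascii").rstrip("=").lower()


def parse_cidv1(cid: str) -> dict:
    """Split a base32 CIDv1 with single-byte fields back into its parts."""
    if not cid.startswith("b"):
        raise ValueError("only the base32 multibase prefix 'b' is handled here")
    body = cid[1:].upper()
    # The standard-library decoder wants uppercase input padded to a multiple of 8.
    raw = base64.b32decode(body + "=" * (-len(body) % 8))
    version, codec, hash_code, length = raw[:4]
    digest = raw[4:]
    if version != 1 or len(digest) != length:
        raise ValueError("not a CIDv1 with a well-formed multihash")
    return {"version": version, "codec": codec, "hash": hash_code, "digest": digest}


cid = make_cidv1(0x70, b"placeholder payload")
parts = parse_cidv1(cid)
print(cid[:7], len(cid), hex(parts["codec"]), hex(parts["hash"]))
# bafybei 59 0x70 0x12
```

The familiar "bafybei" opening is nothing more than the four header bytes 0x01 0x70 0x12 0x20 rendered in base32. Swapping the codec to 0x55 (raw) yields the other common opening, "bafkrei".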
Where else you'll see it
- Container registries. Image manifests and layers are stored by their SHA-256 digest. docker pull foo@sha256:abc... retrieves the image whose manifest has that digest; layers already in the local cache cause no network traffic.
- Magnet links / BitTorrent. A magnet URI's xt=urn:btih: field is the SHA-1 hash of the torrent's info dict; peers find each other by content address.
- Deduplicating backups. Restic, Borg, ZFS deduplication. Same bytes → same chunk hash → stored once.
- CDN cache keys. When you cache a build artefact, keying by hash of inputs (instead of by version string) means a no-op build is a free cache hit.
- Subresource Integrity. <script src="..." integrity="sha384-...">: the browser verifies the script matches the hash before executing.
- Nix / Guix. Build artefacts named by hash of all inputs. Same inputs → same hash → no re-build.
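- macOS Time Machine (a contrast, not an instance). Unchanged files are shared across snapshots through hard links or APFS snapshots, tracked by path and change events rather than by a content hash.
In every case but that last one, the trick is the same: derive the identifier from the bytes, not from a name.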
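The Subresource Integrity entry is easy to reproduce: the attribute value is the algorithm name, a hyphen, and the base64-encoded digest of the resource bytes. The sketch below uses hypothetical helper names, and `matches` only imitates the comparison a browser performs.

```python
import base64
import hashlib


def sri_sha384(resource: bytes) -> str:
    """Return an integrity value of the form 'sha384-<base64 digest>'."""
    digest = hashlib.sha384(resource).digest()
    return "sha384-" + base64.b64encode(digest).decode("ascii")


def matches(resource: bytes, integrity: str) -> bool:
    """Imitate the browser-side check: recompute the value and compare."""
    return sri_sha384(resource) == integrity


script = b"console.log('hi');\n"
tag_value = sri_sha384(script)
print(matches(script, tag_value))                   # True
print(matches(script + b"// injected", tag_value))  # False
```

The page author pins the hash once, and any CDN in between can serve the file without being trusted.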
Caveats
- Renames are noise. A file renamed from a.txt to b.txt has the same content hash. Git tracks renames heuristically by content, not by filename. Most other content-addressed systems just don't think about names.
- Mutable references need an extra layer. "The latest version of foo" can't be a hash, because hashes don't change. Git uses refs (branches, tags) which are mutable pointers to immutable hashes. IPFS uses IPNS or DNSLink for the same purpose. The hash store is immutable; mutable names live above it.
- Garbage collection. When nothing references a hash, you can delete it. But "nothing references" requires walking the whole graph. Both Git and IPFS have GC steps that walk live refs.
- Hash transitions are painful. Git's SHA-1 → SHA-256 migration has been ongoing for years. Once your identifiers are baked into history, everywhere, switching algorithms means dual-stacking for the foreseeable future.
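The mutable-reference and garbage-collection caveats fit together, as a small mark-and-sweep sketch shows. It is a toy model under stated assumptions: `objects`, `edges`, `refs`, `put`, and `collect_garbage` are hypothetical names, and the way links are folded into the hash is invented for this example rather than taken from Git or IPFS.

```python
import hashlib

objects: dict[str, bytes] = {}           # immutable layer: address -> bytes
edges: dict[str, tuple[str, ...]] = {}   # which addresses each object links to
refs: dict[str, str] = {}                # mutable layer: name -> address


def put(data: bytes, links: tuple[str, ...] = ()) -> str:
    """Store an object that may reference other objects by address."""
    # Links are part of the hashed bytes, so pointing elsewhere changes the address.
    address = hashlib.sha256("\n".join(links).encode() + b"\0" + data).hexdigest()
    objects[address] = data
    edges[address] = links
    return address


def collect_garbage() -> int:
    """Mark everything reachable from a ref, then sweep the rest."""
    live: set[str] = set()
    stack = list(refs.values())
    while stack:
        address = stack.pop()
        if address not in live:
            live.add(address)
            stack.extend(edges[address])
    dead = set(objects) - live
    for address in dead:
        del objects[address], edges[address]
    return len(dead)


v1 = put(b"draft")
v2 = put(b"final", links=(v1,))
orphan = put(b"scratch")

refs["latest"] = v1       # "latest" is a name, so it may move...
refs["latest"] = v2       # ...while v1 and v2 themselves never change
print(collect_garbage())  # 1 (only the unreferenced scratch object)
```

The first version survives collection because the second one links to it. Reachability is decided by walking the whole graph from the refs, which is why GC is a separate and potentially expensive step.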
The bigger idea
The universal address pattern is "compose a stable identity from inputs you trust". Content addressing is one example: identity = hash of bytes. Public keys (asymmetric cryptography) are another: identity = hash of a public key. Self-sovereign IDs (DIDs), Kademlia DHT keys, and the IPFS CID all use the pattern.
Once you internalise it, you start seeing everywhere a system could be content-addressed but isn't — and the friction those systems carry as a result. (Looking at you, every API that returns different bytes for the same URL.)
Tools in the wild
- git (CLI, free tier). Origin story for content addressing in mainstream tooling. `git hash-object` shows the math.
- IPFS / Kubo (CLI, free tier). Distributed content-addressed file system. `ipfs add` returns a CID; `ipfs cat <cid>` retrieves the bytes from anywhere.
- Restic (CLI, free tier). Backup tool built on content addressing. Identical bytes across machines backing up to the same repository are stored once.
- borg (CLI, free tier). Deduplicating archiver. Chunks files, hashes chunks, stores by content. Same idea, smaller scope than IPFS.