Hash Tables & Collisions
Buckets, chains, and probes — predict the final table layout.
Pick any key — a username, a URL, a product ID — and a hash table tells you exactly one slot to go look in, in constant time. That's the whole pitch: O(1) average-case lookup, insert, and delete, for any set of keys a language can turn into bytes. The price is two design choices that everything else hangs off of — your hash function and your collision strategy.
Analogy
Think of a classroom with 40 numbered lockers. When a new student arrives, the teacher runs their name through a formula (first letter of surname → locker number) and sends them straight to that locker. No searching the hallway. But if two students hash to the same locker, what then? Either they share the locker (stacking their stuff on a shelf — chaining), or the second student walks to the next empty locker and claims that one (open addressing / probing). Both strategies work — they just fail differently under pressure.
The core: hash + modulo
A hash table has two moving parts:
- A hash function — takes any key (string, number, object) and returns an integer. Good hash functions spread the output uniformly across the integer range — flip one bit of input, roughly half the output bits change (the avalanche property).
- Modulo the capacity — map that huge integer down to a bucket index in `[0, capacity)`. In code: `index = hash(key) % capacity`.
```javascript
// FNV-1a: a tiny, non-cryptographic 32-bit hash
function fnv1a(str) {
  let h = 2166136261 >>> 0; // FNV offset basis
  for (const ch of str) {
    // xor in the character code, multiply by the FNV prime, keep 32 bits
    h = Math.imul(h ^ ch.charCodeAt(0), 16777619) >>> 0;
  }
  return h;
}

const index = fnv1a("banana") % capacity;
```
FNV-1a is a tiny, non-cryptographic hash. Real-world tables use SipHash, xxHash, or specialised variants, but the shape is the same.
Load factor — the critical knob
The load factor is α = n/m where n is the number of stored keys and m is the capacity. It's the single most important tuning parameter of a hash table.
- At α = 0.5, collisions are rare.
- At α = 0.75, chaining still feels snappy but open addressing is starting to thrash.
- At α ≥ 1, open addressing is full — there's literally nowhere to put a new key.
Real implementations resize (double capacity and rehash every key) when α crosses a threshold — 0.75 for chained Java HashMaps, 0.5–0.66 for open-addressed Python dicts. Resizing is expensive but amortises across many inserts to still give O(1) expected cost.
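To see the amortisation, here's a toy counter (`CountingTable` is a made-up name; it tracks only sizes, not keys) that doubles whenever α crosses an assumed 0.75 threshold:

```javascript
// Toy model of grow-on-threshold resizing: counts resizes, stores no actual keys.
const MAX_LOAD = 0.75; // assumed threshold; real tables pick 0.5–0.75

class CountingTable {
  constructor(capacity = 8) {
    this.capacity = capacity;
    this.size = 0;
    this.resizes = 0;
  }
  insert() {
    this.size += 1;
    if (this.size / this.capacity > MAX_LOAD) {
      this.capacity *= 2; // double; a real table would also rehash every key here
      this.resizes += 1;
    }
  }
}

const t = new CountingTable(8);
for (let i = 0; i < 10000; i++) t.insert();
console.log(t.resizes, t.capacity); // 11 resizes to reach capacity 16384
```

Eleven doublings for ten thousand inserts: each resize touches every key, but because capacity doubles each time, total rehash work stays O(n) overall, i.e. O(1) amortised per insert.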
Strategy 1: Chaining
Each bucket holds a list (classically a linked list, modernly an array or even a tree) of the keys that landed there. Collisions just append.
- Wins at high α. Chains grow; lookup is O(1 + α) on average. Even α = 2 is survivable.
- Easy deletes. Just remove from the list — no "tombstone" bookkeeping.
- Loses on memory. Each chain node is a separate allocation, which means pointer chasing and cache misses.
- Java's HashMap treeifies chains that reach length 8 — O(log n) worst case guaranteed even under adversarial hash collisions.
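The mechanics above can be sketched in a few lines (`ChainedMap` is a hypothetical name; it reuses the FNV-1a hash from earlier, uses plain arrays as chains, and skips resizing):

```javascript
function fnv1a(str) {
  let h = 2166136261 >>> 0;
  for (const ch of str) h = Math.imul(h ^ ch.charCodeAt(0), 16777619) >>> 0;
  return h;
}

class ChainedMap {
  constructor(capacity = 8) {
    // one array ("chain") per bucket
    this.buckets = Array.from({ length: capacity }, () => []);
    this.capacity = capacity;
  }
  set(key, value) {
    const chain = this.buckets[fnv1a(key) % this.capacity];
    const entry = chain.find(e => e.key === key);
    if (entry) entry.value = value;  // overwrite existing key
    else chain.push({ key, value }); // collision just appends
  }
  get(key) {
    const chain = this.buckets[fnv1a(key) % this.capacity];
    const entry = chain.find(e => e.key === key);
    return entry ? entry.value : undefined;
  }
  delete(key) {
    const chain = this.buckets[fnv1a(key) % this.capacity];
    const i = chain.findIndex(e => e.key === key);
    if (i >= 0) { chain.splice(i, 1); return true; } // just unlink, no tombstones
    return false;
  }
}
```

Note that `delete` is a plain removal from the chain — this is the "easy deletes" property in action.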
Strategy 2: Open addressing (probing)
Each bucket is a single slot. On collision, walk forward until you find an empty one. Three common walks:
- Linear probing — try `h + 1, h + 2, h + 3, …` — simple and cache-friendly, but clusters badly (once a run of occupied slots forms, it grows).
- Quadratic probing — try `h + 1, h + 4, h + 9, …` — breaks up primary clusters.
- Double hashing — use a second hash function to decide the step size: `h1 + i·h2(key)` — best distribution.
Open addressing wins when memory layout matters: every bucket is one contiguous array, cache lines stay hot, no pointer chasing. But deletes are painful — you can't just clear a slot or probes for later keys might skip over the gap and report "not found." Typical fix: mark the slot as a tombstone and sweep periodically.
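Here's a minimal linear-probing sketch with tombstone deletion (`ProbedMap` and `TOMBSTONE` are illustrative names; it has no resizing, so it throws once genuinely full):

```javascript
const TOMBSTONE = Symbol("tombstone"); // assumed sentinel for deleted slots

function fnv1a(str) {
  let h = 2166136261 >>> 0;
  for (const ch of str) h = Math.imul(h ^ ch.charCodeAt(0), 16777619) >>> 0;
  return h;
}

class ProbedMap {
  constructor(capacity = 8) {
    this.slots = new Array(capacity).fill(null);
    this.capacity = capacity;
  }
  set(key, value) {
    let i = fnv1a(key) % this.capacity;
    let firstFree = -1; // first tombstone seen, reusable
    for (let step = 0; step < this.capacity; step++) {
      const slot = this.slots[i];
      if (slot === null) {
        // end of probe chain: reuse a tombstone if we passed one
        this.slots[firstFree >= 0 ? firstFree : i] = { key, value };
        return;
      }
      if (slot === TOMBSTONE) {
        if (firstFree < 0) firstFree = i;
      } else if (slot.key === key) {
        slot.value = value; // overwrite in place
        return;
      }
      i = (i + 1) % this.capacity; // linear probe: next slot, wrapping
    }
    if (firstFree >= 0) { this.slots[firstFree] = { key, value }; return; }
    throw new Error("table full"); // alpha = 1 with no resize
  }
  get(key) {
    let i = fnv1a(key) % this.capacity;
    for (let step = 0; step < this.capacity; step++) {
      const slot = this.slots[i];
      if (slot === null) return undefined; // a truly empty slot ends the probe
      if (slot !== TOMBSTONE && slot.key === key) return slot.value;
      i = (i + 1) % this.capacity; // tombstones are probed past, not stopped at
    }
    return undefined;
  }
  delete(key) {
    let i = fnv1a(key) % this.capacity;
    for (let step = 0; step < this.capacity; step++) {
      const slot = this.slots[i];
      if (slot === null) return false;
      if (slot !== TOMBSTONE && slot.key === key) {
        this.slots[i] = TOMBSTONE; // can't set null: later keys may probe past here
        return true;
      }
      i = (i + 1) % this.capacity;
    }
    return false;
  }
}
```

The `TOMBSTONE` marker is exactly the fix described above: `delete` can't clear the slot to `null` without breaking probe chains for keys inserted after it.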
When to pick which
| | Chaining | Open addressing |
|---|---|---|
| High α (≥ 0.75) | ✅ degrades gracefully | ❌ starts thrashing |
| Cache locality | ❌ pointer chase | ✅ contiguous |
| Delete complexity | ✅ just unlink | ⚠️ tombstones |
| Worst-case clustering | mild | severe (linear) |
| Used by | Java HashMap, Ruby Hash | Python dict, C++ flat_hash_map |
Rule of thumb: if you're writing a general-purpose container in a GC'd language, chaining is safer. If you're squeezing every nanosecond in a hot path, open addressing with a good probe strategy wins.
Gotchas
- A bad hash = O(n) forever. If every key hashes to the same bucket, you built a linked list with extra steps. Run a quick histogram test on real keys.
- Hash-flooding DoS. An attacker who can predict your hash function can jam every input into one bucket. Fix: use a randomised hash (SipHash, keyed at process start) — this is why Python, Rust, and V8 all randomise their hash seeds.
- Identity vs equality. In languages with structural equality, two objects might be "equal" but hash differently. Your type must satisfy `a == b ⇒ hash(a) == hash(b)` — Java's equals/hashCode contract, Python's `__eq__`/`__hash__` contract.
- Mutating a key that's in the table. The key's hash changes; the table can no longer find it. Most languages either freeze hash-keyed types (Python uses tuples, not lists, as dict keys) or document "don't do that" (Java).
- Resize in a hot loop. If you insert 10,000 items into a default-capacity (16-slot) Java HashMap, it'll resize roughly ten times along the way. Pre-size with the expected capacity to avoid the churn.
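The histogram test from the first gotcha takes only a few lines (`bucketHistogram` is an illustrative helper, reusing the FNV-1a function from earlier):

```javascript
function fnv1a(str) {
  let h = 2166136261 >>> 0;
  for (const ch of str) h = Math.imul(h ^ ch.charCodeAt(0), 16777619) >>> 0;
  return h;
}

// Count how many keys land in each bucket. A good hash keeps every count
// close to n/m; a bad one piles keys into a few buckets.
function bucketHistogram(keys, capacity, hash = fnv1a) {
  const counts = new Array(capacity).fill(0);
  for (const k of keys) counts[hash(k) % capacity] += 1;
  return counts;
}

const keys = Array.from({ length: 1000 }, (_, i) => `user-${i}`);
const hist = bucketHistogram(keys, 16);
console.log(Math.max(...hist), Math.min(...hist)); // a good hash: both near 1000/16 ≈ 62
```

Swap in your real keys and your real hash; if the max is many times the min, you've found your O(n) before production does.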
Worked example — capacity 4, chaining vs probing
Insert apple, banana, cherry, date into a table of capacity 4. Let's say the hashes mod 4 work out to be:
| key | hash mod 4 |
|---|---|
| apple | 2 |
| banana | 1 |
| cherry | 2 |
| date | 3 |
Chaining — each insert goes to its own hash slot. cherry collides with apple at bucket 2, so the chain there becomes [apple, cherry]. Final:
```
bucket 0: —
bucket 1: [banana]
bucket 2: [apple, cherry]
bucket 3: [date]
```
Linear probing — apple → 2 and banana → 1 go straight in. Then cherry hits bucket 2 (occupied by apple), probes forward to bucket 3 — still empty at this point, since date hasn't been inserted yet — and lands there. Now date hits bucket 3 (occupied by the freshly probed cherry), probes forward, wraps around to bucket 0, and lands there. Final:
```
bucket 0: date
bucket 1: banana
bucket 2: apple
bucket 3: cherry
```
Same keys, same capacity, radically different final layouts — and the order of insertion matters under probing in a way it doesn't under chaining.
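The whole worked example can be replayed in code, using the stated `hash mod 4` values directly rather than a real hash function:

```javascript
// Fixed hash values from the table above (already reduced mod 4).
const hashes = { apple: 2, banana: 1, cherry: 2, date: 3 };
const keys = ["apple", "banana", "cherry", "date"];
const capacity = 4;

// Chaining: append to the bucket's list on collision.
const chained = Array.from({ length: capacity }, () => []);
for (const k of keys) chained[hashes[k] % capacity].push(k);

// Linear probing: walk forward (wrapping) to the first empty slot.
const probed = new Array(capacity).fill(null);
for (const k of keys) {
  let i = hashes[k] % capacity;
  while (probed[i] !== null) i = (i + 1) % capacity;
  probed[i] = k;
}

console.log(chained); // [ [], ['banana'], ['apple','cherry'], ['date'] ]
console.log(probed);  // [ 'date', 'banana', 'apple', 'cherry' ]
```

Reordering `keys` leaves `chained` identical (up to chain order) but reshuffles `probed` — try inserting date before cherry.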
Real-world examples
- JavaScript objects (V8) are hash tables with hidden classes for common key sets; large/sparse keys fall back to a dictionary representation.
- Python dict — open addressing with a perturbation probe sequence (not pure linear), randomised SipHash, typical load-factor threshold ~0.66.
- Java HashMap — chaining, threshold 0.75, tree bucket conversion at chain length 8.
- Ruby Hash — open addressing since 2.4, preserves insertion order separately in an index array.
- Redis — incremental rehashing: during a resize, each operation migrates one bucket, so a huge dict never stalls.
Try it in the playground
Enter a list of keys, slide the capacity, and flip the strategy toggle. Watch how the same input produces a completely different final table — and how at capacity 4 with four keys (α = 1.0), chaining still works while probing has zero slack left.