Bloom Filters
Probabilistic membership — tunable false positives, never false negatives.
Bloom Filters
A Bloom filter answers one question with two possible replies: definitely not or maybe. Built right, it does this in a few bits per item, way less than any exact set, and never produces a false negative. The trade-off you make is false positives — sometimes the filter says "maybe" when the answer is actually "no" — and the false-positive rate is yours to tune.
It's the data structure of choice when an exact lookup is much more expensive than a probabilistic lookup, and a "definitely not" answer can save you the work.
The structure
[ 0 0 1 0 1 1 0 0 0 1 0 0 1 0 0 0 ... ] ← bit array of size m
↑ ↑ ↑
hash_1(item)
hash_2(item)
hash_k(item)
Three things parameterise it:
m— number of bits in the array.n— expected number of items.k— number of independent hash functions.
k is tunable. The optimal value (minimising false-positive rate for a given m/n) is k ≈ (m/n) · ln(2) ≈ 0.693 · m/n. For ~10 bits per item (m/n = 10), k ≈ 7.
Insert
add(item):
for i in 1..k:
bit_array[ hash_i(item) mod m ] = 1
Set k bits to 1. That's it. No conditional, no allocation. Insert is O(k).
Membership check
mightContain(item):
for i in 1..k:
if bit_array[ hash_i(item) mod m ] == 0:
return false # definitely not
return true # maybe
If any of the k hash positions is 0, the item was definitely never added. If all k are 1, it was probably added — but you might be looking at bits set by other items by accident.
False-positive math
After n insertions into an m-bit filter with k hashes, the probability that a given bit is still 0 is:
(1 - 1/m)^(k·n) ≈ exp(-k·n / m)
The probability of a false positive (all k bits set by chance):
P(FP) ≈ (1 - exp(-k·n/m))^k
Plug in the numbers and you get the famous chart:
| bits/item (m/n) | optimal k | false-positive rate |
|---|---|---|
| 4 | 3 | ~14.7% |
| 6 | 4 | ~5.6% |
| 8 | 6 | ~2.1% |
| 10 | 7 | ~0.8% |
| 14 | 10 | ~0.1% |
| 20 | 14 | ~0.005% |
10 bits per item, 0.8% false-positive rate. For 10 million items: 12.5 MB of bits. The exact set (using a hash table of 64-bit pointers + entries) would be hundreds of MB. That's the size win.
What's NOT supported
- Delete. You can't safely clear bits — another item might depend on the same bit being set. Counting Bloom filters replace each bit with a small counter (4 bits) that you increment on insert and decrement on delete. Same FPR math, 4× the memory.
- Iterate / list contents. The filter knows nothing about the items you put in. It's purely a membership test.
- Exact membership. "Maybe" is the best you'll ever get. If you need certainty, use a hash set or a Cuckoo filter (newer alternative that does support delete).
Real-world uses
1. Cassandra read filtering
Cassandra stores data in immutable on-disk files (sstables). When a read comes in, naively you'd open every sstable and search. Instead, each sstable has a Bloom filter of its keys. The read first asks each filter "do you have key X?". A "definitely not" skips the I/O entirely; only "maybe" sstables get read. The savings are huge — typical hit rates are 10-100×.
2. Bitcoin SPV (BIP 37)
Light wallets don't want to download every block — they only care about transactions involving their addresses. They send a Bloom filter of their addresses to a peer. The peer relays only transactions that match. The filter hides which exact addresses the wallet cares about (privacy via false positives). Without Bloom filters, every transaction in every block would have to traverse the wire to the wallet.
3. Chrome's malware-list lookup
Google maintains a list of bad URLs. Chrome can't ship the entire list to every browser. Instead, it ships a Bloom filter; when a URL is "maybe" malicious, Chrome makes a network round-trip to the canonical service to check exactly. False positives cost a network round trip; false negatives would cost a malware infection. Bloom filters guarantee no false negatives.
4. CDNs and caches
Cloudflare and Akamai use Bloom filters to know which content might be in their disk cache. If the filter says "definitely not", the request goes upstream immediately; no point hitting disk.
5. Database query optimisation
Postgres / MySQL / others use Bloom-filter-like structures (often "Bloom join filters") in query planners to push predicates early — eliminating rows whose join keys are guaranteed non-matches.
Trade-offs vs alternatives
| Structure | False neg | False pos | Delete | Bits/item |
|---|---|---|---|---|
| Bloom filter | never | tunable | ✗ | ~10 for 1% |
| Counting Bloom | never | tunable | ✓ | ~40 for 1% |
| Cuckoo filter | never | tunable | ✓ | ~9 for 1% (slightly better) |
| Hash set | never | never | ✓ | ~256+ |
| HyperLogLog | n/a | n/a (cardinality estimate) | ✗ | ~12 KB total for 1% error |
For straight membership: Bloom if you don't need delete, Cuckoo if you do. Cuckoo is slightly more space-efficient and slightly faster; Bloom is the historical standard with universal library support.
When NOT to use a Bloom filter
- You need exactness. Use a hash set.
- You need to delete and restore items. Use a Cuckoo filter.
- The dataset is small enough to fit in a hash set. Don't over-engineer.
- You're counting, not testing membership. That's HyperLogLog or Count-Min Sketch.
Tuning rule of thumb
Pick the false-positive rate first, then size the filter. The Bloom Filter Calculator at hur.st/bloomfilter will tell you m and k for a given n and P(FP). As a rough mental model: 10 bits per item ≈ 1% FPR.
A concrete tiny example
Say m = 16, k = 3, three hash functions returning indices in [0, 16).
Insert "alice": hashes give 2, 9, 13 → set bits 2, 9, 13. Insert "bob": hashes give 5, 9, 14 → set bits 5, 9, 14. Insert "carol": hashes give 1, 6, 13 → set bits 1, 6, 13.
bit array: [_ C A _ _ B C _ _ AB _ _ _ AC B _]
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Test "dave": hashes give 5, 6, 14. All three bits are set (by alice, bob, carol overlapping). False positive!
This is the fundamental Bloom-filter trade-off: more bits, more hashes, fewer collisions. Tune to your error budget.
Tools in the wild
4 tools- librarybloom-filter2 (PyPI)free tier
Pure-Python Bloom filter with sensible defaults. Tune by max_elements + error_rate.
- libraryRoaringBitmapfree tier
Modern alternative for set membership when you can spare a bit more memory and want exact answers.
- specCuckoo Filter (paper + impls)free tier
Successor to Bloom. Supports delete, similar FPR, slightly better space efficiency.
- service
Bloom + Cuckoo + Top-K + Count-Min as Redis modules. Production-ready, scalable.