hashing · level 9

Bloom Filters

Probabilistic membership — tunable false positives, never false negatives.

200 XP

Bloom Filters

A Bloom filter answers one question with two possible replies: definitely not or maybe. Built right, it does this in a few bits per item, way less than any exact set, and never produces a false negative. The trade-off you make is false positives — sometimes the filter says "maybe" when the answer is actually "no" — and the false-positive rate is yours to tune.

It's the data structure of choice when an exact lookup is much more expensive than a probabilistic lookup, and a "definitely not" answer can save you the work.

The structure

[ 0 0 1 0 1 1 0 0 0 1 0 0 1 0 0 0 ... ]   ← bit array of size m
       ↑     ↑           ↑
       hash_1(item)
                 hash_2(item)
                                 hash_k(item)

Three things parameterise it:

  • m — number of bits in the array.
  • n — expected number of items.
  • k — number of independent hash functions.

k is tunable. The optimal value (minimising false-positive rate for a given m/n) is k ≈ (m/n) · ln(2) ≈ 0.693 · m/n. For ~10 bits per item (m/n = 10), k ≈ 7.

Insert

add(item):
    for i in 1..k:
        bit_array[ hash_i(item) mod m ] = 1

Set k bits to 1. That's it. No conditional, no allocation. Insert is O(k).

Membership check

mightContain(item):
    for i in 1..k:
        if bit_array[ hash_i(item) mod m ] == 0:
            return false                         # definitely not
    return true                                  # maybe

If any of the k hash positions is 0, the item was definitely never added. If all k are 1, it was probably added — but you might be looking at bits set by other items by accident.

False-positive math

After n insertions into an m-bit filter with k hashes, the probability that a given bit is still 0 is:

(1 - 1/m)^(k·n) ≈ exp(-k·n / m)

The probability of a false positive (all k bits set by chance):

P(FP) ≈ (1 - exp(-k·n/m))^k

Plug in the numbers and you get the famous chart:

bits/item (m/n) optimal k false-positive rate
4 3 ~14.7%
6 4 ~5.6%
8 6 ~2.1%
10 7 ~0.8%
14 10 ~0.1%
20 14 ~0.005%

10 bits per item, 0.8% false-positive rate. For 10 million items: 12.5 MB of bits. The exact set (using a hash table of 64-bit pointers + entries) would be hundreds of MB. That's the size win.

What's NOT supported

  • Delete. You can't safely clear bits — another item might depend on the same bit being set. Counting Bloom filters replace each bit with a small counter (4 bits) that you increment on insert and decrement on delete. Same FPR math, 4× the memory.
  • Iterate / list contents. The filter knows nothing about the items you put in. It's purely a membership test.
  • Exact membership. "Maybe" is the best you'll ever get. If you need certainty, use a hash set or a Cuckoo filter (newer alternative that does support delete).

Real-world uses

1. Cassandra read filtering

Cassandra stores data in immutable on-disk files (sstables). When a read comes in, naively you'd open every sstable and search. Instead, each sstable has a Bloom filter of its keys. The read first asks each filter "do you have key X?". A "definitely not" skips the I/O entirely; only "maybe" sstables get read. The savings are huge — typical hit rates are 10-100×.

2. Bitcoin SPV (BIP 37)

Light wallets don't want to download every block — they only care about transactions involving their addresses. They send a Bloom filter of their addresses to a peer. The peer relays only transactions that match. The filter hides which exact addresses the wallet cares about (privacy via false positives). Without Bloom filters, every transaction in every block would have to traverse the wire to the wallet.

3. Chrome's malware-list lookup

Google maintains a list of bad URLs. Chrome can't ship the entire list to every browser. Instead, it ships a Bloom filter; when a URL is "maybe" malicious, Chrome makes a network round-trip to the canonical service to check exactly. False positives cost a network round trip; false negatives would cost a malware infection. Bloom filters guarantee no false negatives.

4. CDNs and caches

Cloudflare and Akamai use Bloom filters to know which content might be in their disk cache. If the filter says "definitely not", the request goes upstream immediately; no point hitting disk.

5. Database query optimisation

Postgres / MySQL / others use Bloom-filter-like structures (often "Bloom join filters") in query planners to push predicates early — eliminating rows whose join keys are guaranteed non-matches.

Trade-offs vs alternatives

Structure False neg False pos Delete Bits/item
Bloom filter never tunable ~10 for 1%
Counting Bloom never tunable ~40 for 1%
Cuckoo filter never tunable ~9 for 1% (slightly better)
Hash set never never ~256+
HyperLogLog n/a n/a (cardinality estimate) ~12 KB total for 1% error

For straight membership: Bloom if you don't need delete, Cuckoo if you do. Cuckoo is slightly more space-efficient and slightly faster; Bloom is the historical standard with universal library support.

When NOT to use a Bloom filter

  • You need exactness. Use a hash set.
  • You need to delete and restore items. Use a Cuckoo filter.
  • The dataset is small enough to fit in a hash set. Don't over-engineer.
  • You're counting, not testing membership. That's HyperLogLog or Count-Min Sketch.

Tuning rule of thumb

Pick the false-positive rate first, then size the filter. The Bloom Filter Calculator at hur.st/bloomfilter will tell you m and k for a given n and P(FP). As a rough mental model: 10 bits per item ≈ 1% FPR.

A concrete tiny example

Say m = 16, k = 3, three hash functions returning indices in [0, 16).

Insert "alice": hashes give 2, 9, 13 → set bits 2, 9, 13. Insert "bob": hashes give 5, 9, 14 → set bits 5, 9, 14. Insert "carol": hashes give 1, 6, 13 → set bits 1, 6, 13.

bit array:  [_  C  A  _  _  B  C  _  _  AB  _  _  _  AC  B  _]
             0  1  2  3  4  5  6  7  8  9  10 11 12 13 14 15

Test "dave": hashes give 5, 6, 14. All three bits are set (by alice, bob, carol overlapping). False positive!

This is the fundamental Bloom-filter trade-off: more bits, more hashes, fewer collisions. Tune to your error budget.

Tools in the wild

4 tools
  • Pure-Python Bloom filter with sensible defaults. Tune by max_elements + error_rate.

    library
  • RoaringBitmapfree tier

    Modern alternative for set membership when you can spare a bit more memory and want exact answers.

    library
  • Successor to Bloom. Supports delete, similar FPR, slightly better space efficiency.

    spec
  • Bloom + Cuckoo + Top-K + Count-Min as Redis modules. Production-ready, scalable.

    service