A Bloom filter is a tiny, super-fast yes/no checker for membership: “Could this item be in my set?”
- It uses a small bit array and a few hash functions.
- It can say “definitely not” or “maybe yes.”
- “Maybe yes” needs a real database check to confirm.
- Great when you need speed + low memory, and can tolerate a small chance of false positives.
What is a Bloom filter? (Beginner version)
Imagine a long row of light switches (bits) all starting at 0 (off).
To add an item (like a username), we run it through a few different hash functions (little math machines). Each hash picks a switch to turn on.
To query an item, we hash it the same way and check those switches:
- If any is off → “Definitely not present.”
- If all are on → “Maybe present.” (it could be in the set, or it’s a false positive)
That’s it. No storing full items. Just flipping bits!
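The light-switch idea fits in a few lines of Python. This is a toy sketch only: the salted built-in `hash()` stands in for real hash functions, and 16 "switches" is far too small for real use.

```python
# Toy version of the light-switch analogy: 16 switches, 3 "hash functions".
# Salting Python's built-in hash() is a stand-in for real hash functions.
bits = [0] * 16

def positions(item):
    # Three different salts give three different switch positions.
    return [hash(salt + item) % 16 for salt in ("a", "b", "c")]

def add(item):
    for p in positions(item):
        bits[p] = 1  # flip the switch on

def might_contain(item):
    return all(bits[p] == 1 for p in positions(item))

add("alice")
print(might_contain("alice"))  # True: all of alice's switches are on
print(might_contain("zoe"))    # usually False; a rare collision could say True
```

Note that only bits are stored, never the usernames themselves, which is exactly why the filter cannot enumerate or delete items.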
Why use the Bloom filter technique?
- Speed: O(1) time; a handful of hashes and bit checks.
- Tiny memory: Millions of items in just a few MB.
- Scale-friendly: Perfect as a front-door filter before hitting your database or API.
Trade-off: You can get false positives (says “maybe” even when not present), but never false negatives for items you added. If you need deletions, use a Counting Bloom filter or rebuild periodically.
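To make the deletion trade-off concrete, here is a minimal sketch of the Counting Bloom filter variant mentioned above (class and method names are our own): each position holds a small counter instead of a single bit, so removing an item decrements the counters it incremented.

```python
import hashlib

class CountingBloomFilter:
    """Sketch: each slot is a counter instead of a bit, so deletes work."""
    def __init__(self, m=1024, k=3):
        self.m, self.k = m, k
        self.counts = [0] * m  # counters, not bits

    def _positions(self, item):
        # Double hashing from two base hashes, as in the main example below.
        data = item.encode("utf-8")
        h1 = int.from_bytes(hashlib.sha256(data).digest()[:8], "big")
        h2 = int.from_bytes(hashlib.blake2b(data, digest_size=16).digest()[:8], "big")
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        for p in self._positions(item):
            self.counts[p] += 1

    def remove(self, item):
        # Only remove items you actually added, or counters can underflow.
        if self.might_contain(item):
            for p in self._positions(item):
                self.counts[p] -= 1

    def might_contain(self, item):
        return all(self.counts[p] > 0 for p in self._positions(item))
```

The cost is memory: a 4-bit or 8-bit counter per slot instead of a single bit.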
How Big Tech checks your username in milliseconds
Scenario: “how to check if username is already taken”
- Prebuild the filter: From all existing usernames, build a Bloom filter and store it in fast memory (e.g., Redis/shared cache).
- User types a name (say, @star_coder). Ask the filter:
- If the filter says “definitely not” → instantly show “Looks available” (then do a quick background verify before finalizing).
- If the filter says “maybe” → do a single exact lookup in the database:
- Found → “Username is taken.”
- Not found → it was a false positive; allow and add it to the DB, then update/rebuild the filter soon.
This setup avoids hammering your database for 99% of checks. Most queries end at the filter in micro- to milliseconds.
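The flow above can be sketched in a few lines. Here `ExactFilter` and the plain set `db` are stand-ins for a real Bloom filter and a real database; the function and class names are illustrative, not from any library.

```python
def check_username(username, bf, db):
    """Front-door check: Bloom filter first, exact lookup only on 'maybe'."""
    if not bf.might_contain(username):
        return "looks available"   # definitely not present; no DB hit needed
    if username in db:             # "maybe" -> one exact lookup
        return "taken"
    return "looks available"       # it was a false positive; DB confirms it's free

class ExactFilter:
    """Stand-in with the same interface; a real Bloom filter goes here."""
    def __init__(self, items):
        self.items = set(items)
    def might_contain(self, x):
        return x in self.items

db = {"alice", "bob"}
bf = ExactFilter(db)
print(check_username("alice", bf, db))       # taken
print(check_username("star_coder", bf, db))  # looks available
```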
How Bloom filters work (the core steps)
- Choose bit array size m and number of hash functions k.
- Add(x): For each of the k hashes of x, set bits[h_i(x)] = 1.
- MightContain(x): If all those positions are 1 → maybe; if any is 0 → no.
Quick Python example (minimal, no third-party libs)
import hashlib
from math import log, ceil

class BloomFilter:
    def __init__(self, n_items, fp_rate=0.01):
        # Sizing formulas:
        # m ≈ -(n * ln(p)) / (ln(2)^2), k ≈ (m/n) * ln(2)
        ln2 = log(2)
        m = int(ceil(-(n_items * log(fp_rate)) / (ln2**2)))
        k = max(1, int(round((m / n_items) * ln2)))
        self.m, self.k = m, k
        self.bits = bytearray((m + 7) // 8)

    def _hashes(self, item):
        data = item.encode("utf-8")
        h1 = int.from_bytes(hashlib.sha256(data).digest()[:8], "big")
        h2 = int.from_bytes(hashlib.blake2b(data, digest_size=16).digest()[:8], "big")
        for i in range(self.k):
            # Double hashing: h_i(x) = (h1 + i*h2) % m
            yield (h1 + i * h2) % self.m

    def _setbit(self, idx):
        self.bits[idx // 8] |= 1 << (idx % 8)

    def _getbit(self, idx):
        return (self.bits[idx // 8] >> (idx % 8)) & 1

    def add(self, item):
        for h in self._hashes(item):
            self._setbit(h)

    def might_contain(self, item):
        return all(self._getbit(h) for h in self._hashes(item))

# Demo
bf = BloomFilter(n_items=1_000_000, fp_rate=0.01)
for name in ["alice", "bob", "charlie"]:
    bf.add(name)

print(bf.might_contain("bob"))         # True (likely present)
print(bf.might_contain("star_coder"))  # False → definitely not present
How to use it for username checks
- On signup: if not bf.might_contain(username) → show “available (subject to final check)”.
- Else → do a DB lookup to confirm.
How big should your Bloom filter be?
Use these sizing rules of thumb (rounded):
- Formulas (handy):
  - m (bits) ≈ -(n * ln p) / (ln 2)^2
  - k (hash count) ≈ (m/n) * ln 2
  - where n = expected items, p = target false-positive rate.
- Examples:
- 1 million usernames, 1% FPR → ~1.14 MB, k ≈ 7.
- 1 million, 0.1% FPR → ~1.71 MB, k ≈ 10.
- 50 million, 1% FPR → ~57 MB, k ≈ 7.
- 50 million, 0.1% FPR → ~86 MB, k ≈ 10.
Pick a p your product can tolerate (e.g., 1% or 0.1%). Lower p = fewer false positives, a bit more memory and more hashes.
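The two formulas translate directly into a small sizing helper (`bloom_size` is our own name for it). It reproduces the example figures above:

```python
from math import ceil, log

def bloom_size(n, p):
    """Return (bits m, hashes k, size in MiB) for n items at FP rate p."""
    m = ceil(-(n * log(p)) / (log(2) ** 2))   # m ≈ -(n·ln p)/(ln 2)²
    k = max(1, round((m / n) * log(2)))       # k ≈ (m/n)·ln 2
    return m, k, m / 8 / 2**20                # bits -> bytes -> MiB

m, k, mib = bloom_size(1_000_000, 0.01)
print(k, round(mib, 2))   # 7 1.14  -- matches "~1.14 MB, k ≈ 7" above

m, k, mib = bloom_size(1_000_000, 0.001)
print(k, round(mib, 2))   # 10 1.71
```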
Production tips & best practices
- Hashes: Use fast, stable non-crypto hashes (e.g., Murmur/xxHash) or double-hashing trick from two base hashes.
- Normalization: Lowercase usernames; normalize Unicode so "José" and "José" behave the same.
- Updates:
  - Adds are easy: set more bits.
  - Deletes need Counting Bloom filters or periodic rebuilds.
- Rebuild cadence: For very large sets, rebuild every few minutes/hours via a job, then hot-swap.
- Split by shard: One filter per region/tenant can cut memory and increase cache locality.
- Edge checks: Keep a small “recently created usernames” exact set next to the filter to smooth over rebuild delays.
- Guardrails: Still do final exact DB/Key-Value check before reserving a username.
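The “recently created usernames” guard from the tips above can be sketched as a thin wrapper. All names here are illustrative, and `SetBloom` is an exact stand-in for a real Bloom filter:

```python
class SetBloom:
    """Exact stand-in with the Bloom interface, just for this demo."""
    def __init__(self):
        self._s = set()
    def add(self, x):
        self._s.add(x)
    def might_contain(self, x):
        return x in self._s

class GuardedFilter:
    """Bloom filter plus an exact set of names added since the last rebuild,
    so lookups stay correct during the rebuild window."""
    def __init__(self, bloom):
        self.bloom = bloom
        self.recent = set()           # exact set; small, cleared on rebuild

    def add(self, name):
        self.recent.add(name)
        self.bloom.add(name)

    def might_contain(self, name):
        return name in self.recent or self.bloom.might_contain(name)

    def hot_swap(self, new_bloom):
        # The rebuild job produced a fresh filter (including recent names);
        # swap it in and clear the exact set.
        self.bloom = new_bloom
        self.recent.clear()
```

The exact set stays tiny because it only covers the window between rebuilds, and the final DB check still acts as the guardrail.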
Pros & cons
Pros
- Blazing fast checks (a few hashes).
- Ultra-compact memory footprint.
- Simple to distribute and cache.
Cons
- False positives exist (tunable).
- No native deletes (without counting).
- Not for queries like “list all items” (it can’t enumerate contents).

