How Big Tech Checks Your Username in Milliseconds (Bloom filter)

A Bloom filter is a tiny, super-fast yes/no checker for membership: “Could this item be in my set?”

  • It uses a small bit array and a few hash functions.
  • It can say “definitely not” or “maybe yes.”
  • “Maybe yes” needs a real database check to confirm.
  • Great when you need speed + low memory, and can tolerate a small chance of false positives.

What is a Bloom filter? (Beginner version)

Imagine a long row of light switches (bits) all starting at 0 (off).
To add an item (like a username), we run it through a few different hash functions (little math machines). Each hash picks a switch to turn on.
To query an item, we hash it the same way and check those switches:

  • If any is off → “Definitely not present.”
  • If all are on → “Maybe present.” (it could be in the set, or it’s a false positive)

That’s it. No storing full items. Just flipping bits!
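
To make that concrete, here is a toy sketch in Python with just 10 "switches" and 3 hash picks per item. It is for intuition only: a salted CRC32 stands in for real hash functions, and real filters use far more bits (see the full example later).

import zlib

bits = [0] * 10                        # ten light switches, all off

def picks(item, k=3, m=10):
    # Derive k switch positions from the item; a salted CRC32 stands in
    # for k independent hash functions.
    return [zlib.crc32(f"{item}:{i}".encode()) % m for i in range(k)]

def add(item):
    for p in picks(item):
        bits[p] = 1                    # flip those switches on

def might_contain(item):
    return all(bits[p] for p in picks(item))

add("alice")
print(bits)                            # a few switches are now on
print(might_contain("alice"))          # True → "maybe present"
print(might_contain("zoe"))            # very likely False → "definitely not present"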


Why use the Bloom filter technique?

  • Speed: O(1) per lookup (just k hash computations and bit checks), no matter how many items are stored.
  • Tiny memory: Millions of items in just a few MB.
  • Scale-friendly: Perfect as a front-door filter before hitting your database or API.

Trade-off: You can get false positives (says “maybe” even when not present), but never false negatives for items you added. If you need deletions, use a Counting Bloom filter or rebuild periodically.


How Big Tech checks your username in milliseconds

Scenario: “how to check if username is already taken”

  1. Prebuild the filter: From all existing usernames, build a Bloom filter and store it in fast memory (e.g., Redis/shared cache).
  2. User types a name: @star_coder
  3. Ask the filter:
    • If the filter says “definitely not” → instantly show “Looks available” (then do a quick background verify before finalizing).
    • If the filter says “maybe” → do a single exact lookup in the database:
      • Found → “Username is taken.”
      • Not found → it was a false positive; allow the signup, insert the username into the DB, and add it to the filter.

This setup avoids hammering your database for the vast majority of checks. Most queries end at the filter in microseconds to low milliseconds.
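
As a concrete illustration of the "prebuild the filter and keep it in fast memory" step, here is a minimal sketch of a filter shared as a Redis bitmap via SETBIT/GETBIT. It assumes a local Redis server and the redis-py client; the key name and sizing constants are made up for the example, and a real deployment would also handle normalization and rebuilds.

import hashlib
import redis

r = redis.Redis(host="localhost", port=6379)
KEY, M, K = "usernames:bloom", 9_600_000, 7    # ~1.2 MB for ~1M names at 1% FPR

def _positions(username):
    data = username.lower().encode("utf-8")
    h1 = int.from_bytes(hashlib.sha256(data).digest()[:8], "big")
    h2 = int.from_bytes(hashlib.blake2b(data, digest_size=16).digest()[:8], "big")
    return [(h1 + i * h2) % M for i in range(K)]

def add(username):
    pipe = r.pipeline()
    for pos in _positions(username):
        pipe.setbit(KEY, pos, 1)       # turn those bits on in the shared bitmap
    pipe.execute()

def might_contain(username):
    pipe = r.pipeline()
    for pos in _positions(username):
        pipe.getbit(KEY, pos)
    return all(pipe.execute())         # all bits set → "maybe"; any zero → "no"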


How Bloom filters work (the core steps)

  1. Choose bit array size m and number of hash functions k.
  2. Add(x): For each of the k hashes of x, set bits[h_i(x)] = 1.
  3. MightContain(x): If all those positions are 1 → maybe; if any is 0 → no.

Quick Python example (minimal, no third-party libs)

import hashlib
from math import log, ceil

class BloomFilter:
    def __init__(self, n_items, fp_rate=0.01):
        # Sizing formulas:
        # m ≈ -(n * ln(p)) / (ln(2)^2),  k ≈ (m/n) * ln(2)
        ln2 = log(2)
        m = int(ceil(-(n_items * log(fp_rate)) / (ln2**2)))
        k = max(1, int(round((m / n_items) * ln2)))
        self.m, self.k = m, k
        self.bits = bytearray((m + 7) // 8)

    def _hashes(self, item):
        data = item.encode("utf-8")
        h1 = int.from_bytes(hashlib.sha256(data).digest()[:8], "big")
        h2 = int.from_bytes(hashlib.blake2b(data, digest_size=16).digest()[:8], "big")
        for i in range(self.k):
            # Double hashing: h_i(x) = (h1 + i*h2) % m
            yield (h1 + i * h2) % self.m

    def _setbit(self, idx):
        self.bits[idx // 8] |= 1 << (idx % 8)

    def _getbit(self, idx):
        return (self.bits[idx // 8] >> (idx % 8)) & 1

    def add(self, item):
        for h in self._hashes(item):
            self._setbit(h)

    def might_contain(self, item):
        return all(self._getbit(h) for h in self._hashes(item))

# Demo
bf = BloomFilter(n_items=1_000_000, fp_rate=0.01)
for name in ["alice", "bob", "charlie"]:
    bf.add(name)

print(bf.might_contain("bob"))      # True (likely present)
print(bf.might_contain("star_coder"))  # False → definitely not present

How to use it for username checks

  • On signup: if not bf.might_contain(username): → show “available (subject to final check)”.
  • Else → do a DB lookup to confirm.
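
Wiring this together, here is a small sketch of that flow reusing the BloomFilter instance bf from the demo above; db_username_exists() is a hypothetical stand-in for your real exact lookup.

def db_username_exists(username):
    return username in {"alice", "bob", "charlie"}     # pretend DB table

def check_username(username):
    if not bf.might_contain(username):
        return "available (subject to final check)"    # definitely not taken
    if db_username_exists(username):                   # "maybe" → confirm in the DB
        return "taken"
    return "available"                                 # the filter gave a false positive

print(check_username("bob"))           # taken
print(check_username("star_coder"))    # available (subject to final check)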

How big should your Bloom filter be?

Use these sizing rules of thumb (rounded):

  • Formulas (handy):
    • m (bits) ≈ -(n * ln p) / (ln 2)^2
    • k (hash count) ≈ (m/n) * ln 2
    • Where n = expected items, p = target false-positive rate.
  • Examples:
    • 1 million usernames, 1% FPR → ~1.14 MB, k ≈ 7.
    • 1 million, 0.1% FPR → ~1.71 MB, k ≈ 10.
    • 50 million, 1% FPR → ~57 MB, k ≈ 7.
    • 50 million, 0.1% FPR → ~86 MB, k ≈ 10.

Pick a p your product can tolerate (e.g., 1% or 0.1%). A lower p means fewer false positives at the cost of a bit more memory and a few more hash computations.
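
If you would rather not do the arithmetic by hand, a small helper like this sketch (the same formulas used in the class above) returns m, k, and the approximate memory; it divides by 2^20 bytes to match the MB figures above.

from math import log, ceil

def bloom_size(n, p):
    m = ceil(-(n * log(p)) / (log(2) ** 2))    # bits
    k = max(1, round((m / n) * log(2)))        # hash functions
    return m, k, round(m / 8 / 2**20, 2)       # bits, hashes, ~MB

print(bloom_size(1_000_000, 0.01))     # about 9.6M bits, k = 7, ~1.14 MB
print(bloom_size(50_000_000, 0.001))   # about 719M bits, k = 10, ~86 MB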


Production tips & best practices

  • Hashes: Use fast, stable non-cryptographic hashes (e.g., MurmurHash or xxHash), or the double-hashing trick shown above that derives k indices from two base hashes.
  • Normalization: Lowercase (or casefold) usernames and normalize Unicode (e.g., NFKC), so that "José" typed with a precomposed é and "José" built from a combining accent behave the same; see the sketch after this list.
  • Updates:
    • Adds are easy: set more bits.
    • Deletes need Counting Bloom filters or periodic rebuilds.
  • Rebuild cadence: For very large sets, rebuild every few minutes/hours via a job, then hot-swap.
  • Split by shard: One filter per region/tenant can cut memory and increase cache locality.
  • Edge checks: Keep a small “recently created usernames” exact set next to the filter to smooth over rebuild delays.
  • Guardrails: Still do final exact DB/Key-Value check before reserving a username.
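
A minimal sketch of the normalization step using Python's standard unicodedata module (the example strings are only illustrations):

import unicodedata

def canonical(username):
    # NFKC-normalize and casefold so visually identical names map to the same bits.
    return unicodedata.normalize("NFKC", username).casefold()

composed   = "Jos\u00e9"       # "José" with a precomposed é
decomposed = "Jose\u0301"      # "José" built from "e" + a combining accent
print(canonical(composed) == canonical(decomposed))    # True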

Pros & cons

Pros

  • Blazing fast checks (a few hashes).
  • Ultra-compact memory footprint.
  • Simple to distribute and cache.

Cons

  • False positives exist (tunable).
  • No native deletes (without a Counting Bloom filter; see the sketch below).
  • Not for queries like “list all items” (it can’t enumerate contents).
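
For completeness, here is a minimal sketch of the Counting Bloom filter idea mentioned above: each position holds a small counter instead of a single bit, so removals become possible. The class and parameters are illustrative, and it reuses the double-hashing scheme from the BloomFilter example.

import hashlib

class CountingBloomFilter:
    def __init__(self, m=1000, k=7):
        self.m, self.k = m, k
        self.counts = [0] * m            # small counters instead of bits

    def _positions(self, item):
        data = item.encode("utf-8")
        h1 = int.from_bytes(hashlib.sha256(data).digest()[:8], "big")
        h2 = int.from_bytes(hashlib.blake2b(data, digest_size=16).digest()[:8], "big")
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        for p in self._positions(item):
            self.counts[p] += 1

    def remove(self, item):
        # Only remove items that were actually added, or the counters drift.
        for p in self._positions(item):
            if self.counts[p] > 0:
                self.counts[p] -= 1

    def might_contain(self, item):
        return all(self.counts[p] > 0 for p in self._positions(item))

cbf = CountingBloomFilter()
cbf.add("old_handle")
cbf.remove("old_handle")
print(cbf.might_contain("old_handle"))   # False once removed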