How Bloom Filters Work
A Bloom filter is a space-efficient probabilistic data structure used to test whether an element is possibly in a set or definitely not in a set. It allows for false positives but never false negatives.
One example of application is the membership query in Wide Column Storage, HBase. They make document lookup faster by completely skipping some HFiles.
Structure and Initialization
A Bloom filter consists of:
- A bit array of size $m$, initialized to all 0s.
- $k$ independent hash functions, each mapping an input element to one of the $m$ positions in the bit array.
Common operations
Insertion of an Element
To insert an element $x$:
- Compute $k$ hash values: $h_1(x), h_2(x), ..., h_k(x)$.
- Set the corresponding bits in the bit array to 1.
Example:
If $h_1(x) = 3$, $h_2(x) = 7$, and $h_3(x) = 11$, the bits at positions 3, 7, and 11 are set to 1.
Membership Query
To check if an element $y$ is in the Bloom filter:
- Compute the $k$ hash values: $h_1(y), h_2(y), ..., h_k(y)$.
- If all corresponding bits in the bit array are 1, then:
- The element is possibly in the set (but could be a false positive).
- If any bit is 0, then:
- The element is definitely not in the set.
Interesting properties
False Positives & No False Negatives
- A false positive occurs when all $k$ hash positions of $y$ are already set to 1 due to previous insertions, even though $y$ was never inserted.
- A false negative never occurs because an inserted element always sets its corresponding bits, ensuring it will always be recognized as “possibly in the set.”
Deletion Issue
- Standard Bloom filters do not support deletions because clearing a bit might remove information from multiple elements.
- A variation, called a Counting Bloom Filter, uses counters instead of bits to track the number of insertions at each position.