Bloom Filters
13 Jan 2015
Check that out here: Analyze Bloom Filter
Bloom Filters are best suited when you have a massive table of data – millions and millions of entries – because they let you make an educated guess about whether a particular value is contained in your datastore before looking it up. A real-world parallel would be checking your library's catalogue to see if a book is in stock before walking the stacks to find it.
One advantage of a Bloom Filter is that it's super space-efficient. The length of the filter,
m, represents the number of bits that can be turned on.
One downside of a Bloom Filter is that a lookup can give you a false positive result: the prediction model may say "yes, we probably have that key stored", even when it's not really there.
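To make that concrete, here's a minimal sketch of a Bloom Filter in Python. This is an illustrative toy, not the project's actual implementation; deriving the k bit positions from a salted SHA-256 is an assumption on my part:

```python
import hashlib

class BloomFilter:
    """Toy Bloom Filter: m bits, k hash functions."""

    def __init__(self, m=1024, k=3):
        self.m = m
        self.k = k
        self.bits = [False] * m

    def _indexes(self, word):
        # Derive k bit positions by salting a single hash; real
        # implementations often use independent hash functions instead.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{word}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, word):
        for idx in self._indexes(word):
            self.bits[idx] = True

    def might_contain(self, word):
        # True means "probably"; False is always definitive.
        return all(self.bits[idx] for idx in self._indexes(word))
```

A `True` from `might_contain` is only a guess, but a `False` is guaranteed correct, which is exactly why it's safe to consult the filter before doing the real lookup.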
So our Bloom Filter Analysis page runs 10,000 trials. Each trial:
- Creates a random word.
- Checks whether the Bloom Filter thinks it's contained.
- Increments a False Positive counter if it wasn't really contained.
After the 10,000 trials finish, the false positive rate is displayed as a percentage, alongside the theoretical expectation for a Bloom Filter of that size and density, so you can see how closely our implementation matches the math.
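The experiment can be sketched end-to-end like this. It's a standalone simulation with hypothetical values for m, k, and n (the real page's parameters may differ), and the theoretical expectation uses the standard approximation (1 − e^(−kn/m))^k:

```python
import hashlib
import math
import random
import string

def indexes(word, m, k):
    # k bit positions via a salted hash (an assumption; the real
    # implementation may derive its indexes differently).
    return [int(hashlib.sha256(f"{i}:{word}".encode()).hexdigest(), 16) % m
            for i in range(k)]

def random_word(length=8):
    return "".join(random.choices(string.ascii_lowercase, k=length))

m, k, n = 10_000, 3, 1_000           # hypothetical bits, hashes, inserts
bits = [False] * m
inserted = set()
for _ in range(n):                   # fill the filter with n random words
    word = random_word()
    inserted.add(word)
    for i in indexes(word, m, k):
        bits[i] = True

trials, false_positives = 10_000, 0
for _ in range(trials):              # the 10,000-trial experiment
    word = random_word()
    filter_says_yes = all(bits[i] for i in indexes(word, m, k))
    if filter_says_yes and word not in inserted:
        false_positives += 1

observed = 100 * false_positives / trials
expected = 100 * (1 - math.exp(-k * n / m)) ** k
print(f"observed {observed:.2f}% vs expected {expected:.2f}%")
```

With these numbers the expectation works out to roughly 1.7%, and the observed rate should land close to it over 10,000 trials.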
What happens in the case of a false positive? Nothing terrible: it just means a lookup for that value will be run against the accompanying hash table, and that lookup will find nothing's really there.
The variable n represents the number of values that have already been inserted. As n grows, so does the false positive rate, because hash collisions become more likely.
An implementer could drive the false positive rate down – or keep the same rate with a smaller, more memory-efficient filter – by adding more hash functions,
k, but this comes at a cost in speed, as each prediction and insertion requires calculating more index bits to turn on.
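To see that tradeoff in numbers, the same approximation (1 − e^(−kn/m))^k can be evaluated for a few values of k; the m and n here are hypothetical:

```python
import math

m, n = 10_000, 1_000  # hypothetical: filter bits and inserted values

def false_positive_rate(k):
    # Standard approximation for a Bloom Filter's false positive rate.
    return (1 - math.exp(-k * n / m)) ** k

for k in (1, 3, 5, 7, 9):
    print(f"k={k}: {100 * false_positive_rate(k):.2f}%")

# The rate bottoms out near k = (m / n) * ln 2, about 6.9 here; past
# that point, extra hash functions only cost time without helping.
k_optimal = (m / n) * math.log(2)
```

So more hash functions help only up to a point: beyond the optimum, every insert and lookup pays for extra hashing while accuracy actually gets worse.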
All in all, it was a fun thing to create, and a nice learning opportunity. Thanks Andy for being such a great partner, and Hack Reactor for the encouragement.