Post Syndicated from Lennart Poettering original https://0pointer.net/blog/projects/bloom.html
For future reference (mostly for myself), here’s a little summary of how to
use Bloom filters in
real world applications.
Most references are terse and vague on how to pick the hash functions for
bloom filters, so here’s some detail about that: For small filters, just use a
boring and fast hash function like the djb hash
function and split up the 32bit result into smaller independent chunks for
each of the k hash indexes you’ll need. Often those 32 bits already provide
enough hash bits to get enough independent bloom filter indexes. And if they
don’t you basically have three options:
- Use multiple different hash functions, and then MurmurHash seems to be a
very good choice. It’s simple, readily usable code (even in C, though the
reference implementation claims to be C++), and properly licensed. It is a hash
function that takes a seed parameter which can be used to create as many
independent hash functions as needed. - Use a cryptographic hash function. Most of them can be implemented really
fast on modern CPUs and are already available in some library you use anyway.
SHA512 for example outputs plenty bits you can split into k chunks as you need
them for your k bloom filter indexes. (Of course, if you are afraid of US
export regulations this might be a choice you want to avoid.) - Use two independent hash functions and combine them
linearly.
The size of the bloom filter and the number of hash functions you should be
using depending on your application can be calculated using the formulas on the
Wikipedia page:
m = -n*ln(p)/(ln(2)^2)
This will tell you the number of bits m to use for your filter, given the
number n of elements in your filter and the false positive probability p you
want to achieve. All that for the ideal number of hash functions k which you
can calculate like this:
k = 0.7*m/n
And that’s already everything you need to know to build good bloom filters.
If you know the p and n for your use case the above will tell you the m and k, and
how to choose the k hash functions.
Bloom filters are a really really useful tool, and given their simplicity
something every developer should be aware of.
(And in case you were wondering what this all is about, Kay Sievers and I
were discussing using bloom filters in the libudev netlink BSD socket filters,
to allow monitoring a certain subset of devices that is orthogonal to the usual
subsystem hierarchy, and all that in a way where the number of wakeups in
listening clients is minimized)