Cascading Bloom Filters

Bloom filters are an efficient way to check set inclusion for large sets, but they can give false positives. In cases where all testable items (both including and excluding the inserted subset) belong to a known finite set, it's possible to create a relatively large sparsely populated bloom filter that is known to be free of false positives. Instead, a much smaller set of "cascading" bloom filters can be constructed where the known false positives from one bloom filter are used as the finite set of known testable data for another smaller bloom filter. These successive "layers" of bloom filters become smaller as the known false positives are whittled away to zero by the final bloom filter.

This page allows testing bloom filter parameters to see how many layers are needed to eliminate false positives. The sizes of a single-layer and multi-layer bloom filters can be compared, which shows that cascading bloom filters are often (always?) smaller in total size.

See:

items in set to test
of items to insert into level 1 bloom filter
level 1 bloom filter bits
bits set per item using separate hash functions
PRNG seed value

Successive chunks of the SHA-256 hash of each item are used as separate "hash functions" that point to the individual bits to flip to 1s in the bloom filters, as described in the Security Now 989 show notes linked above.

If any false positives are found for any "level," another smaller bloom filter is created and populated with that level's set of false positives. Each level uses 1 to 3 fewer bits per hash function so each successive bloom filter is one half, one quarter, or one eighth the size of the previous level.

Depending on settings, many false positives may be produced. This may cause quite a few levels of bloom filters to be created, and may require a few seconds to run.

Test data is generated based on the seed value, allowing for repeat tests.