Java 中使用 Guava 库的布隆过滤器的交集和并集

Intersections and unions of Bloom filters in Java using the Guava library

布隆过滤器似乎很有希望解决我正在处理的现实问题。一个流行的 java 实现似乎是 Google 的 guava 库。

我需要使用 bloom filter 的并集运算,如果有交集运算就好了,但我可以解决它。

在 java 文档中,我找不到执行交集的方法,而 putAll 方法似乎像并集操作一样工作。

所以我的问题是:

  1. putAll() 是获取两个布隆过滤器并集的​​正确方法吗?
  2. 是否有非反射的方式来获取两个或多个布隆过滤器的交集?
  3. 万一反射没问题,我们可以在 'data' 字段上安全地执行双向运算(或,和)吗?

如果有人可以推荐任何其他流行且经过良好测试的库,该库在 maven 的存储库中可用,并且具有交集和联合,那也可以解决我的需求。

  1. Is putAll() the correct way to get the union of two bloom filters?

是的,BloomFilter#putAll(BloomFilter) JavaDoc 说:

Combines this Bloom filter with another Bloom filter by performing a bitwise OR of the underlying data. The mutations happen to this instance. Callers must ensure the Bloom filters are appropriately sized to avoid saturating them.

问题是它会抛出 IllegalArgumentException - if isCompatible(that) == false,所以如果你想使用这个方法,你必须确保 isCompatible returns true:

Determines whether a given Bloom filter is compatible with this Bloom filter. For two Bloom filters to be compatible, they must:

  • not be the same instance
  • have the same number of hash functions
  • have the same bit size
  • have the same strategy
  • have equal funnels

也就是说,让我们尝试回答第二个问题并找到更简单的解决方案。

  1. Is there a non-reflective way to get the intersection of two or more bloom filters?

是的,您可以使用 BloomFilters 实现 Predicate 并使用其 .and().or() 方法这一事实,记住 mightContain 行为取决于关于 BloomFilter 的参数。

示例:

// bloomFilter1: [0, 100], expected size 200, FPP 1%
BloomFilter<Integer> bloomFilter1 = IntStream.rangeClosed(0, 100)
        .boxed()
        .collect(toBloomFilter(Funnels.integerFunnel(), 200L, 0.01d));

// bloomFilter2: [80, 200], expected size 200, FPP 1%
BloomFilter<Integer> bloomFilter2 = IntStream.rangeClosed(80, 200)
        .boxed()
        .collect(toBloomFilter(Funnels.integerFunnel(), 200L, 0.01d));

// bloomFilter2: [80, 200], expected size 1000, FPP 0.1%
BloomFilter<Integer> bloomFilter3 = IntStream.rangeClosed(80, 200)
        .boxed()
        .collect(toBloomFilter(Funnels.integerFunnel(), 1000L, 0.001d));

assertThat(bloomFilter1.isCompatible(bloomFilter2)).isTrue(); // same parameters
assertThat(bloomFilter1.isCompatible(bloomFilter3)).isFalse(); // different parameters

BloomFilter<Integer> bloomFilterUnion = bloomFilter1.copy(); // preserves BF parameters
bloomFilterUnion.putAll(bloomFilter2);
// note that bloomFilterUnion.putAll(bloomFilter3) would throw IAE

// using AssertJ Predicate asserts https://joel-costigliola.github.io/assertj/core-8/api/org/assertj/core/api/PredicateAssert.html

// or == union == putAll
assertThat(bloomFilterUnion).accepts(90);
assertThat(bloomFilter1.or(bloomFilter2)).accepts(90);
assertThat(bloomFilter1.or(bloomFilter3)).accepts(90);
assertThat(bloomFilterUnion).accepts(42);
assertThat(bloomFilter1.or(bloomFilter2)).accepts(42);
assertThat(bloomFilter1.or(bloomFilter3)).accepts(42);
assertThat(bloomFilterUnion).accepts(180);
assertThat(bloomFilter1.or(bloomFilter2)).accepts(180);
assertThat(bloomFilter1.or(bloomFilter3)).accepts(180);
assertThat(bloomFilterUnion).rejects(300);
assertThat(bloomFilter1.or(bloomFilter2)).rejects(300);
assertThat(bloomFilter1.or(bloomFilter3)).rejects(300);

// and == intersection
assertThat(bloomFilter1.and(bloomFilter2)).accepts(90);
assertThat(bloomFilter1.and(bloomFilter3)).accepts(90);
assertThat(bloomFilter1.and(bloomFilter2)).rejects(42);
assertThat(bloomFilter1.and(bloomFilter3)).rejects(42);
assertThat(bloomFilter1.and(bloomFilter2)).rejects(180);
assertThat(bloomFilter1.and(bloomFilter3)).rejects(180);
assertThat(bloomFilter1.and(bloomFilter2)).rejects(300);
assertThat(bloomFilter1.and(bloomFilter3)).rejects(300);
  1. In case one is okay with reflection, can we safely perform biwise operations (or, and) on the 'data' field?

我个人不会弄乱任何 class 的内部结构,但在我看来,这里没有必要这样做。