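- A classic hash function has an infinite input domain.
- A classic hash function has a fixed, finite output range (call it S).
- Passing the same input to a hash function always returns the same value.
- Passing different inputs may return the same value (a collision) or different values.
- The return values for different inputs are spread as evenly as possible over S.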
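The properties above can be observed with a few lines of Java. This is only an illustrative sketch: `HashPropertiesDemo`, `bucket` and `RANGE` are hypothetical names, and `String.hashCode` stands in for a "classic" hash function purely because it ships with the JDK.

```java
class HashPropertiesDemo {
    // The output range S is the fixed, finite set [0, RANGE).
    static final int RANGE = 16;

    // Maps any string (infinite input domain) to one of RANGE buckets.
    static int bucket(String input) {
        return Math.floorMod(input.hashCode(), RANGE);
    }

    public static void main(String[] args) {
        // Same input, same output.
        System.out.println(bucket("bloom") == bucket("bloom")); // true

        // Different inputs can collide: "Aa" and "BB" share a hashCode.
        System.out.println("Aa".hashCode() == "BB".hashCode()); // true

        // Count how many of 16,000 keys land in each bucket to get a rough
        // picture of how evenly the outputs are spread over S.
        int[] counts = new int[RANGE];
        for (int i = 0; i < 16_000; i++) {
            counts[bucket("key" + i)]++;
        }
        System.out.println(java.util.Arrays.toString(counts));
    }
}
```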
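At this point the effect of one input object on the bit array is complete: several positions have been blackened, that is, set to 1. Every subsequent input object blackens the array in the same way, and the result is a Bloom filter that represents the set of all input objects.
How do we check whether an object is in the filter? Suppose the input object is hash1. We compute k values with the k hash functions and take each one modulo m (% m), which gives k values in [0, m-1]. We then check whether all k of those positions in the bit array are black. If any one of them is not black, hash1 is definitely not in the set. If all of them are black, hash1 is reported as being in the set, but this can be a false positive: when there are too many input objects for the size of the bit array, most positions end up blackened, so the k positions for hash1 may all have been blackened by other objects, and hash1 is wrongly judged to be in the set.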
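The insert and lookup steps described above can be sketched in plain Java. This is a minimal illustration, not how any particular library does it. `SimpleBloomFilter` is a hypothetical class, and deriving the i-th hash as `h1 + i * h2` from two base hashes (`String.hashCode` and a 32-bit FNV-1a) is an assumption made here to avoid writing k separate hash functions.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;

class SimpleBloomFilter {
    private final BitSet bits; // the bit array; a set bit is a "black" position
    private final int m;       // size of the bit array
    private final int k;       // number of hash functions

    SimpleBloomFilter(int m, int k) {
        this.bits = new BitSet(m);
        this.m = m;
        this.k = k;
    }

    // Assumption: the i-th hash value is h1 + i * h2, reduced modulo m
    // so that it always falls in [0, m-1].
    private int index(String value, int i) {
        int h1 = value.hashCode();
        int h2 = fnv1a(value);
        return Math.floorMod(h1 + i * h2, m);
    }

    // 32-bit FNV-1a, used only as a second base hash.
    private static int fnv1a(String value) {
        int hash = 0x811C9DC5;
        for (byte b : value.getBytes(StandardCharsets.UTF_8)) {
            hash ^= (b & 0xFF);
            hash *= 16777619;
        }
        return hash;
    }

    // Blacken the k positions that belong to this value.
    void put(String value) {
        for (int i = 0; i < k; i++) {
            bits.set(index(value, i));
        }
    }

    // false: definitely absent. true: present, or a false positive.
    boolean mightContain(String value) {
        for (int i = 0; i < k; i++) {
            if (!bits.get(index(value, i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        SimpleBloomFilter filter = new SimpleBloomFilter(1 << 16, 5);
        filter.put("121");
        filter.put("122");
        System.out.println(filter.mightContain("121")); // true: it was inserted
        System.out.println(filter.mightContain("999")); // false unless a false positive
    }
}
```

Because `put` only ever sets bits and `mightContain` checks the same k positions, an inserted value can never be reported as absent. Only the opposite mistake is possible.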
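If the bit array size m is too small relative to the number of input objects, the error rate rises. Here we simply use a formula that has already been proven: given the number of input objects n and the target false positive rate p, it yields the Bloom filter size m and the number of hash functions k.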
The number of hash functions is k = ln(2) * m/n ≈ 0.7 * m/n, which rounds to about 13, so the actual false positive rate is p = 6 in 100,000.
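To make the sizing step concrete, here is a small sketch that evaluates the standard Bloom filter formulas m = -n * ln(p) / (ln 2)^2, k = (m/n) * ln 2 and p ≈ (1 - e^(-k*n/m))^k. The class and method names are hypothetical. The inputs n = 100 million and p = 0.0001 are borrowed from the Guava example in this post and are only an assumption about which numbers the text above had in mind.

```java
class BloomSizing {
    // Bits needed for n insertions at target false positive rate p:
    // m = -n * ln(p) / (ln 2)^2, rounded up.
    static long numBits(long n, double p) {
        return (long) Math.ceil(-n * Math.log(p) / (Math.log(2) * Math.log(2)));
    }

    // Number of hash functions: k = (m / n) * ln 2, rounded, at least 1.
    static int numHashFunctions(long n, long m) {
        return Math.max(1, (int) Math.round((double) m / n * Math.log(2)));
    }

    // False positive rate actually obtained once m and k are fixed:
    // p = (1 - e^(-k * n / m))^k.
    static double falsePositiveRate(long n, long m, int k) {
        return Math.pow(1 - Math.exp(-(double) k * n / m), k);
    }

    public static void main(String[] args) {
        long n = 100_000_000L; // expected insertions (assumed, see lead-in)
        double p = 0.0001;     // target false positive rate (assumed)

        long m = numBits(n, p);
        int k = numHashFunctions(n, m); // 13 for these inputs
        System.out.println("m = " + m + " bits (about " + m / 8 / 1024 / 1024 + " MiB)");
        System.out.println("k = " + k);
        System.out.println("actual p = " + falsePositiveRate(n, m, k));
    }
}
```

Rounding m up to a more convenient size and recomputing with `falsePositiveRate` shows how the actual rate can end up below the target.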
- Add the dependency
<!-- https://mvnrepository.com/artifact/com.google.guava/guava -->
- Core API
```java
/**
 * Creates a {@link BloomFilter BloomFilter<T>} with the expected number of
 * insertions and expected false positive probability.
 *
 * <p>Note that overflowing a {@code BloomFilter} with significantly more elements
 * than specified, will result in its saturation, and a sharp deterioration of its
 * false positive probability.
 *
 * <p>The constructed {@code BloomFilter<T>} will be serializable if the provided
 * {@code Funnel<T>} is.
 *
 * <p>It is recommended that the funnel be implemented as a Java enum. This has the
 * benefit of ensuring proper serialization and deserialization, which is important
 * since {@link #equals} also relies on object identity of funnels.
 *
 * @param funnel the funnel of T's that the constructed {@code BloomFilter<T>} will use
 * @param expectedInsertions the number of expected insertions to the constructed
 *     {@code BloomFilter<T>}; must be positive
 * @param fpp the desired false positive probability (must be positive and less than 1.0)
 * @return a {@code BloomFilter}
 */
public static <T> BloomFilter<T> create(
    Funnel<T> funnel, int expectedInsertions /* n */, double fpp) {
  checkNotNull(funnel);
  checkArgument(expectedInsertions >= 0, "Expected insertions (%s) must be >= 0",
      expectedInsertions);
  checkArgument(fpp > 0.0, "False positive probability (%s) must be > 0.0", fpp);
  checkArgument(fpp < 1.0, "False positive probability (%s) must be < 1.0", fpp);
  if (expectedInsertions == 0) {
    expectedInsertions = 1;
  }
  /*
   * TODO(user): Put a warning in the javadoc about tiny fpp values,
   * since the resulting size is proportional to -log(p), but there is not
   * much of a point after all, e.g. optimalM(1000, 0.0000000000000001) = 76680
   * which is less than 10kb. Who cares!
   */
  long numBits = optimalNumOfBits(expectedInsertions, fpp);
  int numHashFunctions = optimalNumOfHashFunctions(expectedInsertions, numBits);
  try {
    return new BloomFilter<T>(new BitArray(numBits), numHashFunctions, funnel,
        BloomFilterStrategies.MURMUR128_MITZ_32);
  } catch (IllegalArgumentException e) {
    throw new IllegalArgumentException("Could not create BloomFilter of " + numBits + " bits", e);
  }
}
```
```java
/**
 * Returns {@code true} if the element <i>might</i> have been put in this Bloom filter,
 * {@code false} if this is <i>definitely</i> not the case.
 */
public boolean mightContain(T object) {
  return strategy.mightContain(object, funnel, numHashFunctions, bits);
}
```
- Example
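```java
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;
import java.nio.charset.StandardCharsets;

public class BloomFilterExample {
    public static void main(String... args) {
        // Create a Bloom filter sized for 100 million insertions
        // with a false positive rate of 0.01%.
        BloomFilter<CharSequence> bloomFilter =
                BloomFilter.create(Funnels.stringFunnel(StandardCharsets.UTF_8), 100000000, 0.0001);
        bloomFilter.put("121");
        bloomFilter.put("122");
        bloomFilter.put("123");
        // "121" was inserted, so this prints true.
        System.out.println(bloomFilter.mightContain("121"));
    }
}
```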