布隆过滤器 java实现布隆过滤器过滤url

转载

mob6454cc66e0d5 2023-08-08 14:21:03

文章标签 布隆过滤器 java实现爬虫 python 散列函数存储空间 文章分类 Java 后端开发

也可以看看这篇文章：

常见URL过滤方法

1 直接查询比较

即假设要存储url A，在入库前首先查询url库中是否存在 A,如果存在，则url A 不入库，否则存入url库。这种方法准确性高，但是一旦数据量变大，占用的存储空间也变大，同时，由于要查库，数据一多，查询时间变长，存储效率下降。

2 基于hash的存储

对于给定的url，通过建立的hash函数，来获得对应的hash值，并将该值存入库中。当在检查url是否存在库中时，只要将要检查的url，通过hash函数获取其hash值，然后查看库中是否存在该hash值，存在则丢弃，否则入库。这种方法在数据量变大时，占用的存储空间也会增大，查询时间也会加长。但是它可以将url进行压缩。对于很长的url，hash值可以相对很短。

以上方法中，为加快查询速度，一般可以选择 Redis作为查询库。

布隆过滤器 Bloom Filter

布隆过滤器 java实现布隆过滤器过滤url_散列函数

基本思路（网上一大堆）：
1，设数据集合 A={a1,a2,....,an}，含n个元素，作为待操作的集合。

m的位向量V表示的集合中的元素，位向量初始值全为0。

k个具有均匀分布特性的散列函数 h1,h2,....,hk.

随机数h1,h2,......hk，使向量V的相应位置h1,h2,......hk均置为1。集合中其他元素也通过类似的操作，将向量V的若干位置为1。

随机数h1,h2,......hk，然后查看向量V的相应位置h1,h2,......hk上的值，若全为1，则该元素已经在之前的集合中；若至少有一个0存在，表明，此元素不在之前的集合中，为新元素。

算法特点：
对于已经在集合中的元素，通过5中的查找方法，一定可以判定该元素在集合中。
对于不在集合中的元素，可能会被误判在集合中。

其整个流程可以参照以下伪代码：

classBloom:
     """ Bloom Filter """
    def __init__(self,m,k,hash_fun):
    """m, size of the vector
    k, number of hash fnctions to compute
    hash_fun, hash function to use
    """ 
    self.m = m
    self.vector =[0]*m # initialize the vectorself.k = k
    self.hash_fun = hash_fun
    self.data ={}
    # data structure to store the data
    self.false_positive =0

    def insert(self,key,value):
    """ insert the pair (key,value) in the database """
        self.data[key]= value
        for i in range(self.k):
            self.vector[self.hash_fun(key+str(i))%self.m]=1

    def contains(self,key):
    """ check if key is cointained in the database
        using the filter mechanism """
        for i in range(self.k):
            if self.vector[self.hash_fun(key+str(i))%self.m]==0:
                return False
                # the key doesn't existreturnTrue# the key can be in the data set
    def get(self,key):
    """ return the value associated with key """
    if self.contains(key):
        try:returnself.data[key]# actual lookup
        except KeyError: self.false_positive +=1

参数计算：
在Bloom Filter表示方法中，对于一元素，某一位被置1的概率为1m，为0的概率为1−1m,散列函数执行k次，对于集合中所有元素，执行次数kn次，所以在运算结束时，某位仍为0 的概率为：

P0=(1−1m)kn

由于

limx→∞(1−1x)−x≈e

有

P0=e−knm

因此误判概率为：

P=(1−P0)k=(1−e−knm)k

由以上式可知，m增大，P减小；对于给定的n和m，求最小的P值。
将上式取对数有：

lnP=kln(1−e−knm)

对k求导：

1P∂P∂k=ln(1−e−knm)+k1(1−e−knm)(−1)(e−knm)(−nm)

令

∂P∂k=0

有：

ln(1−e−knm)∗(1−e−knm)=(−knm)＊e−knm

可以发现，在上式中等号的任意一边，后一个数取对数刚好是前一个数，所以等号左右两边结构相似。则有：

(1−e−knm)=e−knm

求得：

12=e−knm

求得：

k=mnln2≈0.7mn

时，误判率最小

python3中pybloom

自己使用的是python3.5，而官网pypi上下载的pybloom是时候python2版本的。所以直接安装pip install pybloom 或者Python setup.py install会出错。自己将文件中部分内容修改后就可以正常安装（python setup.py install）与运行。如何使用，参考：http://axiak.github.io/pybloomfiltermmap/

Welcome to Python BloomFilter’s documentation!

If you are here, you probably don’t need to be reminded about the nature of a Bloom filter. If you need to learn more, just visit the wikipedia page to learn more. This module implements a Bloom filter in python that’s fast and uses mmap files for better scalability. Did I mention that it’s fast?

Here’s a quick example:

from pybloomfilter import BloomFilter

bf = BloomFilter(10000000, 0.01, 'filter.bloom')

with open("/usr/share/dict/words") as f:
    for word in f:
        bf.add(word.rstrip())

print 'apple' in bf
#outputs True

本文章为转载内容，我们尊重原作者对文章享有的著作权。如有内容错误或侵权问题，欢迎原作者联系我们进行内容更正或删除文章。