Faiss Inverted Index Principles: Inverted Index Algorithms

Reposted

mob64ca14144dde 2024-02-20 13:23:01


1 Term-document incidence matrix:

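Before building an inverted index, one way to search a large document collection is to build a term-document incidence matrix: each row is the document vector of one term, and each column is the term vector of one document. A Boolean query is then answered with bitwise operations (AND, OR, NOT) on these vectors. For a large collection, however, the matrix is extremely sparse and wastes a great deal of space. Suppose the vocabulary has 500,000 terms and each document contains about 1,000 words on average: even if all 1,000 words were distinct, so that the document's term vector held 1,000 ones, at least 499,000 of its 500,000 entries (99.8%) would still be 0. It is therefore better to store only the 1s, which is the basic idea behind the inverted index.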
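As a small illustration of the bitwise evaluation described above, the following sketch stores one `java.util.BitSet` row per term and answers the query `brutus AND caesar AND NOT calpurnia`. The three terms, the four documents, and the class name `IncidenceMatrixDemo` are made-up assumptions for the example, not part of the original article.

```java
import java.util.BitSet;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical toy collection: one BitSet row per term, one bit per document.
class IncidenceMatrixDemo {
    static final int NUM_DOCS = 4;

    // Builds the term-document incidence matrix for the made-up collection.
    static Map<String, BitSet> buildMatrix() {
        Map<String, BitSet> rows = new LinkedHashMap<>();
        rows.put("brutus", bits(0, 1, 3));
        rows.put("caesar", bits(0, 1, 2, 3));
        rows.put("calpurnia", bits(1));
        return rows;
    }

    // Returns a row whose bits are set for the given docIDs.
    static BitSet bits(int... docIds) {
        BitSet row = new BitSet(NUM_DOCS);
        for (int d : docIds) {
            row.set(d);
        }
        return row;
    }

    // Evaluates: brutus AND caesar AND NOT calpurnia
    static BitSet query(Map<String, BitSet> rows) {
        BitSet result = (BitSet) rows.get("brutus").clone(); // copy so the row is not modified
        result.and(rows.get("caesar"));
        result.andNot(rows.get("calpurnia"));
        return result;
    }

    public static void main(String[] args) {
        System.out.println(query(buildMatrix())); // prints {0, 3}
    }
}
```

Each row here costs one bit per document whether or not the term occurs in it, which is exactly the waste the inverted index avoids.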

2 Inverted index (overview):

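Building an inverted index involves roughly four steps: 1. collect the documents to be indexed; 2. tokenize them; 3. apply linguistic preprocessing; 4. build the index over all resulting terms, which consists of a dictionary and the postings lists.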
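The four steps can be sketched in a few lines. This is only a minimal illustration under simplifying assumptions: tokenization splits on anything that is not a letter or digit, lowercasing stands in for real linguistic processing, and the class name `InvertedIndexBuilder` and the sample documents are invented for the example.

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

class InvertedIndexBuilder {
    // Returns a dictionary (sorted terms) mapped to postings (sorted, de-duplicated docIDs).
    static Map<String, TreeSet<Integer>> build(List<String> docs) {
        Map<String, TreeSet<Integer>> index = new TreeMap<>();
        // Step 1: the documents to index are the elements of docs; the list position is the docID.
        for (int docId = 0; docId < docs.size(); docId++) {
            // Step 2: tokenize (assumption: split on non-letter, non-digit characters).
            for (String token : docs.get(docId).split("[^\\p{L}\\p{N}]+")) {
                if (token.isEmpty()) {
                    continue; // split() yields an empty first token when the text starts with a delimiter
                }
                // Step 3: linguistic processing, reduced here to lowercasing.
                String term = token.toLowerCase(Locale.ROOT);
                // Step 4: add the docID to the term's postings.
                index.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
            }
        }
        return index;
    }

    public static void main(String[] args) {
        List<String> docs = List.of("new home sales", "home sales rise", "New rise");
        // prints {home=[0, 1], new=[0, 2], rise=[1, 2], sales=[0, 1]}
        System.out.println(build(docs));
    }
}
```

A `TreeSet` keeps each postings list sorted by docID, which is the property the merge algorithms rely on.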

3 Simple processing of Boolean queries:

First, the structure of a postings list node:

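import java.util.TreeSet;

public static class DocNodeList {
    // Next node in the postings list
    private DocNodeList next;
    // Skip pointer (used in section 3.2); null when this node has none
    private DocNodeList skip;
    // docID stored in this node
    private int docID;
    // Ascending positions of the term in this document (used in section 3.3)
    private TreeSet<Integer> positions = new TreeSet<Integer>();

    public DocNodeList() {
    }
    public DocNodeList(int docID) {
        this.docID = docID;
    }
    public DocNodeList next() {
        return next;
    }
    public DocNodeList getNext() {
        return next;
    }
    public void setNext(DocNodeList next) {
        this.next = next;
    }
    public boolean hasNext() {
        return next != null;
    }
    public DocNodeList getSkip() {
        return skip;
    }
    public void setSkip(DocNodeList skip) {
        this.skip = skip;
    }
    public boolean hasSkip() {
        return skip != null;
    }
    public int getDocID() {
        return docID;
    }
    public void setDocID(int docID) {
        this.docID = docID;
    }
    public TreeSet<Integer> getPositions() {
        return positions;
    }
    // Inserts node directly after the current node.
    public void add(DocNodeList node) {
        DocNodeList temp = this.next();
        this.setNext(node);
        // This overwrites the inserted node's next pointer, so a caller that still
        // needs node's old successor must cache it before calling add().
        node.setNext(temp);
    }
}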

3.1 Simple merge of two postings lists: locate both terms in the dictionary, fetch their postings lists, and merge (intersect) them.

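/**
 * Merges (intersects) two postings lists sorted by ascending docID.
 * The input lists are not modified.
 * @param p1 head of the first postings list
 * @param p2 head of the second postings list
 * @return head of a new list holding the docIDs present in both, or null if there are none
 */
public DocNodeList intersect(DocNodeList p1, DocNodeList p2) {
    DocNodeList head = new DocNodeList(); // sentinel, not part of the result
    DocNodeList tail = head;
    // Loop on the nodes themselves so that the last node of each list is compared too.
    while (p1 != null && p2 != null) {
        if (p1.getDocID() == p2.getDocID()) {
            // Append a copy at the tail: keeps the result ascending and the inputs intact.
            DocNodeList node = new DocNodeList(p1.getDocID());
            tail.setNext(node);
            tail = node;
            p1 = p1.getNext();
            p2 = p2.getNext();
        } else if (p1.getDocID() < p2.getDocID()) {
            p1 = p1.getNext();
        } else {
            p2 = p2.getNext();
        }
    }
    return head.getNext();
}

For a query that ANDs several terms together, the merge can be optimized by intersecting the shortest postings lists first. The procedure is as follows:

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public DocNodeList intersect(List<DocNodeList> postings) {
    if (postings.isEmpty()) {
        return null;
    }
    // Copy into an ArrayList and sort with a Comparator so the shortest list comes first.
    List<DocNodeList> terms = new ArrayList<DocNodeList>(postings);
    terms.sort(Comparator.comparingInt((DocNodeList p) -> length(p)));
    // Start from the shortest list and merge in the rest with the two-list algorithm above;
    // stop early once the intermediate result is empty (null).
    DocNodeList result = terms.get(0);
    for (int i = 1; i < terms.size() && result != null; i++) {
        result = this.intersect(result, terms.get(i));
    }
    return result;
}

// Counts the nodes of a postings list; a real index would usually read this
// document frequency from the dictionary instead of walking the list.
private int length(DocNodeList p) {
    int n = 0;
    for (; p != null; p = p.getNext()) {
        n++;
    }
    return n;
}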
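To experiment with the shortest-first ordering without the linked-list class, the same idea can be tried on plain sorted arrays. The sketch below is an assumption-laden stand-in rather than the code above: the class name `SmallestFirstAnd` and the sample docIDs are invented, and array length plays the role of document frequency.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class SmallestFirstAnd {
    // Merge-style intersection of two ascending docID arrays.
    static int[] intersect(int[] a, int[] b) {
        int[] out = new int[Math.min(a.length, b.length)];
        int i = 0, j = 0, n = 0;
        while (i < a.length && j < b.length) {
            if (a[i] == b[j]) {
                out[n++] = a[i];
                i++;
                j++;
            } else if (a[i] < b[j]) {
                i++;
            } else {
                j++;
            }
        }
        return Arrays.copyOf(out, n);
    }

    // ANDs all postings, processing the shortest lists first so that
    // intermediate results stay as small as possible.
    static int[] intersectAll(List<int[]> postings) {
        if (postings.isEmpty()) {
            return new int[0];
        }
        List<int[]> terms = new ArrayList<>(postings); // copy: do not reorder the caller's list
        terms.sort(Comparator.comparingInt((int[] p) -> p.length));
        int[] result = terms.get(0);
        for (int t = 1; t < terms.size() && result.length > 0; t++) {
            result = intersect(result, terms.get(t));
        }
        return result;
    }

    public static void main(String[] args) {
        List<int[]> postings = List.of(
                new int[] {1, 2, 4, 11, 31, 45, 173, 174},
                new int[] {2, 31, 54, 101},
                new int[] {2, 4, 31, 99, 173});
        System.out.println(Arrays.toString(intersectAll(postings))); // prints [2, 31]
    }
}
```

Because an intersection can never be longer than its shortest input, starting with the two shortest lists bounds the size of every later merge.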

3.2 Faster postings merging with skip pointers:

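Put simply, skip pointers let one step of the traversal cover more than one node, which speeds up the linear scan. In terms of time and space complexity, the extra skip field in each node only improves the constant factor; the worst case is still linear in the lengths of the two lists.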

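public DocNodeList intersectWithSkip(DocNodeList p1, DocNodeList p2) {
    DocNodeList head = new DocNodeList(); // sentinel, not part of the result
    DocNodeList tail = head;
    while (p1 != null && p2 != null) {
        if (p1.getDocID() == p2.getDocID()) {
            // Append a copy at the tail so the input lists stay intact.
            DocNodeList node = new DocNodeList(p1.getDocID());
            tail.setNext(node);
            tail = node;
            p1 = p1.getNext();
            p2 = p2.getNext();
        } else if (p1.getDocID() < p2.getDocID()) {
            if (p1.hasSkip() && p1.getSkip().getDocID() <= p2.getDocID()) {
                // Follow skips on p1 as long as the target does not overshoot p2's docID.
                while (p1.hasSkip() && p1.getSkip().getDocID() <= p2.getDocID()) {
                    p1 = p1.getSkip();
                }
            } else {
                p1 = p1.getNext();
            }
        } else {
            if (p2.hasSkip() && p2.getSkip().getDocID() <= p1.getDocID()) {
                // Symmetric case: follow skips on p2.
                while (p2.hasSkip() && p2.getSkip().getDocID() <= p1.getDocID()) {
                    p2 = p2.getSkip();
                }
            } else {
                p2 = p2.getNext();
            }
        }
    }
    return head.getNext();
}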

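Note: the length of each skip can be set to the square root of the number of nodes in the postings list, or chosen from the Fibonacci sequence. Which choice is more efficient depends on the distribution of docIDs in the postings lists being compared.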
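The square-root spacing can be tried out on sorted arrays, where a skip pointer is simply a jump of `span` indices. The following is a rough sketch under stated assumptions: skips are placed at every index that is a multiple of `span = floor(sqrt(length))`, and the class name `SkipIntersectDemo`, its helper methods, and the sample lists are all hypothetical.

```java
import java.util.Arrays;

class SkipIntersectDemo {
    // Assumed heuristic: skip span is the integer square root of the list length.
    static int skipSpan(int length) {
        return Math.max(1, (int) Math.sqrt(length));
    }

    // Index i carries a skip pointer if it is a multiple of span and the target is in range.
    static boolean hasSkip(int[] p, int i, int span) {
        return span > 1 && i % span == 0 && i + span < p.length;
    }

    // Intersects two ascending docID arrays, following skips when they do not overshoot.
    static int[] intersect(int[] a, int[] b) {
        int sa = skipSpan(a.length), sb = skipSpan(b.length);
        int[] out = new int[Math.min(a.length, b.length)];
        int i = 0, j = 0, n = 0;
        while (i < a.length && j < b.length) {
            if (a[i] == b[j]) {
                out[n++] = a[i];
                i++;
                j++;
            } else if (a[i] < b[j]) {
                if (hasSkip(a, i, sa) && a[i + sa] <= b[j]) {
                    while (hasSkip(a, i, sa) && a[i + sa] <= b[j]) {
                        i += sa; // skip target is still <= b[j], so no match is jumped over
                    }
                } else {
                    i++;
                }
            } else {
                if (hasSkip(b, j, sb) && b[j + sb] <= a[i]) {
                    while (hasSkip(b, j, sb) && b[j + sb] <= a[i]) {
                        j += sb;
                    }
                } else {
                    j++;
                }
            }
        }
        return Arrays.copyOf(out, n);
    }

    public static void main(String[] args) {
        int[] a = {2, 4, 8, 16, 19, 23, 28, 43};
        int[] b = {1, 2, 3, 5, 8, 41, 51, 60, 71};
        System.out.println(Arrays.toString(intersect(a, b))); // prints [2, 8]
    }
}
```

A longer span means fewer skip comparisons but fewer opportunities to use a skip; a shorter span means the opposite, which is why the best spacing depends on how the docIDs are distributed.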

3.3 Index with positional information

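To handle phrase queries more effectively, a biword index is clearly not enough (it has to be built on top of the single-term index and adds further index data of its own). The inverted index is therefore extended to store not only the docID but also every position at which the term (or token) occurs in that document, sorted in ascending order. One way to do this is to add a data member TreeSet<Integer> positions to DocNodeList; the sorting is left to the algorithms that Java already provides (chosen to suit the situation) rather than written by hand.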
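As a rough picture of what such an index holds, the sketch below maps each term to the documents it occurs in and, per document, to the ascending positions of its occurrences. It assumes whitespace tokenization and lowercasing only; the class name `PositionalIndexDemo`, the nested-map layout, and the sample documents are illustrative choices, not the article's own structure.

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

class PositionalIndexDemo {
    // term -> (docID -> ascending token positions within that document)
    static Map<String, TreeMap<Integer, TreeSet<Integer>>> build(List<String> docs) {
        Map<String, TreeMap<Integer, TreeSet<Integer>>> index = new TreeMap<>();
        for (int docId = 0; docId < docs.size(); docId++) {
            String[] tokens = docs.get(docId).toLowerCase(Locale.ROOT).trim().split("\\s+");
            for (int pos = 0; pos < tokens.length; pos++) {
                if (tokens[pos].isEmpty()) {
                    continue; // an empty document yields a single empty token
                }
                index.computeIfAbsent(tokens[pos], t -> new TreeMap<>())
                     .computeIfAbsent(docId, d -> new TreeSet<>())
                     .add(pos);
            }
        }
        return index;
    }

    public static void main(String[] args) {
        Map<String, TreeMap<Integer, TreeSet<Integer>>> index =
                build(List.of("to be or not to be", "be quick"));
        System.out.println(index.get("be")); // prints {0=[1, 5], 1=[0]}
        System.out.println(index.get("to")); // prints {0=[0, 4]}
    }
}
```

The inner `TreeSet` corresponds to the positions member suggested for DocNodeList, and it is considerably larger than a docID-only posting because every occurrence is recorded.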

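import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/**
 * Proximity search over two positional postings lists.
 * @param p1 head of the first term's postings list
 * @param p2 head of the second term's postings list
 * @param k maximum allowed distance between the two positions
 * @return one {docID, position1, position2} triple for every pair of positions within k
 */
public List<int[]> positionalIntersect(DocNodeList p1, DocNodeList p2, int k) {
    List<int[]> answer = new ArrayList<int[]>();
    while (p1 != null && p2 != null) {
        if (p1.getDocID() == p2.getDocID()) {
            // Positions of the second term that may still lie within k of the current pp1
            Deque<Integer> window = new ArrayDeque<Integer>();
            List<Integer> pos2 = new ArrayList<Integer>(p2.getPositions());
            int j = 0; // index into pos2; it only ever moves forward
            for (int pp1 : p1.getPositions()) {
                while (j < pos2.size()) {
                    int pp2 = pos2.get(j);
                    if (Math.abs(pp1 - pp2) <= k) {
                        window.addLast(pp2);
                    } else if (pp2 - pp1 > k) {
                        // Positions are ascending, so only when pp2 - pp1 > k can we be sure
                        // that neither pp2 nor anything after it matches pp1. Do not advance j:
                        // this pp2 may still match a later pp1.
                        break;
                    }
                    j++;
                }
                // Drop window entries that are now too far behind pp1.
                while (!window.isEmpty() && Math.abs(window.peekFirst() - pp1) > k) {
                    window.pollFirst();
                }
                // Everything left in the window matches pp1; add it to the result.
                for (int pp2 : window) {
                    answer.add(new int[] {p1.getDocID(), pp1, pp2});
                }
            }
            p1 = p1.getNext();
            p2 = p2.getNext();
        } else if (p1.getDocID() < p2.getDocID()) {
            p1 = p1.getNext();
        } else {
            p2 = p2.getNext();
        }
    }
    return answer;
}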

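Although the algorithm contains nested loops, every pointer only moves forward over sorted lists, so the scanning cost is O(n+m), where n and m are the total sizes of the two postings lists including their positions. Emitting the results adds time proportional to the number of matching position pairs, which grows with k.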
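The forward-only pointer movement is easiest to see inside a single document. The sketch below applies the sliding-window idea to two ascending position arrays; the class name `ProximityWindowDemo`, the method `withinK`, and the sample positions are assumptions made for this illustration.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

class ProximityWindowDemo {
    // Returns every pair {p1, p2} with |p1 - p2| <= k; both inputs must be ascending.
    static List<int[]> withinK(int[] pos1, int[] pos2, int k) {
        List<int[]> pairs = new ArrayList<>();
        Deque<Integer> window = new ArrayDeque<>(); // candidates from pos2, ascending
        int j = 0;                                  // never moves backward
        for (int p1 : pos1) {
            while (j < pos2.length) {
                if (Math.abs(p1 - pos2[j]) <= k) {
                    window.addLast(pos2[j]);
                } else if (pos2[j] > p1) {
                    break; // too far ahead of p1; keep it for a later p1
                }
                j++;
            }
            // Discard candidates that have fallen more than k behind p1.
            while (!window.isEmpty() && Math.abs(window.peekFirst() - p1) > k) {
                window.pollFirst();
            }
            for (int p2 : window) {
                pairs.add(new int[] {p1, p2});
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        // prints 1~2, 5~6 and 5~7 on separate lines
        for (int[] pair : withinK(new int[] {1, 5, 12}, new int[] {2, 6, 7, 30}, 2)) {
            System.out.println(pair[0] + "~" + pair[1]);
        }
    }
}
```

Each element of `pos2` is added to the window at most once and removed at most once, so the bookkeeping stays linear; only the final loop that emits pairs can do more work, and only when many positions fall within k of each other.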
