python倒序输出列表 python倒序输入

转载

mob64ca1403528a 2023-08-21 13:41:30

文章标签 python倒序输出列表 python单词倒排倒排索引 jar Python 文章分类 Python 后端开发

倒排索引（Inverted index）是信息检索中常用的一种索引方式。在Python中使用dict和set的数据结构就可以实现一个简单的倒排索引。
1 英文单词中字母的倒排索引
一般来讲Linux系统下的 "/usr/share/dict"目录下会有一些单词列表，你也可以在GitHub上下载一个包含大量英语单词的文本文件。由于我的电脑上安装了nltk，在nltk.corpus.words.words()中包含了236736个英语单词。我们用以下方法创建字母和单词的倒排索引，每个字母都对应着一系列包含该字母的单词。
from nltk.corpus import words
from collections import defaultdict
inverted_index = defaultdict(set)
wordlist = words.words()
for word in wordlist:
for letter in word.lower():
inverted_index[letter].add(word)
在这个索引中我们忽略了大小写，运行之后可以先看一看索引的数量和含有字母"a"和"j"的单词的数量。
>>> len(inverted_index)
27
>>> len(inverted_index["a"])
145409
>>> len(inverted_index["j"])
3127
索引中除了有26个字母之外还包含了"-"。除此之外，上面的查询还告诉我们的词库中有145409个不同的单词包含字母"a"或者"A"，但只有3127个单词包含字母"j"或者"J"。如果想找出同时含有"a"和"j"的单词的数量，则我们需要找出两个集合的交集。
>>> len(inverted_index["a"] & inverted_index["j"])
1800
紧接着我们来看看这些单词中的前5个。
>>> sorted(inverted_index["a"] & inverted_index["j"])[:5]
[u'Ajaja', u'Ajatasatru', u'Ajuga', u'Alejandro', u'Alpujarra']
sorted是python的内置函数，但是它会对所以1800个词进行排序，为了找出前5个最小的，使用heapq.nsmallest会更好。
>>> heapq.nsmallest(5, inverted_index["a"] & inverted_index["j"])
[u'Ajaja', u'Ajatasatru', u'Ajuga', u'Alejandro', u'Alpujarra']
想知道有没有单词同时含有"aeiouyfv"这8个字母？
>>> set.intersection(*(inverted_index[c] for c in "aeiouyfv"))
{u'effluviography', u'overfaithfully', u'overfastidiously', u'overpainfully'}
同样也可以用集合相减的操作实现一些查询，例如"q"和"u"在英文中常常一起出现，只有17个单词含有"q"但不含"u"。
>>> len(inverted_index["q"] - inverted_index["u"])
17
1.1 求交集时的顺序
先来看看如下两个语句的运行时间：
>>> timeit set.intersection(*(inverted_index[c] for c in "aje"))
1000 loops, best of 3: 366 µs per loop
>>> timeit set.intersection(*(inverted_index[c] for c in "aej"))
10 loops, best of 3: 23.1 ms per loop
我们发现在求交集时set的顺序的不同对性能的影响非常大，这是由于Python的set.intersection()实际上是如下这样两两相交进行的。
def intersection(*args):
left = args[0]
# Perform len(args)-1 pairwise-intersections
for right in args[1:]:
# Tests take O(N) time, so minimize N by choosing the smaller set
if len(left) > len(right):
left, right = right, left
# Do the pairwise intersection
result = set()
for element in left:
if element in right:
result.add(element)
left = result # Use as the start for the next intersection
return left
顺序为"aje"时，"aj"相交后留下1800个结果，然而顺序为"aej"时，"ae"相交后留下92269个结果。所以前者速度明显较快。如何得到最优的排列顺序往往不是一个简单的问题。
2 使用整数索引代替字符串
可以发现将索引中的单词改为该单词在wordlist中的位置可以减小内存的消耗，具体如下。
from nltk.corpus import words
from collections import defaultdict
inverted_index = defaultdict(set)
wordlist = words.words()
for i,word in enumerate(wordlist):
for letter in word.lower():
inverted_index[letter].add(i)

本文章为转载内容，我们尊重原作者对文章享有的著作权。如有内容错误或侵权问题，欢迎原作者联系我们进行内容更正或删除文章。