Inverted index storage: how inverted indexes relate to TF-IDF

Reposted

墨染青丝 2024-04-29 05:32:19


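Multimodal search: images, video, and text are all converted to numerical representations, and relevance is computed on those numbers.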

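Search methods

Inverted index: the first time we receive the full collection, we read through every document once and build a mapping from each keyword to the documents that contain it. When a user then searches for a particular word, say "red", we directly return the list of documents stored under the keyword index "red".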
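The keyword-to-documents mapping described above can be sketched in a few lines. This is a minimal illustration, not the storage layout of any real search engine; `build_inverted_index` and the three-document corpus are made up for this example.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercase word to the sorted ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for word in text.lower().replace(",", "").split():
            index[word].add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}


# Hypothetical corpus, for illustration only.
sample_docs = ["the red car", "a red apple, a green apple", "the blue sky"]
inverted_index = build_inverted_index(sample_docs)

print(inverted_index["red"])            # [0, 1]
print(inverted_index.get("pink", []))   # []
```

The index is built once up front; a lookup is then a single dictionary access instead of a scan over every document.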

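The documents found through the inverted index may still be far too many. If we can rank them and keep only the top-ranked ones, we save a lot of time. One of the best-known algorithms for this match ranking is TF-IDF.

TF (term frequency) and IDF (inverse document frequency)

Finding a document's keywords

The word appears many times in the document (TF)
The word appears many times only in this document, so it can represent the document (IDF)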

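TF-IDF combines the two by multiplying them. As a result, every document can be represented by a list of scores, one per word in the vocabulary. From the high and low scores we can roughly tell what the document is about. Take the first example document: TF tells us that "爱" (love) is the most frequent word in it, but IDF tells us that "莫烦" (Mofan) is more representative, so by TF-IDF this document is characterized more by "莫烦".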
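To make the "love" versus "Mofan" effect concrete, here is a small pure-Python sketch. It assumes TF is the word count divided by the count of the most frequent word in the document, and IDF is 1 + log(N / (df + 1)), the same "log" IDF used in the code later in this post. The toy corpus and the name `tf_idf_scores` are invented for the example.

```python
import math
from collections import Counter


def tf_idf_scores(docs_words):
    """Return one {word: tf * idf} dict per document."""
    n_docs = len(docs_words)
    # document frequency: number of documents that contain each word
    df = Counter(word for words in docs_words for word in set(words))
    scores = []
    for words in docs_words:
        counts = Counter(words)
        max_count = max(counts.values())
        scores.append({
            w: (c / max_count) * (1 + math.log(n_docs / (df[w] + 1)))
            for w, c in counts.items()
        })
    return scores


# Hypothetical corpus: "love" is in every document, "mofan" only in the first.
toy_docs = [s.split() for s in
            ["love love mofan", "love cats", "love dogs", "love tea", "love rain"]]
toy_scores = tf_idf_scores(toy_docs)
top_word = max(toy_scores[0], key=toy_scores[0].get)
print(top_word)   # mofan
```

"love" has the higher raw count in the first document, but because it appears everywhere its IDF is low, so "mofan" ends up with the higher combined score.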

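Searching for similar documents

Compute the TF-IDF vector of the search query
Compute the cosine similarity between the query and every document

This finds the documents that best match the query. Both the query and the documents are turned into vectors, and each vector points to some position in space. No deep learning is involved, yet TF-IDF is itself a form of vector representation.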
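The ranking step can be sketched without numpy. Cosine similarity is the dot product of two vectors divided by the product of their lengths; the vectors below are made-up weights over a three-word vocabulary, and returning 0.0 for an all-zero vector is a choice made for this sketch.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical TF-IDF weights, for illustration only.
query_vec = [1.0, 0.0, 2.0]
doc_vecs = [[2.0, 0.0, 4.0], [0.0, 3.0, 0.0], [1.0, 1.0, 0.0]]

# Rank document ids from most to least similar to the query.
ranked = sorted(range(len(doc_vecs)),
                key=lambda i: cosine(query_vec, doc_vecs[i]),
                reverse=True)
print(ranked)   # [0, 2, 1]
```

Document 0 points in exactly the same direction as the query (it is the query scaled by 2), so it ranks first even though its raw values differ.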

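TF-IDF

In essence it is one large matrix: each document is represented by a numeric vector over words, and comparing documents means comparing the similarity of those vectors.
TF and IDF are computed separately because IDF values can be reused: once IDF has been computed on a large corpus, it carries that corpus's data distribution and can be applied elsewhere, somewhat like a pre-trained model in machine learning.
The final TF-IDF result is a word-by-document matrix, that is, documents represented as word vectors.
With these vectors, searching only requires vectorizing the query and measuring its distance to each document vector to see which documents are closest to the query.
TF-IDF turns word importance into a vector representation of a document, so each vector has dominant elements, and those elements are the document's most important keywords. Keyword extraction is the other use of TF-IDF.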
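The point about reusing IDF can be sketched as "fit once, apply to new text". The helper names `fit_idf` and `weigh` are hypothetical; the IDF formula is the 1 + log(N / (df + 1)) variant used in the code later in this post, and giving unseen words a weight of 0 mirrors what that code does for unknown query words.

```python
import math
from collections import Counter


def fit_idf(docs_words):
    """Compute IDF once from a reference corpus: 1 + log(N / (df + 1))."""
    n_docs = len(docs_words)
    df = Counter(w for words in docs_words for w in set(words))
    return {w: 1 + math.log(n_docs / (d + 1)) for w, d in df.items()}


def weigh(words, idf_table):
    """Score a new text with a stored IDF table; unseen words get weight 0."""
    counts = Counter(words)
    return {w: c * idf_table.get(w, 0.0) for w, c in counts.items()}


corpus = [s.split() for s in ["it is sunny today", "today is a good day", "I am bob"]]
idf_table = fit_idf(corpus)                                # computed once
new_vec = weigh("bob is here today".split(), idf_table)    # reused on unseen text
print(new_vec)
```

Only the term counts of the new text are computed at query time; the corpus statistics come from the stored table.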
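Reference:

Cosine similarity
itertools.chain(): iterates over several containers in sequence as if they were one, which is useful when the same operation must be applied to objects held in different containers.
collections.Counter: a dict subclass for counting hashable items.
np.argsort(): returns the indices that would sort the array in ascending order.
np.ravel() flattens a multi-dimensional array to one dimension. ndarray.flatten() returns a copy, so modifying it does not affect the original array, whereas np.ravel() returns a view whenever possible, so modifying it can change the original array.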
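A short demonstration of the helpers listed above, as they are used later in the code. The arrays and lists are made-up values chosen only to make the behavior visible.

```python
import itertools

import numpy as np

# itertools.chain: iterate over several containers as if they were one
words = list(itertools.chain(["I", "am"], ["bob"]))   # ['I', 'am', 'bob']

# np.argsort: indices that would sort the array in ascending order
scores = np.array([0.2, 0.9, 0.1, 0.5])
order = np.argsort(scores)      # [2 0 3 1]
top2 = order[-2:][::-1]         # indices of the two largest scores: [1 3]

# flatten() always copies; ravel() returns a view when it can
m = np.zeros((2, 2))
flat_copy = m.flatten()
flat_copy[0] = 1.0              # m is unchanged
flat_view = m.ravel()
flat_view[0] = 7.0              # m[0, 0] is now 7.0
print(words, order, top2, m[0, 0])
```

The `order[-n:][::-1]` idiom is how the search code below picks the top-n documents from a score array.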

Python basics

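import numpy as np
from collections import Counter

# take the last four elements, then reverse them
a = [0,1,2,3,4,5,6]
print(a[-4:][::-1])
#[6, 5, 4, 3]

# multiply every element by the sum of the array
fun = lambda x:x*np.sum(x)
a = np.array([1,2,3,4])
b = fun(a)
print(b)
#[10 20 30 40]

# Counter counts characters of a string or items of a list
c = Counter('iloveyouifallinlovewithyou')
print(c.most_common(1))
#[('i', 4)]
counter = Counter(['cat','cat','dog','pig'])
print(counter)
#Counter({'cat': 2, 'dog': 1, 'pig': 1})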

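Example code

Search for "I get a coffee cup" and return the 3 documents, out of 15, that are most similar to the query.

The example computes cosine similarity; if you are not familiar with numpy, it takes some time to follow.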
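For readers new to numpy, the shape handling in the cosine step is the hard part, so here it is in isolation. The matrix and query are invented values (3 vocabulary words, 2 documents); the layout, words as rows and documents as columns, follows the code below.

```python
import numpy as np

# Hypothetical weights: 3 vocabulary words (rows) x 2 documents (columns).
doc_matrix = np.array([[1.0, 0.0],
                       [0.0, 2.0],
                       [1.0, 0.0]])
query = np.array([[1.0], [0.0], [1.0]])    # [n_vocab, 1]

# axis=0 sums down each column; keepdims=True keeps a [1, n_doc] row so the
# division below broadcasts across every row of the matrix.
doc_norms = np.sqrt(np.sum(np.square(doc_matrix), axis=0, keepdims=True))   # [1, n_doc]
unit_docs = doc_matrix / doc_norms                                          # [n_vocab, n_doc]
unit_query = query / np.sqrt(np.sum(np.square(query)))                      # [n_vocab, 1]

# [n_doc, n_vocab] @ [n_vocab, 1] -> [n_doc, 1], then flattened to [n_doc]
similarity = (unit_docs.T @ unit_query).ravel()
print(similarity.shape, similarity)
```

After every column is scaled to unit length, a single matrix product yields the cosine similarity of the query with all documents at once.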


import numpy as np
from collections import Counter
import itertools

docs = [
    "it is a good day, I like to stay here",
    "I am happy to be here",
    "I am bob",
    "it is sunny today",
    "I have a party today",
    "it is a dog and that is a cat",
    "there are dog and cat on the tree",
    "I study hard this morning",
    "today is a good day",
    "tomorrow will be a good day",
    "I like coffee, I like book and I like apple",
    "I do not like it",
    "I am kitty, I like bob",
    "I do not care who like bob, but I like kitty",
    "It is coffee time, bring your cup",
]

docs_words = [d.replace(",", "").split(" ") for d in docs]
vocab = set(itertools.chain(*docs_words))
v2i = {v: i for i, v in enumerate(vocab)}
i2v = {i: v for v, i in v2i.items()}

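def safe_log(x):
    # log of the non-zero entries only; work on a copy so the caller's array is not modified
    x = x.astype(np.float64)
    mask = x != 0
    x[mask] = np.log(x[mask])
    return x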


tf_methods = {
        "log": lambda x: np.log(1+x),
        "augmented": lambda x: 0.5 + 0.5 * x / np.max(x, axis=1, keepdims=True),
        "boolean": lambda x: np.minimum(x, 1),
        "log_avg": lambda x: (1 + safe_log(x)) / (1 + safe_log(np.mean(x, axis=1, keepdims=True))),
    }
idf_methods = {
        "log": lambda x: 1 + np.log(len(docs) / (x+1)),
        "prob": lambda x: np.maximum(0, np.log((len(docs) - x) / (x+1))),
        "len_norm": lambda x: x / (np.sum(np.square(x))+1),
    }


def get_tf(method="log"):
    # term frequency: how frequently a word appears in a doc
    _tf = np.zeros((len(vocab), len(docs)), dtype=np.float64)    # [n_vocab, n_doc]
    for i, d in enumerate(docs_words):
        counter = Counter(d)
        for v in counter.keys():
            _tf[v2i[v], i] = counter[v] / counter.most_common(1)[0][1]

    weighted_tf = tf_methods.get(method, None)
    if weighted_tf is None:
        raise ValueError
    return weighted_tf(_tf)


def get_idf(method="log"):
    # inverse document frequency: a word that appears in more docs gets a lower idf, meaning it is less important
    df = np.zeros((len(i2v), 1))
    for i in range(len(i2v)):
        d_count = 0
        for d in docs_words:
            d_count += 1 if i2v[i] in d else 0
        df[i, 0] = d_count

    idf_fn = idf_methods.get(method, None)
    if idf_fn is None:
        raise ValueError
    return idf_fn(df)

def cosine_similarity(q, _tf_idf):
    unit_q = q / np.sqrt(np.sum(np.square(q), axis=0, keepdims=True)) #[n_vocab, 1]
    unit_ds = _tf_idf / np.sqrt(np.sum(np.square(_tf_idf), axis=0, keepdims=True)) #[n_vocab, n_doc] / [1, n_doc]
    similarity = unit_ds.T.dot(unit_q).ravel() #[n_doc, 1] flattened to [n_doc]
    return similarity


def docs_score(q, len_norm=False):
    q_words = q.replace(",", "").split(" ")

    # add unknown words
    unknown_v = 0
    for v in set(q_words):
        if v not in v2i:
            v2i[v] = len(v2i)
            i2v[len(v2i)-1] = v
            unknown_v += 1
    if unknown_v > 0:
        _idf = np.concatenate((idf, np.zeros((unknown_v, 1), dtype=np.float64)), axis=0)
        _tf_idf = np.concatenate((tf_idf, np.zeros((unknown_v, tf_idf.shape[1]), dtype=np.float64)), axis=0)
    else:
        _idf, _tf_idf = idf, tf_idf
    # unknown query words were appended to the vocabulary above, with zero idf and zero tf-idf

    counter = Counter(q_words)
    q_tf = np.zeros((len(_idf), 1), dtype=np.float64)     # [n_vocab, 1]
    for v in counter.keys():
        q_tf[v2i[v], 0] = counter[v]

    q_vec = q_tf * _idf            # [n_vocab, 1]
    print("\nq_vec",q_vec.shape)
    print("\n_tf_idf",_tf_idf.shape)
    q_scores = cosine_similarity(q_vec, _tf_idf)
    print("\nq_scores",q_scores.shape)
    if len_norm:
        len_docs = [len(d) for d in docs_words]
        q_scores = q_scores / np.array(len_docs)
    return q_scores


def get_keywords(n=2):
    for c in range(3):
        col = tf_idf[:, c]
        idx = np.argsort(col)[-n:]
        print("doc{}, top{} keywords {}".format(c, n, [i2v[i] for i in idx]))

tf = get_tf()           # [n_vocab, n_doc]
idf = get_idf()         # [n_vocab, 1]
tf_idf = tf * idf       # [n_vocab, n_doc]

print("tf shape(vecb in each docs): ", tf.shape)
#print("\ntf samples:\n", tf[:2])
print("\nidf shape(vecb in all docs): ", idf.shape)
#print("\nidf samples:\n", idf[:2])
print("\ntf_idf shape: ", tf_idf.shape)
#print("\ntf_idf sample:\n", tf_idf[:2])

# keywords
get_keywords()

# compute similarity
q = "I get a coffee cup"
scores = docs_score(q)
d_ids = scores.argsort()[-3:][::-1]
print("\ntop 3 docs for '{}':\n{}".format(q, [docs[i] for i in d_ids]))

Output:

runfile('E:/spyder/tf-idf.py', wdir='E:/spyder')
tf shape(vecb in each docs):  (47, 15)

idf shape(vecb in all docs):  (47, 1)

tf_idf shape:  (47, 15)
doc0, top2 keywords ['to', 'stay']
doc1, top2 keywords ['be', 'happy']
doc2, top2 keywords ['am', 'bob']

q_vec (48, 1)

_tf_idf (48, 15)

q_scores (15,)

top 3 docs for 'I get a coffee cup':
['It is coffee time, bring your cup', 'I like coffee, I like book and I like apple', 'I have a party today']

import os
import matplotlib.pyplot as plt

def show_tfidf(tfidf, vocab, filename):
    # tfidf: [n_doc, n_vocab]
    plt.imshow(tfidf, cmap="YlGn", vmin=tfidf.min(), vmax=tfidf.max())
    plt.xticks(np.arange(tfidf.shape[1]), vocab, fontsize=6, rotation=90)
    plt.yticks(np.arange(tfidf.shape[0]), np.arange(1, tfidf.shape[0]+1), fontsize=6)
    plt.tight_layout()
    os.makedirs("./result", exist_ok=True)    # savefig fails if the folder does not exist
    plt.savefig("./result/%s.png" % filename, format="png", dpi=500)
    plt.show()
show_tfidf(tf_idf.T, [i2v[i] for i in range(tf_idf.shape[0])], "tfidf_matrix")


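scikit-learn version

Reference: https://mofanpy.com/tutorials/machine-learning/nlp/tfidf-more/ scikit-learn computes TF-IDF and cosine similarity for you.
The result is stored as a sparse matrix, which keeps only the positions that hold a value and so saves space. tf_idf.todense() converts it back to a dense matrix.
fit_transform(docs) strips punctuation and builds the vocabulary for you. With default settings it also lowercases the text and drops single-character tokens such as "I" and "a".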
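The idea behind sparse storage can be shown with a plain dictionary that keeps only the non-zero cells. This is a conceptual sketch only: it is not the format scikit-learn or SciPy actually use internally, and `to_sparse` and `to_dense` are names invented for this example.

```python
def to_sparse(dense):
    """Keep only non-zero cells as {(row, col): value}."""
    return {(r, c): v
            for r, row in enumerate(dense)
            for c, v in enumerate(row)
            if v != 0}


def to_dense(sparse, n_rows, n_cols):
    """Rebuild the full matrix, filling missing cells with 0.0."""
    dense = [[0.0] * n_cols for _ in range(n_rows)]
    for (r, c), v in sparse.items():
        dense[r][c] = v
    return dense


# Hypothetical 3 x 4 TF-IDF matrix where most entries are zero.
dense_matrix = [[0.0, 0.5, 0.0, 0.0],
                [0.0, 0.0, 0.0, 0.9],
                [0.3, 0.0, 0.0, 0.0]]
sparse_matrix = to_sparse(dense_matrix)
print(len(sparse_matrix), "stored values instead of", 3 * 4)
```

A TF-IDF matrix is mostly zeros because each document uses only a small part of the vocabulary, which is why the savings grow with corpus size.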

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
  
import matplotlib.pyplot as plt
def show_tfidf(tfidf, vocab, filename):
    # [n_doc, n_vocab]
    plt.imshow(tfidf, cmap="YlGn", vmin=tfidf.min(), vmax=tfidf.max())
    plt.xticks(np.arange(tfidf.shape[1]), vocab, fontsize=6, rotation=90)
    plt.yticks(np.arange(tfidf.shape[0]), np.arange(1, tfidf.shape[0]+1), fontsize=6)
    plt.tight_layout()
    plt.savefig("./result/%s.png" % filename, format="png", dpi=500)
    plt.show()


docs = [
    "it is a good day, I like to stay here",
    "I am happy to be here",
    "I am bob",
    "it is sunny today",
    "I have a party today",
    "it is a dog and that is a cat",
    "there are dog and cat on the tree",
    "I study hard this morning",
    "today is a good day",
    "tomorrow will be a good day",
    "I like coffee, I like book and I like apple",
    "I do not like it",
    "I am kitty, I like bob",
    "I do not care who like bob, but I like kitty",
    "It is coffee time, bring your cup",
]

vectorizer = TfidfVectorizer()
tf_idf = vectorizer.fit_transform(docs)
print("idf: ", [(n, idf) for idf, n in zip(vectorizer.idf_, vectorizer.get_feature_names_out())])
print("v2i: ", vectorizer.vocabulary_)


q = "I get a coffee cup"
qtf_idf = vectorizer.transform([q])
res = cosine_similarity(tf_idf, qtf_idf)
res = res.ravel().argsort()[-3:]
print("\ntop 3 docs for '{}':\n{}".format(q, [docs[i] for i in res[::-1]]))


i2v = {i: v for v, i in vectorizer.vocabulary_.items()}
dense_tfidf = tf_idf.todense()
show_tfidf(dense_tfidf, [i2v[i] for i in range(dense_tfidf.shape[1])], "tfidf_sklearn_matrix")
