PaddleNLP: Getting the Word for a Word Vector, and Word-Vector Matching

Reposted

mob64ca1418736f 2023-09-23 13:13:08


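Environment setup

Python 3.6
pandas -- reads and processes the CSV file
nltk -- http://www.nltk.org/ natural language toolkit, used for tokenization, lemmatization, and corpora (the code in this article needs the punkt, stopwords and wordnet data packages, which nltk.download can fetch)
gensim -- toolkit for training word2vec word vectors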

Implementation

Importing the packages

import pandas as pd
from nltk.tokenize import word_tokenize
import matplotlib.pyplot as plt  # used to visualize the dataset
from gensim.models import Word2Vec
import string
import numpy as np
from queue import PriorityQueue
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()  # lemmatizer: reduces a word to its dictionary base form

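Building the question vocabulary

The words of all questions in the database are extracted and tallied into a dictionary that maps each word to its number of occurrences. The counts of the most frequent words are then plotted.

The dataset is a CSV file, qanda.csv, with a question column and an answer column:

[Figure: screenshot of the dataset CSV file]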

data = pd.read_csv('./qanda.csv')
q = data['question']
question_list = q.values.tolist()  # list with one question text per element
a = data['answer']
answer_list = a.values.tolist()  # list with one answer text per element
# Build the question vocabulary, key: word, value: frequency
word_in_q = {}
for x in question_list:
    for x1 in word_tokenize(x.rstrip('?')):  # drop the trailing '?'
        if x1 in word_in_q:
            word_in_q[x1] += 1
        else:
            word_in_q[x1] = 1
word_num = sum(word_in_q.values())  # total number of word occurrences
word_different = len(word_in_q.keys())  # number of distinct words

# Visualization
dictionary = dict(sorted(word_in_q.items(), key=lambda x: x[1], reverse=True))  # sort by value, largest first
x_axis = list(dictionary.keys())
y_axis = list(dictionary.values())
plt.plot(x_axis[:20], y_axis[:20])  # the 20 most frequent words
plt.show()
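
The counting loop above can also be written with collections.Counter from the standard library. The following is a minimal sketch under stated assumptions: sample_questions is a made-up stand-in for question_list, count_words is a hypothetical helper, and plain whitespace splitting replaces word_tokenize so that the example runs without nltk data.

```python
from collections import Counter

# Hypothetical stand-in for question_list; the real data comes from qanda.csv.
sample_questions = [
    "What is a robot?",
    "What is a nanobot?",
    "Who invented the robot?",
]


def count_words(questions):
    """Count how often each whitespace-separated token occurs."""
    counts = Counter()
    for text in questions:
        counts.update(text.rstrip("?").split())
    return counts


word_counts = count_words(sample_questions)
total_words = sum(word_counts.values())   # same role as word_num
distinct_words = len(word_counts)         # same role as word_different
top_words = word_counts.most_common(3)    # (word, count) pairs, most frequent first
print(total_words, distinct_words, top_words)
```

Counter.most_common(n) returns the n most frequent entries already sorted, which would replace the manual sorted(...) call before plotting.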

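Text preprocessing

Text preprocessing consists of the following steps:
1. Remove uninformative words from the question set (the nltk stop-word corpus)
The stop words include: {'ourselves', 'hers', 'between', 'yourself', 'but', 'again', 'there', 'about', 'once', 'during', 'out', 'very', 'having', 'with', 'they', 'own', 'an', 'be', 'some', 'for', 'do', 'its', 'yours', 'such', 'into', 'of', 'most', 'itself', 'other', 'off', 'is', 's', 'am', 'or', 'who', 'as', 'from', 'him', 'each', 'the', 'themselves', 'until', 'below', 'are', 'we', 'these', 'your', 'his', 'through', 'don', 'nor', 'me', 'were', 'her', 'more', 'himself', 'this', 'down', 'should', 'our', 'their', 'while', 'above', 'both', 'up', 'to', 'ours', 'had', 'she', 'all', 'no', 'when', 'at', 'any', 'before', 'them', 'same', 'and', 'been', 'have', 'in', 'will', 'on', 'does', 'yourselves', 'then', 'that', 'because', 'what', 'over', 'why', 'so', 'can', 'did', 'not', 'now', 'under', 'he', 'you', 'herself', 'has', 'just', 'where', 'too', 'only', 'myself', 'which', 'those', 'i', 'after', 'few', 'whom', 't', 'being', 'if', 'theirs', 'my', 'against', 'a', 'by', 'doing', 'it', 'how', 'further', 'was', 'here', 'than'}
The question words when, how, what, where and who are taken out of this list in the code, so they stay in the questions.
2. Convert everything to lowercase
3. Remove low-frequency words (not used here)
Because the question set in this experiment is small, removing low-frequency words would make the results less accurate.
4. Convert purely numeric tokens
Tokens that consist only of digits are replaced with '#number'.
5. Lemmatization (nltk.stem.WordNetLemmatizer)
Inflected variants of a word (plural forms, suffixes, and so on) are mapped to one base form, which shortens lookups and normalizes the sentences. Note that lemmatize() treats each word as a noun unless a part-of-speech tag is passed, so verb forms such as 'called' are left unchanged.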
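
Step 3 is skipped in this article, but on a larger corpus it could look roughly like the sketch below. The helper low_frequency_words and the toy counts are hypothetical; in the article's code the counts would come from word_in_q.

```python
def low_frequency_words(word_counts, min_count=2):
    """Return the words that occur fewer than min_count times."""
    return [word for word, count in word_counts.items() if count < min_count]


# Hypothetical counts standing in for word_in_q.
toy_counts = {"what": 12, "robot": 5, "moravec": 1, "paradox": 1}
rare = low_frequency_words(toy_counts, min_count=2)
print(rare)  # ['moravec', 'paradox']
```

The resulting list could be added to the stop words before filtering. Word2Vec's min_count parameter applies a similar frequency cut-off at training time; the article sets it to 1 so that every word keeps a vector.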

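stop_words = stopwords.words('english')  # separate name, so the imported stopwords module is not shadowed
question_words = ['when', 'how', 'what', 'where', 'who']  # kept: these matter in questions
stop_words = [word for word in stop_words if word not in question_words]
# Low-frequency words are not removed when the corpus is small
#low_frequency_words = [word for word in word_in_q.keys() if word_in_q[word]<2]
#delete_word = stop_words + low_frequency_words
delete_word = stop_words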

def text_processing(question):
    every_question_word = list()
    question = question.lower()  # lowercase
    question = question.strip('?.,!')  # drop leading and trailing sentence punctuation
    for word in word_tokenize(question):
        # remove punctuation characters from the token
        word = ''.join(letter for letter in word if letter not in string.punctuation)
        if not word:  # the token was pure punctuation, skip it
            continue
        word = '#number' if word.isdigit() else word  # tokens made only of digits
        if word not in delete_word:
            every_question_word.append(lemmatizer.lemmatize(word))  # keep the lemma of each word in the question
    return every_question_word

all_question_word = list()  # the preprocessed questions of the corpus are collected in all_question_word
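
To see how the per-token rules interact, here is a simplified, dependency-free sketch. normalize_token is a hypothetical stand-in that covers only lowercasing, punctuation removal and the '#number' rule; it is not the text_processing function itself and does no stop-word filtering or lemmatization.

```python
import string


def normalize_token(token):
    """Lowercase a token, drop punctuation characters, map pure digits to '#number'."""
    token = token.lower()
    token = "".join(ch for ch in token if ch not in string.punctuation)
    return "#number" if token.isdigit() else token


examples = ["Montreal", "1,000", "3.5", "don't", "2nd", "?"]
normalized = [normalize_token(token) for token in examples]
print(normalized)  # ['montreal', '#number', '#number', 'dont', '2nd', '']
```

Because punctuation is removed before the digit check, a decimal such as 3.5 collapses to 35 and is then mapped to '#number', while a mixed token such as 2nd is kept as it is.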

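Training the word vectors

# all_question_word is still empty at this point, so the questions are preprocessed here to obtain a training corpus
model = Word2Vec([text_processing(question) for question in question_list], min_count=1)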

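Building a dictionary from each word to the indices of the questions that contain it

Recording which questions each word appears in makes it possible to narrow an actual query down to a set of candidate questions later.

index_inverted = dict()  # inverted index, key: word, value: list of question indices
for question_index, question in enumerate(question_list):  # enumerate stays correct for duplicate questions
    preprocessed_word = text_processing(question)
    all_question_word.append(preprocessed_word)
    for word in preprocessed_word:
        if word not in index_inverted:
            index_inverted[word] = [question_index]
        elif index_inverted[word][-1] != question_index:  # skip repeats within one question
            index_inverted[word].append(question_index)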
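
As an illustration of the data structure, the sketch below builds the same kind of inverted index on a toy corpus with collections.defaultdict. The function build_inverted_index and the toy questions are hypothetical and only stand in for the article's all_question_word.

```python
from collections import defaultdict


def build_inverted_index(tokenized_questions):
    """Map each word to the sorted list of question positions that contain it."""
    index = defaultdict(set)
    for position, words in enumerate(tokenized_questions):
        for word in words:
            index[word].add(position)
    return {word: sorted(positions) for word, positions in index.items()}


# Hypothetical preprocessed questions standing in for all_question_word.
toy_corpus = [
    ["what", "else", "montreal", "called"],
    ["what", "nanobot"],
    ["who", "invented", "robot"],
]
toy_index = build_inverted_index(toy_corpus)
print(toy_index["what"])   # [0, 1]
print(toy_index["robot"])  # [2]
```

Collecting positions in a set avoids duplicate entries when a word occurs several times in one question.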

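Entering the actual question and preprocessing it

Once the actual question text has been entered, it is preprocessed with the text_processing function defined earlier, and the resulting words are looked up among the keys of index_inverted. The indices of the candidate database questions that contain those words are collected in filter_index.

real_question = "What else is Montreal called?"  # could also come from speech recognition
real_question_word = text_processing(real_question)
filter_index = set()
real_question_word_exist = []  # query words that also occur in the question set
# Words that occur in the query but not in the question set are dropped,
# because the model has no vector for them when the cosine similarity is computed
for word in real_question_word:
    if word in index_inverted:
        real_question_word_exist.append(word)
        filter_index.update(index_inverted[word])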
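
The lookup step can be tried on toy data as follows. This is only a sketch: candidate_questions is a hypothetical helper, and toy_index is a hand-written stand-in for index_inverted.

```python
def candidate_questions(query_words, inverted_index):
    """Split query words into known and unknown ones and collect candidate indices."""
    known = [word for word in query_words if word in inverted_index]
    unknown = [word for word in query_words if word not in inverted_index]
    candidates = set()
    for word in known:
        candidates.update(inverted_index[word])
    return known, unknown, candidates


# Hypothetical inverted index: word -> indices of the questions that contain it.
toy_index = {"what": [0, 1], "montreal": [0], "nanobot": [1], "robot": [2]}
known, unknown, candidates = candidate_questions(["what", "city", "montreal"], toy_index)
print(known)       # ['what', 'montreal']
print(unknown)     # ['city']
print(candidates)  # {0, 1}
```

Since question words such as what are kept during preprocessing, every question that starts with the same question word becomes a candidate, so the similarity score does most of the ranking work.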

Finding the matching questions and their answers by word-vector cosine similarity
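
The matching step relies on gensim's n_similarity, which, to the best of my understanding, averages the vectors of each word list and returns the cosine between the two averages. The pure-Python sketch below illustrates that computation on made-up two-dimensional vectors; mean_vector, cosine_similarity, set_similarity and toy_vectors are all hypothetical names, not part of gensim.

```python
import math


def mean_vector(vectors):
    """Component-wise average of equally sized vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def set_similarity(words_a, words_b, vectors):
    """Cosine similarity between the mean vectors of two word lists."""
    mean_a = mean_vector([vectors[word] for word in words_a])
    mean_b = mean_vector([vectors[word] for word in words_b])
    return cosine_similarity(mean_a, mean_b)


# Made-up 2-dimensional vectors; a trained model would supply real ones.
toy_vectors = {"what": [1.0, 0.0], "montreal": [0.0, 1.0], "nanobot": [1.0, 1.0]}
same = set_similarity(["what", "montreal"], ["montreal", "what"], toy_vectors)
different = set_similarity(["what"], ["montreal"], toy_vectors)
print(round(same, 6), round(different, 6))  # 1.0 0.0
```

Under this definition word order does not matter, and a stored question with exactly the same words as the query scores 1.0.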

# Collect the candidate questions and score each one against the query
candidate_index = sorted(filter_index)
siml_list = list()
for index in candidate_index:
    siml = model.wv.n_similarity(real_question_word_exist, all_question_word[index])  # cosine similarity between the two word sets
    siml_list.append(siml)
if siml_list and max(siml_list) > 0.9:  # answer only if some question has a similarity above 0.9
    top_question = PriorityQueue()  # min-priority queue: get() returns the smallest entry
    for siml, index in zip(siml_list, candidate_index):
        top_question.put((siml, index))
        if top_question.qsize() > 5:  # once the queue holds more than 5 entries,
            top_question.get()  # drop the smallest, which leaves the 5 largest (unsorted)
    sorted_top = sorted(top_question.queue, key=lambda t: t[0], reverse=True)
    sorted_top_question = [(siml, all_question_word[index]) for siml, index in sorted_top]
    print('sorted_top_question\n', sorted_top_question)  # the 5 most similar questions

    answers = [answer_list[index] for siml, index in sorted_top]  # look up answers by question index
    print('answers\n', answers)  # the answers to those questions
else:  # the question is outside the question set and needs separate handling
    print("This question does not exist in my library!")
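
The bounded priority queue above keeps the five best scores. A shorter alternative, sketched here on made-up scores, is heapq.nlargest from the standard library; top_k and toy_scores are hypothetical names.

```python
import heapq


def top_k(scores, k=5):
    """Return the k highest (score, index) pairs, best first."""
    return heapq.nlargest(k, ((score, index) for index, score in enumerate(scores)))


# Made-up similarity scores, one per candidate question.
toy_scores = [0.32, 1.0, 0.44, 0.38, 0.10, 0.33, 0.25]
best = top_k(toy_scores, k=3)
print(best)  # [(1.0, 1), (0.44, 2), (0.38, 3)]
```

Keeping the question index next to each score lets the answer be looked up directly by position instead of by searching for the token list.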

Final output

sorted_top_question
 [(1.0, ['what', 'else', 'montreal', 'called']), (0.43783137, ['what', 'mechanical', 'knight']), (0.3780565, ['what', 'origin', 'comic', 'sans', 'font']), (0.32334372, ['what', 'moravec', 'paradox', 'state']), (0.32260603, ['what', 'nanobot'])]
answers
 ['Montreal is often called the City of Saints or the City of a Hundred Bell Towers.', 'A robot sketch made by Leonardo DaVinci.', "Comic Sans is based on Dave Gibbons' lettering in the Watchmen comic books.", "Moravec's paradox states that a computer can crunch numbers like Bernoulli, but lacks a toddler's motor skills.", 'The smallest robot possible is called a nanobot. ']