Table of Contents

  • Term Frequency
  • TFIDF
  • Word Position in Text
  • Total Document Length
  • Word Length
  • Word Span
  • Part of Speech
  • Word-Topic Relationship
  • Negation Sentences
  • Naturally Decaying Weight
  • TextRank
  • Inline Weight
  • Context Feature Vector
  • Some Failed Experiments at the End


Term Frequency

In scenarios such as new-word discovery and hot-word discovery, the higher a word's frequency, the higher its weight. Term frequency can be counted at sentence level, paragraph level, or document level:

  • Document level: a word counts 0 if it does not appear anywhere in the document, and 1 if it appears.
  • Sentence level: a word counts 0 if it does not appear in the sentence, and 1 if it appears.
  • Finest granularity: every occurrence is counted.
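A minimal sketch of the three counting granularities, assuming jieba for segmentation; the sentence splitter and the example text are illustrative, not from the original:

from collections import Counter
from re import split
from jieba import lcut

def term_frequency(text, level='occurrence'):
    """Count term frequency at the three granularities described above:
    'document' -> 0/1 for the whole text, 'sentence' -> 0/1 per sentence,
    'occurrence' -> every occurrence counts."""
    sentences = [s for s in split('[。!?!?\n]+', text) if s]
    if level == 'document':
        return Counter(set(w for s in sentences for w in lcut(s)))
    if level == 'sentence':
        return sum((Counter(set(lcut(s))) for s in sentences), Counter())
    return sum((Counter(lcut(s)) for s in sentences), Counter())

_text = '奶茶好喝。奶茶店的奶茶更好喝!'
for level in ('document', 'sentence', 'occurrence'):
    print(level, term_frequency(_text, level))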


TFIDF

TF (Term Frequency): how often a term occurs within a document.

IDF (Inverse Document Frequency): log10(total number of documents / (number of documents containing the term + 1)), as implemented in the code below.

Shortcomings of TFIDF

  1. When the corpus covers only one document type, IDF loses its meaning. For example, if every document is about cars, the car-related terms that deserve higher weights end up with low IDF.
  2. In time-sensitive text, internet buzzwords appear frequently for a while. They deserve high weights, yet during that period IDF assigns them low weights instead.
from collections import Counter
from math import log10
from re import split
from jieba.posseg import dt
FLAGS = set('a an b f i j l n nr nrfg nrt ns nt nz s t v vi vn z eng'.split())  # POS tags allowed as keyword candidates

def cut(text):
    """Split on non-alphanumeric, non-Chinese characters, POS-tag with jieba,
    and keep multi-character words whose POS tag is in FLAGS."""
    for sentence in split('[^a-zA-Z0-9\u4e00-\u9fa5]+', text.strip()):
        for w in dt.cut(sentence):
            if len(w.word) > 1 and w.flag in FLAGS:
                yield w.word

class TFIDF:
    def __init__(self):
        self.idf = None
        self.idf_max = None

    def fit(self, texts):
        """Compute idf = log10(N / (document frequency + 1)) for every word."""
        texts = [set(cut(text)) for text in texts]
        lent = len(texts)
        words = set(w for t in texts for w in t)
        self.idf = {w: log10(lent / (sum((w in t) for t in texts) + 1)) for w in words}
        self.idf_max = log10(lent)  # default idf for words unseen at fit time
        return self

    def get_idf(self, word):
        return self.idf.get(word, self.idf_max)

    def extract(self, text, top_n=10):
        """Score each word by accumulating its idf over every occurrence (i.e. tf * idf)."""
        counter = Counter()
        for w in cut(text):
            counter[w] += self.get_idf(w)
        return [i[0] for i in counter.most_common(top_n)]

tfidf = TFIDF().fit(['奶茶', '巧克力奶茶', '巧克力酸奶', '巧克力', '巧克力']*2)
print(tfidf.extract('酸奶巧克力奶茶'))

Word Position in Text

Two kinds of position weights are provided below.

Coarse-grained

Word position             Weight
Title                     10
Last word of the title    20
First sentence            4
Last sentence             3
First paragraph           3
Last paragraph            2
Elsewhere                 1
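A tiny illustrative helper for the coarse-grained scheme above; the position labels and the "take the most prominent position" rule are assumptions, not the author's code:

POSITION_WEIGHTS = {'title_tail': 20, 'title': 10, 'first_sentence': 4,
                    'last_sentence': 3, 'first_paragraph': 3, 'last_paragraph': 2, 'other': 1}

def position_weight(positions):
    """positions: the set of position labels where the word occurs."""
    return max((POSITION_WEIGHTS[p] for p in positions), default=POSITION_WEIGHTS['other'])

print(position_weight({'title', 'first_sentence'}))  # -> 10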

Fine-grained

(figure: the position-weight curve that the Gaussian-process fit below reproduces)

""""""
from sklearn.gaussian_process import GaussianProcessRegressor
from jieba import lcut

X = [[0], [.1], [.2], [.3], [.4], [.5], [.6], [.7], [.8], [.9], [1]]
Y = [[1], [.2], [.04], [.02], [.01], [0], [0], [.01], [.03], [.1], [.5]]

class GPR:
    """高斯过程回归"""
    def __init__(self):
        self.model = GaussianProcessRegressor()
        self.model.fit(X, Y)

    def predict(self, position):
        return self.model.predict([[position]])[0]

    def extract(self, text, judge):
        words = lcut(text)
        le = len(words) - 1
        entities = [(self.predict(i/le)[0], words[i]) for i in range(le+1) if judge(words[i])]
        return entities

def visualization():
    from matplotlib import pyplot as mp
    w = [[i / 500] for i in range(501)]
    z = GPR().model.predict(w)
    mp.scatter(X, Y, s=66, color='g')
    mp.scatter(w, z, s=6, color='r')
    mp.show()

"""实体抽取并返回权重"""
print(GPR().extract('剑圣联合守望者斩杀大法师', lambda x: x in {'剑圣', '大法师', '守望者'}))
"""权重分布可视化"""
visualization()

Example sentence: 剑圣联合守望者斩杀大法师

Word      Position    Relative position    Weight
剑圣      0           0%                   1.00
守望者    2           50%                  0.00
大法师    4           100%                 0.50

Total Document Length

The longer the document, the smaller any single word's contribution to it. Several formulas are provided below.

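As stand-ins for the original formulas (which were published as an image), the sketch below shows two common length normalizations; these are assumptions, not necessarily the author's formulas:

from math import log10, sqrt

def length_normalize(word_weight, doc_length, scheme='log'):
    """Damp a word's weight as the document grows (illustrative formulas only)."""
    if scheme == 'log':
        return word_weight / log10(doc_length + 10)  # logarithmic damping
    return word_weight / sqrt(doc_length)            # square-root damping

print(length_normalize(1.0, 100), length_normalize(1.0, 10000))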

Word Length

Two word-length weighting schemes are provided below; the first row follows (n − 1) / n and the second follows √n, where n is the word length in characters.

Word length n    1      2      3      4      5      6      7      8      9      10
(n − 1) / n      0.00   0.50   0.67   0.75   0.80   0.83   0.86   0.88   0.89   0.90
√n               1      1.414  1.732  2      2.236  2.449  2.646  2.828  3      3.162
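A small sketch of both word-length schemes; the function names and the example words are mine, not from the original:

from math import sqrt

def length_weight_ratio(word):
    """(n - 1) / n: saturates quickly; single-character words get 0."""
    n = len(word)
    return (n - 1) / n

def length_weight_sqrt(word):
    """sqrt(n): keeps growing with the word length."""
    return sqrt(len(word))

for w in ('茶', '奶茶', '巧克力奶茶'):
    print(w, round(length_weight_ratio(w), 2), round(length_weight_sqrt(w), 3))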

Word Span

When a word's frequency is 1, its weight is 1; when its frequency is greater than 1, two word-span weights are provided below (the formulas match span1 and span2 in the code that follows).

Simple version: weight = 1 + (last position − first position) / text length

Complex version: starting from 1, multiply by (1 + gap / text length) for each pair of consecutive occurrences, where gap is the distance between them

def span1(text):
    """Word span + term frequency (simple version)."""
    words = list(text)  # should be real segmentation; list() of characters is a stand-in here
    reversed_words = words[::-1]
    le = len(words)
    return {w: 2-((words.index(w)+reversed_words.index(w)+1)/le) for w in set(words)}

def span2(text):
    """Word span (complex version): multiply by (1 + gap/length) for every repeat occurrence."""
    c, position = dict(), dict()
    words = list(text)  # should be real segmentation; list() of characters is a stand-in here
    length = len(words)
    for i in range(length):
        word = words[i]
        if word not in c:
            c[word] = 1
        else:
            c[word] *= 1 + ((i - position[word]) / length)
        position[word] = i
    return c

_text = '清水水水水水水水里清'
print(span1(_text))
print(span2(_text))

Example sentence: 清水水水水水水水里清

Word    Simple-version weight    Complex-version weight
清      1.9                      1.9
水      1.6                      1.771561
里      1                        1

Part of Speech

In general, content words outweigh function words, and nouns deserve relatively high weights.
To choose concrete weight values, we can collect some labeled data and train a POS-weight model.
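A minimal illustrative sketch, assuming jieba POS tags; the weight values below are hand-picked assumptions, not the output of the labeled-data model mentioned above:

from jieba.posseg import lcut

# hand-picked weights for a few jieba POS tags (illustrative values only)
POS_WEIGHTS = {'n': 1.0, 'nr': 1.2, 'ns': 1.2, 'nt': 1.2, 'nz': 1.3, 'vn': 0.8,
               'v': 0.5, 'a': 0.4, 'd': 0.2, 'p': 0.1, 'u': 0.0, 'x': 0.0}

def pos_weights(text, default=0.3):
    return {w.word: POS_WEIGHTS.get(w.flag, default) for w in lcut(text)}

print(pos_weights('国产汽车的整体性能在不断优化'))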

Word-Topic Relationship

Give higher weights to words that are related to the document's topic.

For example, the sentence 【车主打开车盖,用苹果照了一下水箱】 ("the car owner opened the hood and used an Apple phone to light up the radiator") belongs to the automobile topic, so although both are nouns, 【水箱】 (radiator) deserves a higher weight than 【苹果】 (Apple).

from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer
from re import split
from jieba import cut
from collections import Counter
from numpy import argmax
from pandas import DataFrame

def segment(text):
    for phrase in split('[^a-zA-Z\u4e00-\u9fa5]+', text.strip()):
        for word in cut(phrase):
            yield word

def clf_word(texts, labels, clf=MultinomialNB()):
    """Classify each word by topic with the given classifier (multinomial naive Bayes
    by default) and save the per-word class probabilities to Excel."""
    # vectorize
    vectorizer = TfidfVectorizer(tokenizer=segment)
    x = vectorizer.fit_transform(texts)
    # fit the classifier
    clf.fit(x, labels)
    classes = clf.classes_
    print(clf.__class__.__name__, clf.score(x, labels), *classes)
    # classify every word in the corpus
    c = Counter(w for t in texts for w in segment(t)).most_common()
    ls = []
    for word, freq in c:
        predict_proba = clf.predict_proba(vectorizer.transform([word]))[0]  # class probabilities
        label = classes[argmax(predict_proba)]  # predicted class
        ls.append([freq, word, label, *predict_proba])
    df = DataFrame(ls, columns=['freq', 'word', 'label', *classes])
    df.to_excel('a.xlsx', index=False)  # save to Excel

if __name__ == '__main__':
    from data9 import X, Y  # local module: the 9-class news corpus
    clf_word(X, Y)


Negation Sentences

In specific negation contexts, keyword weights should be zeroed out (or even made negative).

Example sentence: 大力发展智能冰箱、智能空调、智能热水器等高新技术(不含智能手机和智能电视) ("vigorously develop high-tech products such as smart refrigerators, smart air conditioners and smart water heaters, excluding smart phones and smart TVs")
Extracted entities and their weights: 【智能冰箱: 2】【智能空调: 2】【智能热水器: 2】【智能手机: -1】【智能电视: -1】

A polarity lexicon and a negation-trigger regex:

{'不包含': -1, '不包括': -1, '不含': -1, '除外': -1, '包含': 1, '包括': 1}
re.compile(r'(?<![无没])不[^a-zA-Z\W\d_限无没]|除外')
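A minimal sketch of how the lexicon and regex could be applied, assuming the entities arrive already extracted; splitting on parentheses and the 2 / -1 weights follow the example above, but the helper itself is an illustrative simplification rather than the author's implementation:

import re

NEG_PATTERN = re.compile(r'(?<![无没])不[^a-zA-Z\W\d_限无没]|除外')

def entity_weights(sentence, entities, base_weight=2):
    """Entities inside a negated clause get weight -1; the rest keep base_weight."""
    weights = {}
    for clause in re.split(r'[()()]', sentence):  # split out parenthesized clauses
        negated = bool(NEG_PATTERN.search(clause))
        for e in entities:
            if e in clause:
                weights[e] = -1 if negated else base_weight
    return weights

print(entity_weights(
    '大力发展智能冰箱、智能空调、智能热水器等高新技术(不含智能手机和智能电视)',
    ['智能冰箱', '智能空调', '智能热水器', '智能手机', '智能电视']))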

Naturally Decaying Weight

  • When certain trigger words appear, the words that follow them (possibly across sentences or even paragraphs) get a weight boost that then decays naturally.

For example, trigger phrases like the following open a span whose content deserves extra weight:

以下情况不列入支持 (the following cases are not supported):
1、XXXXXXXX
2、XXXXXXXX

申请人条件如下 (applicant requirements are as follows):
1、XXXXXXXX
2、XXXXXXXX

from matplotlib import pyplot as mp
# trigger words and their boost strengths
x2y = {'b': 20, 'c': 10, 'd': 30}
# build a synthetic token sequence with triggers at positions 15, 30 and 75
length = 100
x = ['a'] * length
x[15], x[30], x[75] = 'b', 'c', 'd'
# every position after a trigger gets a boost that decays linearly with distance
y = [0] * length
for i in range(length):
    if x[i] in x2y:
        weight = x2y[x[i]]
        for j in range(i + 1, length):
            y[j] += max(0, weight - abs(i - j))
mp.plot(y)
mp.show()

(figure: the decaying weight curve plotted by the script above)

TextRank

from collections import defaultdict
from jieba.posseg import dt

ALLOW_POS = frozenset(('ns', 'n', 'vn', 'v'))  # default POS tags allowed as keywords


class WeightedUndigraph:
    d = 0.85

    def __init__(self):
        self.graph = defaultdict(list)

    def add_edge(self, start, end, weight):
        self.graph[start].append((start, end, weight))
        self.graph[end].append((end, start, weight))

    def rank(self):
        ws = defaultdict(float)
        outSum = defaultdict(float)

        wsdef = 1.0 / (len(self.graph) or 1.0)
        for n, out in self.graph.items():
            ws[n] = wsdef
            outSum[n] = sum((e[2] for e in out), 0.0)

        for x in range(10):  # 10 iterations
            for n in sorted(self.graph.keys()):  # sort for a stable, deterministic order
                s = 0
                for e in self.graph[n]:
                    s += e[2] / outSum[e[1]] * ws[e[1]]
                ws[n] = (1 - self.d) + self.d * s

        min_rank, max_rank = min(ws.values()), max(ws.values())
        for n, w in ws.items():
            ws[n] = (w - min_rank / 10.0) / (max_rank - min_rank / 10.0)  # rescale weights to a common range (no need to multiply by 100)
        return ws


class TextRank:
    def __init__(self):
        self.tokenizer = dt
        self.allow_pos = ALLOW_POS
        self.span = 5

    def flag_filter(self, wp):
        return (wp.flag in self.allow_pos) and (len(wp.word.strip()) >= 2)

    def text_rank(self, sentence, n=20, with_weight=False, allow_pos=ALLOW_POS, with_flag=False):
        """
        Parameters:
            - n: number of keywords to return
            - with_weight: whether to return weights along with the words
            - allow_pos: allowed POS tags
            - with_flag: whether to return POS flags along with the words
        """
        self.allow_pos = frozenset(allow_pos)
        g = WeightedUndigraph()
        cm = defaultdict(int)
        words = tuple(self.tokenizer.cut(sentence))
        for i, wp in enumerate(words):
            if self.flag_filter(wp):
                for j in range(i + 1, i + self.span):
                    if j >= len(words):
                        break
                    if not self.flag_filter(words[j]):
                        continue
                    if allow_pos and with_flag:
                        cm[(wp, words[j])] += 1
                    else:
                        cm[(wp.word, words[j].word)] += 1
        for terms, w in cm.items():
            g.add_edge(terms[0], terms[1], w)
        nodes_rank = g.rank()
        if with_weight:
            tags = sorted(nodes_rank.items(), key=lambda x: x[1], reverse=True)
        else:
            tags = sorted(nodes_rank, key=nodes_rank.__getitem__, reverse=True)
        return tags[:n] if n else tags


_t = '越来越多的国产汽车出现在大众的眼中,国产汽车的整体性能也在不断优化,但和老牌的欧美汽车相比还存在着一定的差距'
print(TextRank().text_rank(_t, with_weight=True, with_flag=True))

Inline Weight

Cluster the candidate words and exclude the outliers.

To be developed.
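Since this part is marked "to be developed", the following is only a speculative sketch of the stated idea (embed the candidate words, then down-weight the ones far from the centroid); every name and threshold here is an assumption:

import numpy as np

def inlier_weights(word_vectors, z_threshold=2.0):
    """word_vectors: {word: embedding}. Words whose distance to the centroid is an
    outlier (z-score above the threshold) get weight 0; the rest get weight 1."""
    words = list(word_vectors)
    vectors = np.array([word_vectors[w] for w in words])
    centroid = vectors.mean(axis=0)
    distances = np.linalg.norm(vectors - centroid, axis=1)
    z = (distances - distances.mean()) / (distances.std() + 1e-12)
    return {w: 0.0 if score > z_threshold else 1.0 for w, score in zip(words, z)}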

Context Feature Vector

To be developed.

Some Failed Experiments at the End

Failure 1:
Tried to use the dispersion (variance) of a word's embedding components as its weight, but it failed; the dispersion mostly just tracked word frequency:
high-frequency word vector [5 5 5 5 5 5 0 -5 -5 -5]: high dispersion
medium-frequency word vector [4 3 3 3 3 3 0 -3 -3 -3]: medium dispersion
low-frequency word vector [3 1 1 1 1 1 0 -1 -1 -1]: low dispersion

from re import split, fullmatch
from gensim.models import Word2Vec
from jieba import cut
from numpy import var
from pandas import DataFrame

def lcut(text):
    return [w for s in split('[\n。…;;!!??]+', text)for w in cut(s)if fullmatch('[a-zA-Z\u4e00-\u9fa5]+', w)]

def word2vector(texts):
    """Train word vectors and export each word's component variance as its weight."""
    sentences = [lcut(t) for t in texts]
    # gensim < 4.0 API (`size`, `index2word`); with gensim >= 4.0 use `vector_size`, `index_to_key`
    wv = Word2Vec(sentences, size=75, window=10, sg=1).wv
    DataFrame([(w, var(wv[w]), *wv[w]) for w in wv.index2word], columns=[
        'word', 'weight', *(str(i) for i in range(75))]).to_excel('b.xlsx', index=False)

from data9 import X  # local module: the 9-class news corpus (labels not needed)
word2vector(X)

Failure 2:
Dispersion of a word's feature distribution: the more uneven (i.e. concentrated on a few document types or topics) a word's feature distribution is, the higher its weight should be.
For example, given two words with the same IDF, if the first is concentrated in one class of documents while the second is scattered across many types, the first deserves a higher weight.

Tried to measure each word's topic-distribution concentration with a topic model trained on long documents split into short pieces, but this failed: most words ended up with near-uniform topic distributions, while some stopwords actually had the most concentrated ones.

from gensim import corpora, models
import re, jieba, numpy as np, pandas as pd

def word_lda(texts, num_topics=50):
    # tokenize: split each document into sentences, keep only alphabetic/Chinese tokens
    words_ls = [[w for w in jieba.cut(s) if re.fullmatch('[a-zA-Z\u4e00-\u9fa5]+', w)]
                for t in texts for s in re.split('[\n。…;;!!??]+', t)]
    # build the dictionary
    dictionary = corpora.Dictionary(words_ls)
    # convert each token list into a sparse bag-of-words vector
    corpus = [dictionary.doc2bow(words) for words in words_ls]
    # LDA model; num_topics sets the number of topics
    lda = models.ldamodel.LdaModel(corpus=corpus, id2word=dictionary, num_topics=num_topics)
    # topic-word matrix, normalized so each word's column sums to 1 over topics
    matrix = lda.state.get_lambda()
    matrix = matrix / np.sum(matrix, axis=0)
    # group words by their dominant topic and save to Excel
    pd.DataFrame({
        'word': [dictionary[i] for i in range(len(dictionary))],
        'topic': np.argmax(matrix, axis=0),
        'probability': np.max(matrix, axis=0),
    }).sort_values(by=['topic', 'probability'], ascending=False).to_excel('word_lda.xlsx', index=False)

from data9 import X  # local module: the 9-class news corpus (labels not needed)
word_lda(X)