HanLP segmentation results without part-of-speech tags: the HanLP segmenters

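Reposted

mob6454cc6553fc 2023-10-26 07:10:56

Tags: HanLP segmentation results without part-of-speech tags, natural language processing, big data, part of speech, java. Categories: NLP, artificial intelligence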

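HanLP segmentation

Why use HanLP for segmentation?
Dictionary-based segmentation is not very accurate; its strength is speed rather than precision. HanLP manages to stay reasonably accurate while also being relatively fast.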

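Segmenters for long and short text:
All HanLP segmenters inherit from the Segment base class.
For short text, prefer DoubleArrayTrieSegment (DAT).
For long text, consider AhoCorasickDoubleArrayTrieSegment (ACDAT).
As a rule of thumb: if you need to segment the text, use the DAT-based segmenter directly; if you do not need segmentation and only want something like sensitive-word filtering, use ACDAT.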
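
Both segmenters are dictionary-based and pick the longest dictionary word at each position. The following is a minimal pure-Python sketch of that idea, assuming a toy dictionary and a nested-dict trie. It is not HanLP's double-array implementation; `TOY_DICTIONARY`, `build_trie` and `longest_match_segment` are hypothetical names used only for illustration.

```python
# Toy dictionary (an assumption); HanLP would load its core dictionary instead.
TOY_DICTIONARY = {"江西", "鄱阳湖", "干枯", "中国", "最大", "淡水湖", "变成", "大草原", "草原"}


def build_trie(words):
    """Build a nested-dict trie; the key None marks the end of a word."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[None] = True
    return root


def longest_match_segment(text, trie):
    """Forward longest matching: at each position take the longest dictionary word."""
    result = []
    i = 0
    while i < len(text):
        node = trie
        longest_end = i + 1  # fall back to a single character
        j = i
        while j < len(text) and text[j] in node:
            node = node[text[j]]
            j += 1
            if None in node:
                longest_end = j  # remember the longest word seen so far
        result.append(text[i:longest_end])
        i = longest_end
    return result


trie = build_trie(TOY_DICTIONARY)
tokens = longest_match_segment("江西鄱阳湖干枯，中国最大淡水湖变成大草原", trie)
print(tokens)
```

Characters that start no dictionary word, such as the comma, come out as single-character tokens.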

DoubleArrayTrieSegment

from pyhanlp import *
HanLP.Config.ShowTermNature = False
segment = DoubleArrayTrieSegment()
print(segment.seg('江西鄱阳湖干枯，中国最大淡水湖变成大草原'))

[江西, 鄱阳湖, 干枯, ，, 中国, 最大, 淡水湖, 变成, 大草原]

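# Whether to show part-of-speech tags when printing segmentation results
HanLP.Config.ShowTermNature = False
# Create a DAT segmenter
segment = DoubleArrayTrieSegment()
# Enable part-of-speech tagging, which also turns on number and letter recognition
segment.enablePartOfSpeechTagging(True)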

HanLP.Config.ShowTermNature = True
segment.enablePartOfSpeechTagging(True)
for term in segment.seg("上海市虹口区大连西路550号SISU"):
    print("word: %s  nature: %s" % (term.word, term.nature))

word: 上海市  nature: ns
word: 虹口区  nature: ns
word: 大连  nature: ns
word: 西路  nature: n
word: 550  nature: m
word: 号  nature: q
word: SISU  nature: nx
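
Once each term carries a tag, it is easy to keep only some categories or to drop the tags entirely. The sketch below works on plain (word, nature) tuples copied from the output above, so it runs without HanLP. The helper names are hypothetical; with pyhanlp the pairs would presumably be built from `term.word` and `str(term.nature)`.

```python
# (word, nature) pairs copied from the output above.
tagged = [
    ("上海市", "ns"), ("虹口区", "ns"), ("大连", "ns"), ("西路", "n"),
    ("550", "m"), ("号", "q"), ("SISU", "nx"),
]


def words_with_nature(pairs, wanted):
    """Return the words whose tag is in the wanted set, preserving order."""
    return [word for word, nature in pairs if nature in wanted]


def strip_natures(pairs):
    """Drop the tags and keep only the words."""
    return [word for word, _ in pairs]


place_names = words_with_nature(tagged, {"ns"})  # ns marks place names
plain_words = strip_natures(tagged)
print(place_names)
print(plain_words)
```

ShowTermNature only changes how a term is printed; selecting by tag as above still requires part-of-speech tagging to be enabled on the segmenter.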

AhoCorasickDoubleArrayTrieSegment

segment=JClass("com.hankcs.hanlp.seg.Other.AhoCorasickDoubleArrayTrieSegment")()
HanLP.Config.ShowTermNature = False
print(segment.seg("江西鄱阳湖干枯，中国最大淡水湖变成大草原"))

[江西, 鄱阳湖, 干枯, ，, 中国, 最大, 淡水湖, 变成, 大草原]

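ACDAT is somewhat slower than DAT, but it can be more accurate than DAT on long text. The two segmenters give the same result here simply because the sample text is short.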
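
ACDAT pairs an Aho-Corasick automaton with a double-array trie. As a rough illustration of the Aho-Corasick half only, here is a pure-Python sketch that finds every dictionary word in one pass over the text, which is the kind of scan sensitive-word filtering needs. It uses plain dicts and lists rather than HanLP's data structure, and the function names and pattern list are assumptions made for this example.

```python
from collections import deque


def build_automaton(patterns):
    """Build goto/fail/output tables for a set of patterns."""
    goto = [{}]    # goto[state][char] -> next state
    fail = [0]     # fail[state] -> state to fall back to on a mismatch
    output = [[]]  # output[state] -> patterns that end at this state
    for pattern in patterns:
        state = 0
        for ch in pattern:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                output.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        output[state].append(pattern)

    # A breadth-first pass fills in the failure links level by level.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            fallback = fail[state]
            while fallback and ch not in goto[fallback]:
                fallback = fail[fallback]
            fail[nxt] = goto[fallback].get(ch, 0)
            # Inherit patterns that end at the fallback state (proper suffixes).
            output[nxt] = output[nxt] + output[fail[nxt]]
    return goto, fail, output


def find_all(text, automaton):
    """Return (start, end, pattern) for every match, scanning the text once."""
    goto, fail, output = automaton
    state = 0
    matches = []
    for index, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for pattern in output[state]:
            matches.append((index - len(pattern) + 1, index + 1, pattern))
    return matches


automaton = build_automaton(["淡水湖", "水湖", "大草原", "草原"])
matches = find_all("中国最大淡水湖变成大草原", automaton)
print(matches)
```

Overlapping words such as 淡水湖 and 水湖 are both reported; a segmenter built on top of this would still have to choose among them, for example by preferring the longest match.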

Stop words

The verbose way

from jpype import JString

from pyhanlp import *


def load_from_file(path):
    """
    Load a DoubleArrayTrie from a dictionary file.
    :param path: path to the dictionary file
    :return: a double-array trie
    """
    tree_map = JClass('java.util.TreeMap')()  # create a TreeMap instance
    with open(path, encoding='utf-8') as src:
        for word in src:
            word = word.strip()  # drop the trailing \n that Python reads in
            tree_map[word] = word
    return JClass('com.hankcs.hanlp.collection.trie.DoubleArrayTrie')(tree_map)


def load_from_words(*words):
    """
    Build a double-array trie from a list of words.
    :param words: the words to store
    :return: a double-array trie
    """
    tree_map = JClass('java.util.TreeMap')()  # create a TreeMap instance
    for word in words:
        tree_map[word] = word
    return JClass('com.hankcs.hanlp.collection.trie.DoubleArrayTrie')(tree_map)


def remove_stopwords_termlist(termlist, trie):
    # Keep only the terms whose word is not a key in the stop-word trie
    return [term.word for term in termlist if not trie.containsKey(term.word)]


def replace_stopwords_text(text, replacement, trie):
    # Scan the raw text for the longest stop-word matches and replace each one
    searcher = trie.getLongestSearcher(JString(text), 0)
    offset = 0
    result = ''
    while searcher.next():
        begin = searcher.begin
        end = begin + searcher.length
        if begin > offset:
            result += text[offset: begin]
        result += replacement
        offset = end
    if offset < len(text):
        result += text[offset:]
    return result


if __name__ == '__main__':
    HanLP.Config.ShowTermNature = False
    trie = load_from_file(HanLP.Config.CoreStopWordDictionaryPath)
    text = "停用词的意义相对而言无关紧要吧。"
    segment = DoubleArrayTrieSegment()
    termlist = segment.seg(text)
    print("Segmentation result:", termlist)
    print("After removing stop words:", remove_stopwords_termlist(termlist, trie))
    trie = load_from_words("的", "相对而言", "吧")
    print("Stop words replaced without segmentation:", replace_stopwords_text(text, "**", trie))

Segmentation result: [停用, 词, 的, 意义, 相对而言, 无关紧要, 吧, 。]
After removing stop words: ['停用', '词', '意义', '无关紧要']
Stop words replaced without segmentation: 停用词**意义**无关紧要**。
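
The replacement step does not depend on the JVM. As a sketch for checking the logic without HanLP, the same longest-match replacement can be written over a plain Python set. This is an illustrative stand-in for the trie searcher, not its implementation, and `replace_stopwords` is a hypothetical helper.

```python
def replace_stopwords(text, replacement, stopwords):
    """Replace the longest stop word found at each position with replacement."""
    max_len = max((len(w) for w in stopwords), default=0)
    result = []
    i = 0
    while i < len(text):
        # Try the longest candidate first so that longer stop words win.
        for size in range(min(max_len, len(text) - i), 0, -1):
            if text[i:i + size] in stopwords:
                result.append(replacement)
                i += size
                break
        else:
            # No stop word starts here; keep the character as it is.
            result.append(text[i])
            i += 1
    return "".join(result)


masked = replace_stopwords("停用词的意义相对而言无关紧要吧。", "**", {"的", "相对而言", "吧"})
print(masked)
```

One caveat for the trie-based version: the searcher reports offsets into a Java string, which counts UTF-16 code units, while Python slices by code point. The two agree for the text used here but can diverge for characters outside the Basic Multilingual Plane.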

The simple way

# A simple stop-word example
from pyhanlp import *

text = "停用词的意义相对而言无关紧要吧"
crf_segment = HanLP.newSegment("crf")
term_list = crf_segment.seg(text)

# Remove stop words from term_list in place
CoreStopWordDictionary = JClass("com.hankcs.hanlp.dictionary.stopword.CoreStopWordDictionary")
CoreStopWordDictionary.apply(term_list)
HanLP.Config.ShowTermNature = False

print(term_list)
print([i.word for i in term_list])

[停用词, 意义, 相对]
['停用词', '意义', '相对']
