import jieba

seg_listDef = jieba.cut("我在学习自然语言处理")                # precise mode (the default)
seg_listAll = jieba.cut("我在学习自然语言处理", cut_all=True)  # full mode
print("Default mode:" + " ".join(seg_listDef))
print("All mode:" + " ".join(seg_listAll))
jieba.cut performs word segmentation. Three of its parameters are commonly used: cut(sentence, cut_all=False, HMM=True). The first parameter is the string to be segmented. The second selects the segmentation mode: the default False is precise mode, which produces the most accurate split, while True is full mode, which cuts out every word that can be formed from the text. The third parameter switches the Hidden Markov Model on or off; the HMM itself will be covered in a later article, but a short demo of the switch follows the output below.
Building prefix dict from the default dictionary ...
Loading model from cache
Default mode:我 在 学习 自然语言 处理
All mode:我 在 学习 自然 自然语言 语言 处理
Loading model cost 0.578 seconds.
Prefix dict has been built successfully.
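The HMM flag controls whether jieba tries to discover words that are not in its dictionary (new-word discovery via the Viterbi algorithm). As a quick sketch, here is the example sentence used in jieba's own documentation; the exact output depends on the jieba version and dictionary, so treat the comments as assumptions rather than guaranteed results.

import jieba

sentence = "他来到了网易杭研大厦"
print("HMM on :", " ".join(jieba.cut(sentence, HMM=True)))   # "杭研" may be recognized even though it is not in the dictionary
print("HMM off:", " ".join(jieba.cut(sentence, HMM=False)))  # without the HMM, out-of-vocabulary words fall apart into single characters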
import jieba
seg_list = jieba.cut_for_search("我在学习自然语言处理")
print(" ".join(seg_list))
cut_for_search(sentence, HMM=True) takes two parameters: the string to be segmented and the Hidden Markov Model switch. cut_for_search performs the finer-grained segmentation used for search-engine indexing.
Both of the methods above return a generator, so the results can also be printed one by one with next():
import jieba
import sys

seg_list = jieba.cut_for_search("我在学习自然语言处理")
while True:
    try:
        print(next(seg_list))     # pull one token at a time from the generator
    except StopIteration:         # raised once the generator is exhausted
        sys.exit()
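Since the return value is an ordinary Python generator, it can also be consumed with a plain for loop, or materialized with list(), instead of calling next() by hand. A minimal sketch:

import jieba

seg_list = jieba.cut_for_search("我在学习自然语言处理")
for word in seg_list:       # iterating the generator directly avoids the StopIteration handling
    print(word)

print(list(jieba.cut_for_search("我在学习自然语言处理")))   # or collect everything into a list at once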
jieba.lcut_for_search and jieba.lcut, on the other hand, both return a list, which can be printed directly.
import jieba
seg_list = jieba.lcut_for_search("如果放到旧字典中将会出错")
print(seg_list)
Building prefix dict from the default dictionary ...
Loading model from cache
['如果', '放到', '旧', '字典', '中将', '会', '出错']
Loading model cost 0.590 seconds.
Prefix dict has been built successfully.
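A quick way to confirm the difference between the two families of functions is to check the return types directly:

import jieba

s = "我在学习自然语言处理"
print(type(jieba.cut(s)))    # <class 'generator'>
print(type(jieba.lcut(s)))   # <class 'list'>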
In this case the segmenter can merge characters into the wrong word: '中' and '将' are actually two separate words here, yet they were cut as the single word '中将'. This can be corrected with the suggest_freq(segment, tune=False) method:
Parameter:
- segment : The segments that the word is expected to be cut into. If the word should be treated as a whole, use a str.
- tune : If True, tune the word frequency.
If tune is set to True, the word frequencies in the dictionary are adjusted so that the text can actually be cut as specified by segment.
import jieba

jieba.suggest_freq(('中', '将'), True)   # tell jieba that '中' and '将' should be cut apart here
seg_list = jieba.lcut_for_search("如果放到旧字典中将会出错")
print(seg_list)
Building prefix dict from the default dictionary ...
Loading model from cache
Loading model cost 0.569 seconds.
Prefix dict has been built successfully.
['如果', '放到', '旧', '字典', '中', '将', '会', '出错']
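suggest_freq also works in the opposite direction: passing a single str tells jieba that the characters should be kept together as one word. A minimal sketch based on the example sentence in jieba's documentation (the exact output may differ across jieba versions):

import jieba

print(jieba.lcut("「台中」正确应该不会被切开", HMM=False))   # '台中' may be split into '台' and '中'
jieba.suggest_freq('台中', True)                             # raise the frequency of '台中' so it is kept whole
print(jieba.lcut("「台中」正确应该不会被切开", HMM=False))   # now '台中' stays as a single word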
jieba.analyse.extract_tags(sentence, topK=20, withWeight=False, allowPOS=(), withFlag=False) performs keyword extraction. topK sets how many of the highest-weighted keywords are returned (extract_tags ranks words by TF-IDF weight by default); allowPOS restricts the results to the given part-of-speech tags (for example 'ns', 'n', 'vn', 'v', 'nr' stand for place name, noun, verbal noun, verb, and person name; there are many more tags that are not listed here); withWeight controls whether the weight of each word is returned as well: if True, the result is a list of (word, weight) tuples. A short example that restricts allowPOS follows the two demos below.
import jieba.analyse as analyse
lines = open("白夜行.txt").read()
seg_list = analyse.extract_tags(lines,topK=20,withWeight=False,allowPOS=())
print(seg_list)
Building prefix dict from the default dictionary ...
Loading model from cache
Loading model cost 0.657 seconds.
Prefix dict has been built successfully.
['笹垣', '桐原', '雪穗', '今枝', '友彦', '利子',
'什么', '没有', '典子', '知道', '男子', '唐泽雪穗',
'警察', '菊池', '筱冢', '一成', '这么', '松浦',
'不是', '千都']
import jieba.analyse as analyse
lines = open("白夜行.txt").read()
seg_list = analyse.extract_tags(lines,topK=20,withWeight=True,allowPOS=())
print(seg_list)
Building prefix dict from the default dictionary ...
Loading model from cache
Loading model cost 0.759 seconds.
Prefix dict has been built successfully.
[('笹垣', 0.0822914672578027),
('桐原', 0.07917333917389914),
('雪穗', 0.07619078187625225),
('今枝', 0.060328999884221086),
('友彦', 0.05992228752545106),
('利子', 0.041188814819915855),
('什么', 0.028355297812861044),
('没有', 0.0282050104733996),
('典子', 0.025758449388768555),
('知道', 0.021181785348317664),
('男子', 0.021159435305523867),
('唐泽雪穗', 0.01992890557973146),
('警察', 0.018198774503613253),
('菊池', 0.01816537645564464),
('筱冢', 0.01803091457213799),
('一成', 0.01796642218172475),
('这么', 0.016991657412780303),
('松浦', 0.016132923564544516),
('不是', 0.015944699687736586),
('千都', 0.015726211205774485)]
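allowPOS can be used to keep only certain parts of speech. For illustration, the sketch below restricts the extracted keywords to person names ('nr') and place names ('ns'); the exact result depends on the text and on jieba's POS tagging.

import jieba.analyse as analyse

lines = open("白夜行.txt").read()
seg_list = analyse.extract_tags(lines, topK=10, withWeight=False, allowPOS=('nr', 'ns'))
print(seg_list)   # only words tagged as person names or place names are returned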
We can also supply a stop-word list so that the extracted keywords do not include uninformative words such as '的', '了', '什么', '这么', '是', and '不是'.
import jieba.analyse as analyse
lines = open("白夜行.txt").read()
analyse.set_stop_words("stopwords.txt")
seg_list = analyse.extract_tags(lines,topK=20,withWeight=False,allowPOS=())
print(seg_list)
Building prefix dict from the default dictionary ...
Loading model from cache
Loading model cost 0.743 seconds.
Prefix dict has been built successfully.
['笹垣', '桐原', '雪穗', '今枝', '友彦', '利子',
'典子', '唐泽雪穗', '警察', '菊池', '筱冢', '一成',
'松浦', '千都', '高宫', '奈美江', '正晴', '美佳',
'雄一', '康晴']
set_stop_words(stop_words_path) takes the path to a stop-word file. The result above clearly no longer contains non-keyword terms such as '什么', '没有', '知道', and '不是'.
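The stop-word file is just a plain-text file with one word per line. A small self-contained sketch (the file name my_stopwords.txt is made up for this example):

import jieba.analyse as analyse

# write a tiny stop-word list: one word per line
with open("my_stopwords.txt", "w", encoding="utf-8") as f:
    f.write("什么\n没有\n知道\n不是\n")

analyse.set_stop_words("my_stopwords.txt")   # words in the file are ignored during keyword extraction
lines = open("白夜行.txt").read()
print(analyse.extract_tags(lines, topK=10))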
jieba.analyse.textrank(sentence, topK=20, withWeight=False, allowPOS=('ns', 'n', 'vn', 'v'), withFlag=False)
takes the same parameters as the functions above. The basic idea of the TextRank algorithm is:
(1) segment the input text into words;
(2) build an undirected weighted graph from the co-occurrence relationships between the words (this draws on graph theory);
(3) compute the PageRank score of each node in the graph.
A minimal sketch of these three steps is given after the example output below.
import jieba.analyse as analyse
lines = open("白夜行.txt").read()
analyse.set_stop_words("stopwords.txt")
seg_list = analyse.textrank(lines,topK=20,withWeight=True,allowPOS=('n','ns','vn','v'))
for each in seg_list:
    print("Key words:" + each[0] + "\t" + "weights:" + str(each[1]))
Building prefix dict from the default dictionary ...
Loading model from cache
Loading model cost 0.586 seconds.
Prefix dict has been built successfully.
Key words:桐原 weights:1.0
Key words:可能 weights:0.42444673809670785
Key words:利子 weights:0.36847047979727526
Key words:应该 weights:0.3365721550231226
Key words:时候 weights:0.3278732402186668
Key words:警察 weights:0.3120355440367427
Key words:东西 weights:0.3068897401798211
Key words:开始 weights:0.3015519959941887
Key words:调查 weights:0.29838940592155194
Key words:典子 weights:0.29588671242198666
Key words:公司 weights:0.2939855813808517
Key words:电话 weights:0.27845709742538927
Key words:不会 weights:0.27350278630982
Key words:看到 weights:0.27028179300492206
Key words:发现 weights:0.2681890733271942
Key words:房间 weights:0.2672507877051219
Key words:工作 weights:0.2661521099652389
Key words:声音 weights:0.24633054809460497
Key words:露出 weights:0.22866657032979934
Key words:认为 weights:0.21634682021764443
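The three steps listed above can be sketched in a few lines of plain Python: segment the text, connect words that co-occur within a small window into an undirected weighted graph, and run a PageRank-style iteration over the nodes. This is only an illustration of the idea, not jieba's actual implementation (which also filters candidates by part of speech and uses its own window and damping settings); the length filter and parameter values below are assumptions for the sketch.

import jieba
from collections import defaultdict

def simple_textrank(text, window=5, damping=0.85, iterations=20, topk=10):
    # step 1: word segmentation (a crude length filter stands in for jieba's POS filtering)
    words = [w for w in jieba.lcut(text) if len(w) > 1]

    # step 2: undirected weighted co-occurrence graph over a sliding window
    graph = defaultdict(lambda: defaultdict(int))
    for i, w in enumerate(words):
        for j in range(i + 1, min(i + window, len(words))):
            if w != words[j]:
                graph[w][words[j]] += 1
                graph[words[j]][w] += 1

    # step 3: PageRank-style iteration on the graph nodes
    score = {w: 1.0 for w in graph}
    out_weight = {w: sum(nbrs.values()) for w, nbrs in graph.items()}
    for _ in range(iterations):
        new_score = {}
        for w in graph:
            rank = sum(score[v] * graph[v][w] / out_weight[v] for v in graph[w])
            new_score[w] = (1 - damping) + damping * rank
        score = new_score

    return sorted(score.items(), key=lambda kv: kv[1], reverse=True)[:topk]

print(simple_textrank(open("白夜行.txt").read()))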