import jieba

seg_listDef = jieba.cut("我在学习自然语言处理")                # precise mode (the default)
seg_listAll = jieba.cut("我在学习自然语言处理", cut_all=True)  # full mode
print("Default mode:" + " ".join(seg_listDef))
print("All mode:" + " ".join(seg_listAll))
jieba.cut performs word segmentation. Three of its parameters are commonly used: cut(sentence, cut_all=False, HMM=True). The first parameter is the string to be segmented. The second selects the segmentation mode: the default False is precise mode, which produces the most accurate split, while True is full mode, which cuts out every word that can be formed from the text. The third parameter switches the Hidden Markov Model on or off; the HMM itself will be covered in a later article, but a short demo of the switch follows the output below.
Building prefix dict from the default dictionary ...
Loading model from cache
Default mode:我 在 学习 自然语言 处理
All mode:我 在 学习 自然 自然语言 语言 处理
Loading model cost 0.578 seconds.
Prefix dict has been built successfully.
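The HMM flag controls whether jieba tries to discover words that are not in its dictionary (new-word discovery via the Viterbi algorithm). As a quick sketch, here is the example sentence used in jieba's own documentation; the exact output depends on the jieba version and dictionary, so treat the comments as assumptions rather than guaranteed results.

import jieba

sentence = "他来到了网易杭研大厦"
print("HMM on :", " ".join(jieba.cut(sentence, HMM=True)))   # "杭研" may be recognized even though it is not in the dictionary
print("HMM off:", " ".join(jieba.cut(sentence, HMM=False)))  # without the HMM, out-of-vocabulary words fall apart into single characters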
import jieba
seg_list = jieba.cut_for_search("我在学习自然语言处理")
print(" ".join(seg_list))
cut_for_search(sentence, HMM=True) takes two parameters: the string to be segmented and the Hidden Markov Model switch. cut_for_search performs the finer-grained segmentation used for search-engine indexing.
Both of the methods above return a generator, so the results can also be printed one by one with next():
import jieba
import sys

seg_list = jieba.cut_for_search("我在学习自然语言处理")
while True:
    try:
        print(next(seg_list))     # pull one token at a time from the generator
    except StopIteration:         # raised once the generator is exhausted
        sys.exit()
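Since the return value is an ordinary Python generator, it can also be consumed with a plain for loop, or materialized with list(), instead of calling next() by hand. A minimal sketch:

import jieba

seg_list = jieba.cut_for_search("我在学习自然语言处理")
for word in seg_list:       # iterating the generator directly avoids the StopIteration handling
    print(word)

print(list(jieba.cut_for_search("我在学习自然语言处理")))   # or collect everything into a list at once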
jieba.lcut_for_search and jieba.lcut, on the other hand, both return a list, which can be printed directly.
import jieba
seg_list = jieba.lcut_for_search("如果放到旧字典中将会出错")
print(seg_list)
Building prefix dict from the default dictionary ...
Loading model from cache
['如果', '放到', '旧', '字典', '中将', '会', '出错']
Loading model cost 0.590 seconds.
Prefix dict has been built successfully.
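A quick way to confirm the difference between the two families of functions is to check the return types directly:

import jieba

s = "我在学习自然语言处理"
print(type(jieba.cut(s)))    # <class 'generator'>
print(type(jieba.lcut(s)))   # <class 'list'>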
In this case the segmenter can merge characters into the wrong word: '中' and '将' are actually two separate words here, yet they were cut as the single word '中将'. This can be corrected with the suggest_freq(segment, tune=False) method:
Parameter:
- segment : The segments that the word is expected to be cut into. If the word should be treated as a whole, use a str.
- tune : If True, tune the word frequency.
If tune is set to True, the word frequencies in the dictionary are adjusted so that the text can actually be cut as specified by segment.
import jieba

jieba.suggest_freq(('中', '将'), True)   # tell jieba that '中' and '将' should be cut apart here
seg_list = jieba.lcut_for_search("如果放到旧字典中将会出错")
print(seg_list)
Building prefix dict from the default dictionary ...
Loading model from cache
Loading model cost 0.569 seconds.
Prefix dict has been built successfully.
['如果', '放到', '旧', '字典', '中', '将', '会', '出错']
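suggest_freq also works in the opposite direction: passing a single str tells jieba that the characters should be kept together as one word. A minimal sketch based on the example sentence in jieba's documentation (the exact output may differ across jieba versions):

import jieba

print(jieba.lcut("「台中」正确应该不会被切开", HMM=False))   # '台中' may be split into '台' and '中'
jieba.suggest_freq('台中', True)                             # raise the frequency of '台中' so it is kept whole
print(jieba.lcut("「台中」正确应该不会被切开", HMM=False))   # now '台中' stays as a single word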
jieba.analyse.extract_tags(sentence, topK=20, withWeight=False, allowPOS=(), withFlag=False) performs keyword extraction. topK sets how many of the highest-weighted keywords are returned (extract_tags ranks words by TF-IDF weight by default); allowPOS restricts the results to the given part-of-speech tags (for example 'ns', 'n', 'vn', 'v', 'nr' stand for place name, noun, verbal noun, verb, and person name; there are many more tags that are not listed here); withWeight controls whether the weight of each word is returned as well: if True, the result is a list of (word, weight) tuples. A short example that restricts allowPOS follows the two demos below.
import jieba.analyse as analyse
lines = open("白夜行.txt").read()
seg_list = analyse.extract_tags(lines,topK=20,withWeight=False,allowPOS=())
print(seg_list)
Building prefix dict from the default dictionary ...
Loading model from cache
Loading model cost 0.657 seconds.
Prefix dict has been built successfully.
['笹垣', '桐原', '雪穗', '今枝', '友彦', '利子',
'什么', '没有', '典子', '知道', '男子', '唐泽雪穗',
'警察', '菊池', '筱冢', '一成', '这么', '松浦',
'不是', '千都']
import jieba.analyse as analyse
lines = open("白夜行.txt").read()
seg_list = analyse.extract_tags(lines,topK=20,withWeight=True,allowPOS=())
print(seg_list)
Building prefix dict from the default dictionary ...
Loading model from cache
Loading model cost 0.759 seconds.
Prefix dict has been built successfully.
[('笹垣', 0.0822914672578027),
('桐原', 0.07917333917389914),
('雪穗', 0.07619078187625225),
('今枝', 0.060328999884221086),
('友彦', 0.05992228752545106),
('利子', 0.041188814819915855),
('什么', 0.028355297812861044),
('没有', 0.0282050104733996),
('典子', 0.025758449388768555),
('知道', 0.021181785348317664),
('男子', 0.021159435305523867),
('唐泽雪穗', 0.01992890557973146),
('警察', 0.018198774503613253),
('菊池', 0.01816537645564464),
('筱冢', 0.01803091457213799),
('一成', 0.01796642218172475),
('这么', 0.016991657412780303),
('松浦', 0.016132923564544516),
('不是', 0.015944699687736586),
('千都', 0.015726211205774485)]
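allowPOS can be used to keep only certain parts of speech. For illustration, the sketch below restricts the extracted keywords to person names ('nr') and place names ('ns'); the exact result depends on the text and on jieba's POS tagging.

import jieba.analyse as analyse

lines = open("白夜行.txt").read()
seg_list = analyse.extract_tags(lines, topK=10, withWeight=False, allowPOS=('nr', 'ns'))
print(seg_list)   # only words tagged as person names or place names are returned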
We can also supply a stop-word list so that the extracted keywords do not include uninformative words such as '的', '了', '什么', '这么', '是', and '不是'.
import jieba.analyse as analyse
lines = open("白夜行.txt").read()
analyse.set_stop_words("stopwords.txt")
seg_list = analyse.extract_tags(lines,topK=20,withWeight=False,allowPOS=())
print(seg_list)
Building prefix dict from the default dictionary ...
Loading model from cache
Loading model cost 0.743 seconds.
Prefix dict has been built successfully.
['笹垣', '桐原', '雪穗', '今枝', '友彦', '利子',
'典子', '唐泽雪穗', '警察', '菊池', '筱冢', '一成',
'松浦', '千都', '高宫', '奈美江', '正晴', '美佳',
'雄一', '康晴']
set_stop_words(stop_words_path) takes the path to a stop-word file. The result above clearly no longer contains non-keyword terms such as '什么', '没有', '知道', and '不是'.
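The stop-word file is just a plain-text file with one word per line. A small self-contained sketch (the file name my_stopwords.txt is made up for this example):

import jieba.analyse as analyse

# write a tiny stop-word list: one word per line
with open("my_stopwords.txt", "w", encoding="utf-8") as f:
    f.write("什么\n没有\n知道\n不是\n")

analyse.set_stop_words("my_stopwords.txt")   # words in the file are ignored during keyword extraction
lines = open("白夜行.txt").read()
print(analyse.extract_tags(lines, topK=10))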
jieba.analyse.textrank(sentence, topK=20, withWeight=False, allowPOS=('ns', 'n', 'vn', 'v'), withFlag=False)
takes the same parameters as the functions above. The basic idea of the TextRank algorithm is:
(1) segment the input text into words;
(2) build an undirected weighted graph from the co-occurrence relationships between the words (this draws on graph theory);
(3) compute the PageRank score of each node in the graph.
A minimal sketch of these three steps is given after the example output below.
import jieba.analyse as analyse
lines = open("白夜行.txt").read()
analyse.set_stop_words("stopwords.txt")
seg_list = analyse.textrank(lines,topK=20,withWeight=True,allowPOS=('n','ns','vn','v'))
for each in seg_list:
    print("Key words:" + each[0] + "\t" + "weights:" + str(each[1]))
Building prefix dict from the default dictionary ...
Loading model from cache
Loading model cost 0.586 seconds.
Prefix dict has been built successfully.
Key words:桐原 weights:1.0
Key words:可能 weights:0.42444673809670785
Key words:利子 weights:0.36847047979727526
Key words:应该 weights:0.3365721550231226
Key words:时候 weights:0.3278732402186668
Key words:警察 weights:0.3120355440367427
Key words:东西 weights:0.3068897401798211
Key words:开始 weights:0.3015519959941887
Key words:调查 weights:0.29838940592155194
Key words:典子 weights:0.29588671242198666
Key words:公司 weights:0.2939855813808517
Key words:电话 weights:0.27845709742538927
Key words:不会 weights:0.27350278630982
Key words:看到 weights:0.27028179300492206
Key words:发现 weights:0.2681890733271942
Key words:房间 weights:0.2672507877051219
Key words:工作 weights:0.2661521099652389
Key words:声音 weights:0.24633054809460497
Key words:露出 weights:0.22866657032979934
Key words:认为 weights:0.21634682021764443
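The three steps listed above can be sketched in a few lines of plain Python: segment the text, connect words that co-occur within a small window into an undirected weighted graph, and run a PageRank-style iteration over the nodes. This is only an illustration of the idea, not jieba's actual implementation (which also filters candidates by part of speech and uses its own window and damping settings); the length filter and parameter values below are assumptions for the sketch.

import jieba
from collections import defaultdict

def simple_textrank(text, window=5, damping=0.85, iterations=20, topk=10):
    # step 1: word segmentation (a crude length filter stands in for jieba's POS filtering)
    words = [w for w in jieba.lcut(text) if len(w) > 1]

    # step 2: undirected weighted co-occurrence graph over a sliding window
    graph = defaultdict(lambda: defaultdict(int))
    for i, w in enumerate(words):
        for j in range(i + 1, min(i + window, len(words))):
            if w != words[j]:
                graph[w][words[j]] += 1
                graph[words[j]][w] += 1

    # step 3: PageRank-style iteration on the graph nodes
    score = {w: 1.0 for w in graph}
    out_weight = {w: sum(nbrs.values()) for w, nbrs in graph.items()}
    for _ in range(iterations):
        new_score = {}
        for w in graph:
            rank = sum(score[v] * graph[v][w] / out_weight[v] for v in graph[w])
            new_score[w] = (1 - damping) + damping * rank
        score = new_score

    return sorted(score.items(), key=lambda kv: kv[1], reverse=True)[:topk]

print(simple_textrank(open("白夜行.txt").read()))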