Counting word frequencies in Python with jieba

Reposted

IT剑客之家 2023-06-04 21:10:44

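def getText():
    # Read the whole file and normalize it to lowercase
    with open("hamlet.txt", "r") as f:
        txt = f.read()
    txt = txt.lower()
    # Replace punctuation with spaces so that words split cleanly
    for ch in '|"#$%&()*+,-./:;<>+?@[\\]^_{|}~':
        txt = txt.replace(ch, " ")
    return txt

hamletTxt = getText()
words = hamletTxt.split()
counts = {}
for word in words:
    counts[word] = counts.get(word, 0) + 1

items = list(counts.items())

# Sort by the second element (the count), from largest to smallest
items.sort(key=lambda x: x[1], reverse=True)

# Print the ten most frequent words
for word, count in items[:10]:
    print(word, end=":")
    print(count)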

Output:

the:1138
and:965
to:754
of:668
you:549
a:542
i:540
my:514
hamlet:456
in:436
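
The same count can be written more compactly with the standard library. The following is only a sketch, not the original author's code: `top_words` and `_TABLE` are names made up for illustration, and the punctuation set is copied from the script above. It uses `str.translate` to blank out punctuation in one pass and `collections.Counter.most_common` in place of the manual dictionary and sort.

```python
from collections import Counter

# Punctuation set copied from the script above; every character maps to a space.
PUNCTUATION = '|"#$%&()*+,-./:;<>+?@[\\]^_{|}~'
_TABLE = str.maketrans(dict.fromkeys(PUNCTUATION, " "))


def top_words(text, n=10):
    """Return the n most frequent whitespace-separated words in text.

    Hypothetical helper: lowercases the text, replaces punctuation with
    spaces, splits on whitespace, and counts the resulting words.
    """
    cleaned = text.lower().translate(_TABLE)
    return Counter(cleaned.split()).most_common(n)


if __name__ == "__main__":
    sample = "To be, or not to be: that is the question."
    for word, count in top_words(sample, 3):
        print(word, count, sep=":")
```

For the full text, the contents of hamlet.txt would be passed to `top_words` instead of the sample sentence. Slicing through `most_common(n)` also avoids an index error when the text has fewer than `n` distinct words.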

import jieba

with open("threekingdoms.txt", "r", encoding="utf-8") as f:
    txt = f.read()

# Frequent words that are not personal names
excludes = {"将军", "却说", "二人", "荆州", "不可", "不能", "如此", "商议", "左右", "引兵", "如何", "主公"}
words = jieba.lcut(txt)
counts = {}
for word in words:
    # Skip single-character tokens
    if len(word) == 1:
        continue
    # Merge the different names used for the same character
    elif word == "诸葛亮" or word == "孔明曰":
        rword = "孔明"
    elif word == "关公" or word == "云长":
        rword = "关羽"
    elif word == "玄德" or word == "玄德曰":
        rword = "刘备"
    elif word == "孟德" or word == "丞相":
        rword = "曹操"
    else:
        rword = word
    counts[rword] = counts.get(rword, 0) + 1

# Drop the excluded words; pop() avoids a KeyError if a word never occurred
for word in excludes:
    counts.pop(word, None)

items = list(counts.items())
# Sort by count, from largest to smallest
items.sort(key=lambda x: x[1], reverse=True)
for word, count in items[:10]:
    print(word, end=":")
    print(count)

Output:

曹操:1451
孔明:1383
刘备:1252
关羽:784
张飞:358
军士:317
吕布:300
军马:293
赵云:278
次日:271
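
The chain of `elif` branches grows with every alias that needs merging. One possible alternative, sketched here under assumptions, is to keep the aliases in a lookup table. `ALIASES` and `count_names` are hypothetical names introduced for this example, and the function takes an already segmented token list so that it can be tried without the novel's text file.

```python
from collections import Counter

# Hypothetical alias table: each variant token maps to one canonical name.
# The pairs are the ones merged by the elif chain in the script above.
ALIASES = {
    "诸葛亮": "孔明", "孔明曰": "孔明",
    "关公": "关羽", "云长": "关羽",
    "玄德": "刘备", "玄德曰": "刘备",
    "孟德": "曹操", "丞相": "曹操",
}


def count_names(tokens, aliases, excludes=frozenset(), n=10):
    """Count tokens after merging aliases; return the n most frequent.

    Single-character tokens are skipped, and words in excludes are
    removed from the final counts.
    """
    counts = Counter()
    for token in tokens:
        if len(token) == 1:
            continue
        counts[aliases.get(token, token)] += 1
    for word in excludes:
        # pop() with a default does nothing if the word never occurred
        counts.pop(word, None)
    return counts.most_common(n)


if __name__ == "__main__":
    # Stand-in for the list that jieba.lcut(txt) would return
    tokens = ["玄德", "曰", "孔明", "诸葛亮", "将军", "玄德曰", "孔明曰"]
    for name, count in count_names(tokens, ALIASES, excludes={"将军"}):
        print(name, count, sep=":")
```

With the real text, `tokens` would be the result of `jieba.lcut(txt)` and `excludes` the set defined in the script above. Adding another alias then means adding one dictionary entry rather than another branch.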
