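Use Python to count the ten most frequently appearing characters in the novel 《三国演义》 (Romance of the Three Kingdoms) and visualize the result.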



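Counting character appearances in Romance of the Three Kingdoms with Python

I. Install the required third-party libraries

jieba (an excellent third-party library for Chinese word segmentation)
pyecharts (an excellent data visualization library)

Download link for 《三国演义》.txt (extraction code: kist)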

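Installing the libraries with PyCharm

  • Open PyCharm and choose Settings under the [File] menu.
  • The settings dialog opens.
  • Click [+] on the right to open the package page, search for the library you need in the box at the top of that page, and install it.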
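
Before moving on, it can help to confirm that both libraries are visible to the interpreter the project uses. The sketch below is only an illustration: `missing_packages` is a hypothetical helper, not part of either library, and it relies only on the standard-library `importlib`. If something is reported as missing, installing it from the command line with `pip install jieba pyecharts` is an alternative to the PyCharm dialog.

```python
import importlib.util


def missing_packages(names):
    """Return the names that cannot be found by the current interpreter."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Check the two libraries this tutorial depends on
missing = missing_packages(["jieba", "pyecharts"])
if missing:
    print("Not installed:", ", ".join(missing))
else:
    print("All required libraries are installed.")
```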

II. Write the code

import jieba  # Chinese word segmentation library

print("Top ten characters by number of appearances:")
# Read the whole novel; the file is GB18030-encoded
with open('三国演义.txt', 'r', encoding='gb18030') as f:
    txt = f.read()
words = jieba.lcut(txt)  # split the text into a list of words
counts = {}
for word in words:
    if len(word) == 1:  # skip single characters and punctuation
        continue
    elif word == "诸葛亮" or word == "孔明曰":
        rword = "孔明"
    elif word == "关公" or word == "云长":
        rword = "关羽"
    elif word == "玄德" or word == "玄德曰":
        rword = "刘备"
    elif word == "孟德" or word == "丞相":
        rword = "曹操"  # merge different names for the same person
    else:
        rword = word
    counts[rword] = counts.get(rword, 0) + 1
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)  # most frequent first
for i in range(10):
    word, count = items[i]
    print("{}:{}".format(word, count))  # print the top-ten list
  • The result is shown in the figure below:
  • [Screenshot: program output]
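
The counting step can also be expressed with `collections.Counter` and a lookup table for the aliases. The following is a minimal sketch, not the method used in this article: `ALIASES`, `count_words`, and the hand-made `sample` list (standing in for the output of `jieba.lcut`) are names introduced here for illustration only.

```python
from collections import Counter

# Alias table copied from the if/elif chain in the script
ALIASES = {
    "诸葛亮": "孔明", "孔明曰": "孔明",
    "关公": "关羽", "云长": "关羽",
    "玄德": "刘备", "玄德曰": "刘备",
    "孟德": "曹操", "丞相": "曹操",
}


def count_words(words, aliases=ALIASES):
    """Count tokens, skipping single-character ones and merging aliases."""
    counts = Counter()
    for word in words:
        if len(word) == 1:
            continue
        counts[aliases.get(word, word)] += 1
    return counts


# Hand-made token list standing in for jieba.lcut(txt)
sample = ["玄德", "曰", "孔明", "诸葛亮", "玄德曰", "曹操", "丞相", "孔明"]
top = count_words(sample).most_common(3)
print(top)
```

Keeping the aliases in one dictionary means a new nickname is a one-line change, and `most_common(n)` replaces the manual sort and slice.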


  • Many of these words are not character names, so we need to delete them. Change the code as follows:
import jieba  # Chinese word segmentation library

print("Top ten characters by number of appearances:")
# Read the whole novel; the file is GB18030-encoded
with open('三国演义.txt', 'r', encoding='gb18030') as f:
    txt = f.read()
# Words to exclude, collected by running the program several times
remove = {"将军", "却说", "不能", "后主", "上马", "不知", "天子", "大叫", "众将", "不可",
          "主公", "蜀兵", "只见", "如何", "商议", "都督", "一人", "汉中", "人马",
          "陛下", "魏兵", "天下", "今日", "左右", "东吴", "于是", "荆州", "如此",
          "大喜", "引兵", "次日", "军士", "军马", "二人", "不敢"}
words = jieba.lcut(txt)  # split the text into a list of words
counts = {}
for word in words:
    if len(word) == 1:  # skip single characters and punctuation
        continue
    elif word == "诸葛亮" or word == "孔明曰":
        rword = "孔明"
    elif word == "关公" or word == "云长":
        rword = "关羽"
    elif word == "玄德" or word == "玄德曰":
        rword = "刘备"
    elif word == "孟德" or word == "丞相":
        rword = "曹操"  # merge different names for the same person
    else:
        rword = word
    counts[rword] = counts.get(rword, 0) + 1
for word in remove:
    counts.pop(word, None)  # delete excluded words; no error if a word never occurred

items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)  # most frequent first
for i in range(10):
    word, count = items[i]
    print("{}:{}".format(word, count))  # print the top-ten list
  • The result is shown in the figure below:
  • [Screenshot: program output]


The list now contains only character names.
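
Deleting a key with `del` raises `KeyError` when the key is missing, which can happen if the exclusion list was built from a different edition of the text. A minimal sketch of two tolerant ways to remove unwanted words follows; the counts are made up, and `drop_words` is a hypothetical helper introduced only for this example.

```python
def drop_words(counts, remove):
    """Return a copy of counts without the words listed in remove."""
    return {word: n for word, n in counts.items() if word not in remove}


counts = {"曹操": 5, "将军": 9, "孔明": 7}   # made-up numbers
remove = {"将军", "却说"}                    # "却说" does not occur in counts

# Option 1: build a filtered copy and leave the original untouched
filtered = drop_words(counts, remove)
print(filtered)

# Option 2: remove in place; pop with a default ignores missing keys
for word in remove:
    counts.pop(word, None)
print(counts)
```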

  • Export the data with the following code:
import jieba  # Chinese word segmentation library

# Read the whole novel; the file is GB18030-encoded
with open('三国演义.txt', 'r', encoding='gb18030') as f:
    txt = f.read()
# Words to exclude, collected by running the program several times
remove = {"将军", "却说", "不能", "后主", "上马", "不知", "天子", "大叫", "众将", "不可",
          "主公", "蜀兵", "只见", "如何", "商议", "都督", "一人", "汉中", "人马",
          "陛下", "魏兵", "天下", "今日", "左右", "东吴", "于是", "荆州", "如此",
          "大喜", "引兵", "次日", "军士", "军马", "二人", "不敢"}
words = jieba.lcut(txt)  # split the text into a list of words
counts = {}
for word in words:
    if len(word) == 1:  # skip single characters and punctuation
        continue
    elif word == "诸葛亮" or word == "孔明曰":
        rword = "孔明"
    elif word == "关公" or word == "云长":
        rword = "关羽"
    elif word == "玄德" or word == "玄德曰":
        rword = "刘备"
    elif word == "孟德" or word == "丞相":
        rword = "曹操"  # merge different names for the same person
    else:
        rword = word
    counts[rword] = counts.get(rword, 0) + 1
for word in remove:
    counts.pop(word, None)  # delete excluded words; no error if a word never occurred

items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)  # most frequent first

# Export the data ("w" overwrites, so re-running does not append duplicate rows)
with open("三国人物出场次数.txt", "w", encoding='utf-8') as fo:
    for word, count in items[:10]:
        fo.write("{}:{}\n".format(word, count))  # name and count separated by a colon, one per line
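  • Now run it and check whether the export worked. The result is shown in the figure below.
  • [Screenshot: the exported file in the project directory]


A file named 三国人物出场次数.txt has been generated, and it contains the data we just computed.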
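
The mode used to open the output file matters once the script is run more than once: `"a"` appends another ten lines on every run, while `"w"` replaces the previous content. The sketch below illustrates the overwrite behavior under stated assumptions: `export_top` and `demo_counts.txt` are hypothetical names, the counts are made up, and the file is written to a temporary directory so nothing in the project is touched.

```python
import os
import tempfile


def export_top(items, path, n=10):
    """Write the first n (name, count) pairs to path, one 'name:count' per line."""
    with open(path, "w", encoding="utf-8") as fo:  # "w" replaces any old content
        for name, count in items[:n]:
            fo.write("{}:{}\n".format(name, count))


# Made-up counts, already sorted in descending order
items = [("曹操", 30), ("孔明", 20), ("刘备", 10)]
path = os.path.join(tempfile.mkdtemp(), "demo_counts.txt")

export_top(items, path)
export_top(items, path)  # second run overwrites instead of appending

with open(path, encoding="utf-8") as fr:
    lines = fr.read().splitlines()
print(lines)
```

The `with` statement also closes the file automatically, even if a write fails partway through.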

III. Data visualization
  • Visualization needs data first, so we convert the data we just exported into a dictionary. The code is as follows:
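# Convert the data in the txt file into a dictionary
dic = {}  # dictionaries keep insertion order, so the ranking order is preserved
with open('三国人物出场次数.txt', 'r', encoding='utf-8') as fr:
    for line in fr:
        v = line.strip().split(':')
        dic[v[0]] = int(v[1])  # store the count as an integer, not a string
print(dic)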

  • The result is as follows:

[Screenshot: the printed dictionary]
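
The parsing step can be made a little more tolerant of blank lines and can convert the counts to integers while reading. This is a sketch under assumptions, not the article's own code: `parse_counts` is a hypothetical helper and the `demo` lines are made up to mimic the exported file.

```python
def parse_counts(lines):
    """Turn 'name:count' lines into a dict mapping name -> int count."""
    result = {}
    for line in lines:
        line = line.strip()
        if not line:  # ignore blank lines
            continue
        name, _, count = line.partition(":")  # split at the first colon only
        result[name] = int(count)
    return result


# Made-up lines in the same format as the exported file
demo = ["曹操:30\n", "孔明:20\n", "\n", "刘备:10\n"]
parsed = parse_counts(demo)
print(parsed)
```

Because an open text file yields its lines when iterated, the same function accepts a file object in place of the list.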

  • Plot with pyecharts
  • First import the modules:
from pyecharts import options as opts
from pyecharts.charts import Bar
  • The code is as follows:
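# Plot
list1 = list(dic.keys())
list2 = [int(v) for v in dic.values()]  # dictionary data for the chart, with counts as integers
c = (
    Bar()
    .add_xaxis(list1)
    .add_yaxis("Appearances", list2)
    .set_global_opts(
        xaxis_opts=opts.AxisOpts(axislabel_opts=opts.LabelOpts(rotate=-15)),  # tilt the x-axis labels
    )
    .render("人物出场次数可视化图.html")  # write the chart to an HTML file
)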
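  • After running the program, a file named 人物出场次数可视化图.html appears in the directory, as shown below:
  • [Screenshot: the generated HTML file in the project directory]


  • Open it in a browser to see the data presented as a chart.
  • [Screenshot: bar chart of character appearances]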
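
Separating the data preparation from the drawing makes the chart code easier to check without opening a browser. The following is a minimal sketch, assuming pyecharts version 1 or later: `split_axes`, `render_bar`, the chart title, and the output name `demo_bar.html` are all hypothetical choices made for this example.

```python
def split_axes(dic, top_n=10):
    """Sort by count (descending) and return (names, counts) for the chart."""
    ordered = sorted(dic.items(), key=lambda kv: int(kv[1]), reverse=True)[:top_n]
    names = [name for name, _ in ordered]
    values = [int(count) for _, count in ordered]
    return names, values


def render_bar(dic, path="demo_bar.html"):
    """Draw a bar chart; pyecharts is imported here so split_axes works without it."""
    from pyecharts import options as opts
    from pyecharts.charts import Bar

    names, values = split_axes(dic)
    bar = (
        Bar()
        .add_xaxis(names)
        .add_yaxis("Appearances", values)
        .set_global_opts(title_opts=opts.TitleOpts(title="Top characters"))
    )
    return bar.render(path)  # write the chart to an HTML file


# Made-up data; counts may arrive as strings when read back from a text file
names, values = split_axes({"孔明": "20", "曹操": "30", "刘备": 10})
print(names, values)
```

Calling `render_bar(dic)` with the dictionary built earlier would then produce the HTML file in the current directory.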


IV. Complete code
# Python code for counting character appearances in 《三国演义》:


import jieba  # Chinese word segmentation library
from pyecharts import options as opts
from pyecharts.charts import Bar

print("Top ten characters by number of appearances:")
# Read the whole novel; the file is GB18030-encoded
with open('三国演义.txt', 'r', encoding='gb18030') as f:
    txt = f.read()
# Words to exclude, collected by running the program several times
remove = {"将军", "却说", "不能", "后主", "上马", "不知", "天子", "大叫", "众将", "不可",
          "主公", "蜀兵", "只见", "如何", "商议", "都督", "一人", "汉中", "人马",
          "陛下", "魏兵", "天下", "今日", "左右", "东吴", "于是", "荆州", "如此",
          "大喜", "引兵", "次日", "军士", "军马", "二人", "不敢"}
words = jieba.lcut(txt)  # split the text into a list of words
counts = {}
for word in words:
    if len(word) == 1:  # skip single characters and punctuation
        continue
    elif word == "诸葛亮" or word == "孔明曰":
        rword = "孔明"
    elif word == "关公" or word == "云长":
        rword = "关羽"
    elif word == "玄德" or word == "玄德曰":
        rword = "刘备"
    elif word == "孟德" or word == "丞相":
        rword = "曹操"  # merge different names for the same person
    else:
        rword = word
    counts[rword] = counts.get(rword, 0) + 1
for word in remove:
    counts.pop(word, None)  # delete excluded words; no error if a word never occurred

items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)  # most frequent first

# Export the data ("w" overwrites, so re-running does not append duplicate rows)
with open("三国人物出场次数.txt", "w", encoding='utf-8') as fo:
    for word, count in items[:10]:
        fo.write("{}:{}\n".format(word, count))  # name and count separated by a colon, one per line

# Convert the data in the txt file into a dictionary
dic = {}  # dictionaries keep insertion order, so the ranking order is preserved
with open('三国人物出场次数.txt', 'r', encoding='utf-8') as fr:
    for line in fr:
        v = line.strip().split(':')
        dic[v[0]] = int(v[1])  # store the count as an integer, not a string
print(dic)


# Plot
list1 = list(dic.keys())
list2 = list(dic.values())  # dictionary data for the chart
c = (
    Bar()
    .add_xaxis(list1)
    .add_yaxis("Appearances", list2)
    .set_global_opts(
        xaxis_opts=opts.AxisOpts(axislabel_opts=opts.LabelOpts(rotate=-15)),  # tilt the x-axis labels
    )
    .render("人物出场次数可视化图.html")  # write the chart to an HTML file
)