曲牌名

获取来源

之前爬取过元代的诗,众所周知,曲牌名出自于元代,唐诗,宋词,元曲

收集曲牌名

import pandas as pd
import xlwt



#读取yuan代的诗词
def read(file):
data=pd.read_excel(file)
title=data.title
# 存储一个曲排名列表
qu_list=[]
for it in title:
if it.find('·')!=-1:
# 根据诗词名获取对应的曲牌名
qu=it.split('·')
qu_list.append(qu[0])
new_qu=list(set(qu_list))
#将曲牌名进行保存
xl = xlwt.Workbook()
# 调用对象的add_sheet方法
sheet1 = xl.add_sheet('sheet1', cell_overwrite_ok=True)
sheet1.write(0, 0, "qu_name")

for i in range(0, len(new_qu)):
sheet1.write(i + 1, 0, new_qu[i])

xl.save("qupai_name.xlsx")





if __name__ == '__main__':
file='data/yuan.xlsx'
read(file)

成果展示

1104-曲牌名,飞花令,诗句对应的飞花令_数据

 

 飞花令

获取来源

古诗词网上有专门的飞花令字词,因此我们的来源就是它

1104-曲牌名,飞花令,诗句对应的飞花令_数据_02

 

 爬取飞花令

import requests
from bs4 import BeautifulSoup
from lxml import etree

headers = {'user-agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36'}#创建头部信息

hc=[]

url='https://www.xungushici.com/feihualings'
r=requests.get(url,headers=headers)
content=r.content.decode('utf-8')
soup = BeautifulSoup(content, 'html.parser')
ul=soup.find('ul',class_='list-unstyled d-flex flex-row flex-wrap align-items-center w-100')
li_list=ul.find_all('li',class_='m-1 badge badge-light')
word=[]
for it in li_list:
word.append(it.a.text)

import xlwt

xl = xlwt.Workbook()
# 调用对象的add_sheet方法
sheet1 = xl.add_sheet('sheet1', cell_overwrite_ok=True)

sheet1.write(0,0,"word")

for i in range(0,len(word)):
sheet1.write(i+1,0,word[i])

xl.save("word.xlsx")

结果展示

1104-曲牌名,飞花令,诗句对应的飞花令_html_03

 

诗句-飞花令

思路

通过遍历爬取的50万首古诗,分析每个句子是否有包含的飞花令中的关键字,如果有将其存储起来:诗句、作者、诗名、关键字

BUG

如果用xlwt来存储,最多存储65536行数据,用openpyxl可以存储100万行数据。由于我们的诗句数据过大,因此需采用openpyxl来进行存储

代码

import pandas as pd
import xlwt
import openpyxl

#读取飞花令
def read_word():
data=pd.read_excel('data2/word.xlsx')
words=data.word
return words

#遍历诗句
def read(file,words,write_file):
data=pd.read_excel(file)
title=data.title
content=data.content
author=data.author
#进行切分出单句
ans_sentens = []
ans_author = []
ans_title = []
ans_key = []
for i in range(len(title)):
print("第"+str(i)+"个")
cont=content[i]
aut=author[i]
tit=title[i]
sents=cont.replace('\n','').split('。')
for it in sents:
key_list = []
for k in words:
if it.find(k)!=-1:
key_list.append(k)
if len(key_list)!=0:
ans_sentens.append(it)
ans_author.append(aut)
ans_title.append(tit)
ans_key.append(",".join(key_list))

#存储对应的key,author,title,sentenous
xl = openpyxl.Workbook()
# 调用对象的add_sheet方法
sheet1 = xl.create_sheet(index=0)
sheet1.cell(1, 1, "sentens")
sheet1.cell(1, 2, "author")
sheet1.cell(1, 3, "title")
sheet1.cell(1, 4, "keys")

for i in range(0, len(ans_key)):
sheet1.cell(i + 2, 1, ans_sentens[i])
sheet1.cell(i + 2, 2, ans_author[i])
sheet1.cell(i + 2, 3, ans_title[i])
sheet1.cell(i + 2, 4, ans_key[i])
xl.save(write_file)
print("保存成功到-"+write_file)

#获取指定文件夹下的excel
import os
def get_filename(path,filetype): # 输入路径、文件类型例如'.xlsx'
name = []
for root,dirs,files in os.walk(path):
for i in files:
if os.path.splitext(i)[1]==filetype:
name.append(i)
return name # 输出由有后缀的文件名组成的列表

if __name__ == '__main__':
file='data/'
words=read_word()
list = get_filename(file, '.xlsx')
for i in range(len(list)):
new_file=file+list[i]
print(new_file)
sentences_file = "sentences/sentence" + str(i+1) + ".xlsx"
read(new_file,words,sentences_file)

结果展示

1104-曲牌名,飞花令,诗句对应的飞花令_中文分词_04

 

 

1104-曲牌名,飞花令,诗句对应的飞花令_中文分词_05

 

 明日任务

先学习常见的中文分词工具,分出对应的相关实体,做个小demo尝试

中文分词,试图将诗人个人经历,逐个分段,梳理出这几类关键信息:人物,时间,事件,地点。将文本抽取为规则化的数据格式。