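I. Project Background
In the era of big data, living standards have risen considerably, and viewers have their own opinions about the videos they watch, which they express by commenting on them. This project scrapes the comments of a Bilibili video and displays them as a word cloud.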
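II. Development Environment, Hardware Support, and Functional Description
Development environment:
Python 3.7.4 + PyCharm 2020.1.3
Python is the interpreter that runs the code; PyCharm is the IDE used to write it.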
III. Training Objective
Scrape the comment data of a specified Bilibili video and generate a word cloud visualization with stylecloud.
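IV. Training Content
1. Use web-scraping techniques to fetch the comment data of a Bilibili video page:
Code screenshots and result screenshots: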
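A. A large User-Agent list, to reduce the chance of being blocked by anti-scraping measures
B. Imports
a) requests_html: request module, used to send HTTP requests;
b) jsonpath: parsing module, used to extract fields from the JSON comment data;
c) wordcloud and stylecloud: visualization modules, used to draw the word clouds;
d) numpy: numerical array module, used here to turn the mask image into an array;
e) os: creates the folder in which the results are saved;
f) xlutils, xlrd, xlwt: used to save the comments to an Excel file;
g) time: used for timestamp conversion (it can also add delays between requests);
h) random: used to pick a random User-Agent (it can also generate random delay times);
i) re: used to extract the video's aid from the page source with a regular expression.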
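C. Initialization: the comment API URL template, the loop flag, the starting page number, the comment container, and the video URL
D. Send the request to the video page, read the response, and extract the video's aid from the page source
E. The video URL to request (the input() prompt is commented out in the listing, so the URL is hard-coded)
F. Parse the comments, using is_running to control paging to the next page
G. Generate the stylecloud word cloud
H. Parse the list of comments with jsonpath and convert each Unix timestamp into a readable date and time
I. After the comment content is obtained, save the data to an Excel sheet
J. Generate the map-shaped word cloud, using a map image as the mask
K. Screenshot of the program's output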
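Step H can be illustrated in isolation. The following is a minimal sketch, assuming the comment's ctime field is a Unix timestamp in seconds (which is how the listing treats it). The helper name format_ctime is hypothetical and not part of the project code, and UTC is used here only so that the result does not depend on the machine's time zone; the listing itself converts to local time.

```python
from datetime import datetime, timezone


def format_ctime(ctime, tz=timezone.utc):
    """Convert a Unix timestamp in seconds to a 'YYYY-MM-DD HH:MM:SS' string."""
    # int() accepts both numbers and numeric strings
    return datetime.fromtimestamp(int(ctime), tz=tz).strftime("%Y-%m-%d %H:%M:%S")


print(format_ctime(1623082600))  # 2021-06-07 16:16:40 (UTC)
```

Passing tz=None instead would give the local-time string, which matches what time.localtime produces in the listing.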
Code of blblMapObjSpider.py
USER_AGENT_LIST = [
'Mozilla/5.0 (Windows NT 6.2; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3451.0 Safari/537.36',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:57.0) Gecko/20100101 Firefox/57.0',
'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1500.71 Safari/537.36',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.2999.0 Safari/537.36',
'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.70 Safari/537.36',
'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.4; en-US; rv:1.9.2.2) Gecko/20100316 Firefox/3.6.2',
'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.155 Safari/537.36 OPR/31.0.1889.174',
'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.1.4322; MS-RTC LM 8; InfoPath.2; Tablet PC 2.0)',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36 OPR/55.0.2994.61',
'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/14.0.814.0 Safari/535.1',
'Mozilla/5.0 (Macintosh; U; PPC Mac OS X; ja-jp) AppleWebKit/418.9.1 (KHTML, like Gecko) Safari/419.3',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.134 Safari/537.36',
'Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; Trident/6.0; Touch; MASMJS)',
'Mozilla/5.0 (X11; Linux i686) AppleWebKit/535.21 (KHTML, like Gecko) Chrome/19.0.1041.0 Safari/535.21',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36',
'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4093.3 Safari/537.36',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/537.36 (KHTML, like Gecko; compatible; Swurl) Chrome/77.0.3865.120 Safari/537.36',
'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36',
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4086.0 Safari/537.36',
'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:75.0) Gecko/20100101 Firefox/75.0',
'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) coc_coc_browser/91.0.146 Chrome/85.0.4183.146 Safari/537.36',
'Mozilla/5.0 (Windows; U; Windows NT 5.2; en-US) AppleWebKit/537.36 (KHTML, like Gecko) Safari/537.36 VivoBrowser/8.4.72.0 Chrome/62.0.3202.84',
'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.101 Safari/537.36',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36 Edg/87.0.664.60',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.16; rv:83.0) Gecko/20100101 Firefox/83.0',
'Mozilla/5.0 (X11; CrOS x86_64 13505.63.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:68.0) Gecko/20100101 Firefox/68.0',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.101 Safari/537.36',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36',
'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36 OPR/72.0.3815.400',
'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.101 Safari/537.36',
]
from requests_html import HTMLSession
from jsonpath import jsonpath
from PIL import Image
import os, xlwt, xlrd, time, stylecloud, random, re
from xlutils.copy import copy
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from wordcloud import WordCloud
session = HTMLSession()
class BZSpider(object):
    def __init__(self):
        # Comment texts collected for the mask-based word cloud
        self.yun_list = []
        # Comment API template: the first {} is the page number, the second {} is the video aid
        self.start_url = 'https://api.bilibili.com/x/v2/reply/main?jsonp=jsonp&next={}&type=1&oid={}&mode=3&plat=1&_=1623082600632'
        # Loop condition for paging
        self.is_running = True
        # Page counter
        self.start_page = 1
        # Container for all comment texts
        self.big_list = []
        # self.pinglun_url = input('Enter the video URL: ')
        self.pinglun_url = 'https://www.bilibili.com/video/BV1PN411X7QW?from=search&seid=13620076173109636987'
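    def parse_pl_url_response(self):
        """
        Request the video page and extract the aid values it contains.
        :return:
        """
        headers = {
            'user-agent': random.choice(USER_AGENT_LIST)
        }
        response = session.get(self.pinglun_url, headers=headers).content.decode()
        # The regex matches every "aid" field in the page source; set() removes repeats
        aid_set = re.findall(r'"aid":(.*?),', response)
        aid_list = list(set(aid_set))
        for aid in aid_list:
            self.parse_start_url(aid)
        # Once the comments are collected, generate the stylecloud image
        self.parse_c_y_img()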
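    def parse_start_url(self, aid):
        """
        Fetch and parse the comments of one video, page by page.
        :return:
        """
        # Restart paging for each aid
        self.is_running = True
        self.start_page = 1
        while self.is_running:
            headers = {
                'user-agent': random.choice(USER_AGENT_LIST)
            }
            response = session.get(self.start_url.format(self.start_page, aid), headers=headers).json()
            # jsonpath returns False when nothing matches; a JSON null becomes None
            matches = jsonpath(response, '$..replies')
            data_replies = matches[0] if matches else None
            # Loop exit: no more comments on this page
            if not data_replies:
                self.is_running = False
                break
            # Parse the list of comments on this page
            self.parse_data_replies(data_replies)
            # Loop exit: stop after 10 pages
            if self.start_page == 10:
                self.is_running = False
            # Advance the page counter
            self.start_page += 1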
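    def parse_c_y_img(self):
        """
        Generate the stylecloud word cloud image.
        :return:
        """
        print('--------------Generating word cloud--------------')
        data = ''.join(self.big_list)
        # Pass the comments as text=; the first positional parameter of gen_stylecloud is file_path
        stylecloud.gen_stylecloud(text=data, font_path="C:/Windows/Fonts/simfang.ttf")
        # gen_stylecloud writes to stylecloud.png by default
        img = Image.open("stylecloud.png")
        img.show()
        print('\n' + '----------------------Word cloud generated---------------------' + '\n')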
    def parse_data_replies(self, data_replies):
        """
        Parse the list of comments.
        :param data_replies: list of reply dicts from the API response
        :return:
        """
        for dict_data in data_replies:
            # '$..' searches recursively, so nested replies are included as well
            message = jsonpath(dict_data, '$..message')
            c_time = jsonpath(dict_data, '$..ctime')
            # jsonpath returns False when a field is missing
            if not message or not c_time:
                continue
            for text, temp in zip(message, c_time):
                # Convert the Unix timestamp to a local-time string
                timeArray = time.localtime(int(temp))
                otherStyleTime = time.strftime("%Y-%m-%d %H:%M:%S", timeArray)
                self.big_list.append(text)
                data = {
                    '评论数据': [otherStyleTime, text]
                }
                self.save_excel(data)
                self.yun_list.append(text)
                print('Saved one comment')
    def save_excel(self, data):
        # Expected shape of data: {sheet name: [cell values of one row]}, e.g.
        # data = {
        #     '基本详情': ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
        # }
        os_path_1 = os.getcwd() + '/数据/'
        if not os.path.exists(os_path_1):
            os.mkdir(os_path_1)
        # os_path = os_path_1 + self.os_path_name + '.xls'
        os_path = os_path_1 + '评论数据.xls'
        if not os.path.exists(os_path):
            # Create a new workbook (that is, a new Excel file)
            workbook = xlwt.Workbook(encoding='utf-8')
            # Create a new sheet
            worksheet1 = workbook.add_sheet("评论数据", cell_overwrite_ok=True)
            borders = xlwt.Borders()  # Create Borders
            # Thin solid borders
            borders.left = xlwt.Borders.THIN
            borders.right = xlwt.Borders.THIN
            borders.top = xlwt.Borders.THIN
            borders.bottom = xlwt.Borders.THIN
            borders.left_colour = 0x40
            borders.right_colour = 0x40
            borders.top_colour = 0x40
            borders.bottom_colour = 0x40
            style = xlwt.XFStyle()  # Create Style
            style.borders = borders  # Add Borders to Style
            # Centered alignment
            al = xlwt.Alignment()
            al.horz = 0x02  # horizontal center
            al.vert = 0x01  # vertical center
            style.alignment = al
            # Merge row 0, columns 0 to 13 (unused title row)
            # worksheet1.write_merge(0, 0, 0, 13, '基本详情', style)
            excel_data_1 = ('评论时间', '评论内容')
            for i in range(0, len(excel_data_1)):
                worksheet1.col(i).width = 2560 * 3
                # row, column, content, style
                worksheet1.write(0, i, excel_data_1[i], style)
            workbook.save(os_path)
        # Append the row if the workbook exists
        if os.path.exists(os_path):
            # Open the workbook
            workbook = xlrd.open_workbook(os_path)
            # Names of all sheets in the workbook
            sheets = workbook.sheet_names()
            for i in range(len(sheets)):
                for name in data.keys():
                    worksheet = workbook.sheet_by_name(sheets[i])
                    # Compare the sheet name with the key in data
                    if worksheet.name == name:
                        # Number of rows already in the sheet
                        rows_old = worksheet.nrows
                        # Copy the xlrd object into a writable xlwt object
                        new_workbook = copy(workbook)
                        # Get the i-th sheet of the copy
                        new_worksheet = new_workbook.get_sheet(i)
                        for num in range(0, len(data[name])):
                            new_worksheet.write(rows_old, num, data[name][num])
                        new_workbook.save(os_path)
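    def show_img(self):
        '''
        Generate the map-shaped word cloud, using a background image as the mask.
        '''
        data = ''.join(self.yun_list)
        bg = np.array(Image.open("qq.jpg"))
        mask = bg
        wc = WordCloud(width=500,  # canvas width (ignored when a mask is given)
                       height=500,  # canvas height (ignored when a mask is given)
                       mask=mask,  # mask image that shapes the cloud
                       background_color='white',  # background colour of the image
                       font_path=r'C:/Windows/Fonts/simfang.ttf',  # font (Chinese text needs a Chinese font installed locally)
                       max_font_size=400,  # largest font size
                       random_state=50,  # seed, so the layout is reproducible
                       )
        wc.generate(data)
        # Display the word cloud with matplotlib
        plt.imshow(wc)
        plt.axis("off")
        # Save the figure as a local image
        plt.savefig('B站视频-词云图.png')
        plt.show()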
if __name__ == '__main__':
    b = BZSpider()
    b.parse_pl_url_response()
    b.show_img()
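The paging logic of parse_start_url can be studied without any network access. The sketch below is an illustration under stated assumptions, not the project's code: crawl_pages, fetch_page and fake_pages are hypothetical names, and fetch_page stands in for the real API request by returning a list of comments for a page number, or None when there are no more comments.

```python
def crawl_pages(fetch_page, max_pages=10):
    """Collect replies page by page until a page is empty or max_pages is reached.

    fetch_page is a hypothetical callable: page number -> list of replies,
    or None when there are no more comments.
    """
    collected = []
    for page in range(1, max_pages + 1):
        replies = fetch_page(page)
        if not replies:  # None or [] means there is nothing left to read
            break
        collected.extend(replies)
    return collected


# Stand-in for the network call: three pages of fake comments, then None.
fake_pages = {1: ["a", "b"], 2: ["c"], 3: ["d", "e"]}
print(crawl_pages(fake_pages.get))     # ['a', 'b', 'c', 'd', 'e']
print(crawl_pages(fake_pages.get, 2))  # ['a', 'b', 'c']
```

Checking for an empty page before processing it, and bounding the loop with max_pages, gives the loop two independent exits, so it cannot run forever if the API keeps returning data.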
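Scraped content:
V. Training Summary
With the help of my classmates and teacher, this training project was completed successfully, and I learned a great deal, including how to use the requests_html request library and the jsonpath data-parsing library.
During the training I found that my understanding of wordcloud was not deep enough. I also came to understand what object-oriented programming means and gained a better understanding of anti-scraping mechanisms.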
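As a small illustration of the anti-scraping point, the sketch below rotates the User-Agent header and computes a random pause between requests. It is only a sketch under assumptions: SAMPLE_USER_AGENTS is a shortened stand-in for USER_AGENT_LIST, build_headers and polite_delay are hypothetical helpers, and the listing above does not actually pause between requests.

```python
import random

# Shortened stand-in for USER_AGENT_LIST; any list of User-Agent strings works.
SAMPLE_USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.16; rv:83.0) Gecko/20100101 Firefox/83.0",
]


def build_headers(user_agents, rng=random):
    """Return request headers with a randomly chosen User-Agent."""
    return {"user-agent": rng.choice(user_agents)}


def polite_delay(low=1.0, high=3.0, rng=random):
    """Return a random pause length in seconds; pass it to time.sleep()."""
    return rng.uniform(low, high)


headers = build_headers(SAMPLE_USER_AGENTS)
pause = polite_delay(0.5, 1.0)
print(headers["user-agent"], round(pause, 2))
```

In a real crawl the pause would be applied with time.sleep(pause) before each request, which spreads the requests out instead of sending them back to back.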