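I. Project Background

In the era of big data, living standards have risen considerably, and viewers have their own opinions about the videos they watch, which they often express by commenting on a particular video. This project scrapes the comments of a Bilibili video and presents them as a word cloud.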

 

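II. Development Environment, Hardware Support, and Functional Description

Development environment:

Python 3.7.4 + PyCharm 2020.1.3

Python 3.7.4 is the interpreter that runs the code; PyCharm is the IDE used to write it.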


Training Objective

Scrape the comment data of a specified Bilibili video and visualize it as a word cloud generated with stylecloud.

 

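Training Content

1. Use web-scraping techniques to fetch the comment data from a Bilibili video page:

Code screenshots and result screenshots:

A. A large User-Agent list, used to reduce the chance of being blocked by anti-scraping measures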
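
As a minimal sketch of how such a list is typically used, the snippet below picks one entry at random for each request. The two-entry `SAMPLE_USER_AGENTS` list and the `build_headers` helper are hypothetical names used only for illustration; the project itself draws from the full `USER_AGENT_LIST` in the code listing.

```python
import random

# Hypothetical shortened list; the project uses the full USER_AGENT_LIST.
SAMPLE_USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:75.0) Gecko/20100101 Firefox/75.0',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:68.0) Gecko/20100101 Firefox/68.0',
]


def build_headers(user_agents):
    """Return request headers with a randomly chosen User-Agent."""
    return {'user-agent': random.choice(user_agents)}


headers = build_headers(SAMPLE_USER_AGENTS)
print(headers['user-agent'])
```

Because a new User-Agent is chosen on every call, consecutive requests do not all present the same browser signature.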

 


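B. Imports

a) requests-html: request module, used to send HTTP requests;

b) jsonpath: parsing module, used to parse the comment data from the JSON response;

c) wordcloud: visualization module, used to draw the word cloud;

d) numpy: data-analysis module, used here to load the mask image as an array;

e) os: used to create the folder where results are saved;

f) xlutils, xlrd, xlwt: used to save the comments to an Excel file;

g) time: used to add delays and to convert timestamps;

h) random: used to generate random delay times and to pick a random User-Agent;

i) re: used for regular-expression parsing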
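
The random delay mentioned for the time and random modules could look roughly like the sketch below. The helper name `random_delay`, its default 1 to 3 second range, and the injectable `sleep` parameter are assumptions made for illustration, not part of the project code.

```python
import random
import time


def random_delay(min_seconds=1.0, max_seconds=3.0, sleep=time.sleep):
    """Pause for a random time in [min_seconds, max_seconds] and return it."""
    delay = random.uniform(min_seconds, max_seconds)
    sleep(delay)
    return delay


# Example: wait a short random time between two requests.
waited = random_delay(0.01, 0.02)
print(f'waited {waited:.3f} s')
```

Passing `sleep` as a parameter keeps the helper easy to test, since a test can record the requested delay instead of actually waiting.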

 

 


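C. Initialization: set the comment API URL template, the starting page number, the loop flag, and the video page URL

D. Send the request and read the response; the response text is searched for the video's aid, which the comment API needs.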
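
A minimal sketch of the aid extraction step follows. The `sample_page` string is a made-up fragment, not real page source, and `extract_aids` is a hypothetical helper. It uses `\d+` rather than the listing's `(.*?),` pattern, on the assumption that an aid is purely numeric.

```python
import re

# Hypothetical fragment of page source, for illustration only.
sample_page = '{"aid":123456,"bvid":"BV1example","cid":7890,"aid":123456,"title":"demo"}'


def extract_aids(page_text):
    """Return the distinct numeric aid values found in the page text, sorted."""
    return sorted(set(re.findall(r'"aid":(\d+)', page_text)))


print(extract_aids(sample_page))
```

Converting the matches to a set removes repeated ids, so each video's comments are requested only once.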

 

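E. Request the video page address supplied by the user (hard-coded in the listing; the input() prompt is commented out)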

 


F. Parse the comments, using the is_running flag to page through to the next page
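
The flag-controlled paging can be sketched independently of the network code. In the sketch below, `collect_pages` and `fake_pages` are hypothetical: `fetch_page` stands in for the real API request, and the loop stops when a page comes back empty or the page limit is reached.

```python
def collect_pages(fetch_page, max_pages=10):
    """Call fetch_page(page) for page = 1, 2, ... until it returns nothing
    or max_pages pages have been collected."""
    collected = []
    page = 1
    is_running = True
    while is_running:
        replies = fetch_page(page)
        if not replies:
            # Empty list or None: there are no more comments.
            is_running = False
        else:
            collected.extend(replies)
            if page == max_pages:
                is_running = False
            page += 1
    return collected


# Hypothetical stand-in for the network request: three pages of fake comments.
fake_pages = {1: ['a', 'b'], 2: ['c'], 3: ['d', 'e']}
print(collect_pages(fake_pages.get))
```

Checking for an empty result before processing it avoids iterating over `None` when the API has run out of comments.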


G. Generate the word cloud


H. Parse the comment list with jsonpath and convert each extracted Unix timestamp into a readable local time
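
The timestamp conversion can be sketched as follows. The helper name `format_ctime` is hypothetical, and the sketch uses `time.gmtime` (UTC) so that its output is the same on every machine, whereas the project listing uses `time.localtime`.

```python
import time


def format_ctime(ctime, fmt='%Y-%m-%d %H:%M:%S'):
    """Convert a Unix timestamp (in seconds) to a UTC time string."""
    return time.strftime(fmt, time.gmtime(int(ctime)))


print(format_ctime(0))  # 1970-01-01 00:00:00
```

Replacing `time.gmtime` with `time.localtime` gives the time in the machine's own time zone instead.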


I. After the comment content is obtained, save the data to an Excel spreadsheet

 


J. Generate a map-shaped word cloud, using a map image as the mask


K. Run results:

Code of blblMapObjSpider.py

USER_AGENT_LIST = [
                  'Mozilla/5.0 (Windows NT 6.2; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36',
                  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3451.0 Safari/537.36',
                  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:57.0) Gecko/20100101 Firefox/57.0',
                  'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1500.71 Safari/537.36',
                  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.2999.0 Safari/537.36',
                  'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.70 Safari/537.36',
                  'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.4; en-US; rv:1.9.2.2) Gecko/20100316 Firefox/3.6.2',
                  'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.155 Safari/537.36 OPR/31.0.1889.174',
                  'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.1.4322; MS-RTC LM 8; InfoPath.2; Tablet PC 2.0)',
                  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36',
                  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36 OPR/55.0.2994.61',
                  'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/14.0.814.0 Safari/535.1',
                  'Mozilla/5.0 (Macintosh; U; PPC Mac OS X; ja-jp) AppleWebKit/418.9.1 (KHTML, like Gecko) Safari/419.3',
                  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.134 Safari/537.36',
                  'Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; Trident/6.0; Touch; MASMJS)',
                  'Mozilla/5.0 (X11; Linux i686) AppleWebKit/535.21 (KHTML, like Gecko) Chrome/19.0.1041.0 Safari/535.21',
                  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36',
                  'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4093.3 Safari/537.36',
                  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/537.36 (KHTML, like Gecko; compatible; Swurl) Chrome/77.0.3865.120 Safari/537.36',
                  'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36',
                  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36',
                  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36',
                  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4086.0 Safari/537.36',
                  'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:75.0) Gecko/20100101 Firefox/75.0',
                  'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) coc_coc_browser/91.0.146 Chrome/85.0.4183.146 Safari/537.36',
                  'Mozilla/5.0 (Windows; U; Windows NT 5.2; en-US) AppleWebKit/537.36 (KHTML, like Gecko) Safari/537.36 VivoBrowser/8.4.72.0 Chrome/62.0.3202.84',
                  'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.101 Safari/537.36',
                  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36 Edg/87.0.664.60',
                  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.16; rv:83.0) Gecko/20100101 Firefox/83.0',
                  'Mozilla/5.0 (X11; CrOS x86_64 13505.63.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36',
                  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:68.0) Gecko/20100101 Firefox/68.0',
                  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.101 Safari/537.36',
                  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36',
                  'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36 OPR/72.0.3815.400',
                  'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.101 Safari/537.36',
                  ]

import os
import random
import re
import time

import matplotlib.pyplot as plt
import numpy as np
import stylecloud
import xlrd
import xlwt
from jsonpath import jsonpath
from PIL import Image
from requests_html import HTMLSession
from wordcloud import WordCloud
from xlutils.copy import copy

# One shared session for all requests
session = HTMLSession()


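class BZSpider(object):

    def __init__(self):
        # Comment texts used for the map-shaped word cloud
        self.yun_list = []
        # Comment API template: the first {} is the page number, the second is the video's aid
        self.start_url = 'https://api.bilibili.com/x/v2/reply/main?jsonp=jsonp&next={}&type=1&oid={}&mode=3&plat=1&_=1623082600632'
        # Loop condition
        self.is_running = True
        # Page counter
        self.start_page = 1
        # Container for all comment texts
        self.big_list = []
        # The video page URL could instead be read from the user:
        # self.pinglun_url = input('Enter the video URL: ')
        self.pinglun_url = 'https://www.bilibili.com/video/BV1PN411X7QW?from=search&seid=13620076173109636987'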

    def parse_pl_url_response(self):
        """
        Request the video page and extract the aid values it contains
        :return:
        """
        headers = {
            'user-agent': random.choice(USER_AGENT_LIST)
        }
        response = session.get(self.pinglun_url, headers=headers).content.decode()
        aid_set = re.findall(r'"aid":(.*?),', response)
        aid_list = list(set(aid_set))
        for aid in aid_list:
            self.parse_start_url(aid)
        # Generate the word cloud once all comments are collected
        self.parse_c_y_img()

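    def parse_start_url(self, aid):
        """
        Fetch and parse the comments of one video, page by page
        :param aid: the video's aid
        :return:
        """
        # Restart paging for every aid
        self.is_running = True
        self.start_page = 1
        while self.is_running:
            headers = {
                'user-agent': random.choice(USER_AGENT_LIST)
            }
            response = session.get(self.start_url.format(self.start_page, aid), headers=headers).json()
            # Extract the comment list; jsonpath returns False when nothing matches
            matches = jsonpath(response, '$..replies')
            data_replies = matches[0] if matches else None
            # Loop exit: JSON null is parsed as None, so stop on an empty result
            if not data_replies:
                self.is_running = False
                break
            # Parse the comment list
            self.parse_data_replies(data_replies)
            # Loop exit: stop after 10 pages
            if self.start_page == 10:
                self.is_running = False
            # Page counter + 1
            self.start_page += 1
            # Random pause between pages (interval chosen for illustration)
            time.sleep(random.uniform(1, 3))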

    def parse_c_y_img(self):
        """
        Generate the word cloud image with stylecloud
        :return:
        """
        print('-------------- generating word cloud --------------')
        data = ''.join(self.big_list)
        # Writes stylecloud.png (the default output name); Chinese text needs a Chinese font
        stylecloud.gen_stylecloud(data, font_path="C:/Windows/Fonts/simfang.ttf")
        img = Image.open("stylecloud.png")
        img.show()
        print('\n' + '---------------------- word cloud generated ---------------------' + '\n')

    def parse_data_replies(self, data_replies):
        """
        Parse the list of comments
        :param data_replies: list of comment dicts from the API response
        :return:
        """
        for dict_data in data_replies:
            message = jsonpath(dict_data, '$..message')
            c_time = jsonpath(dict_data, '$..ctime')
            for text, temp in zip(message, c_time):
                # Convert the Unix timestamp to a local time string
                timeArray = time.localtime(int(temp))
                otherStyleTime = time.strftime("%Y--%m--%d %H:%M:%S", timeArray)
                self.big_list.append(text)
                # The key must match the sheet name used in save_excel
                data = {
                    '评论数据': [otherStyleTime, text]
                }
                self.save_excel(data)
                self.yun_list.append(text)
                print('One comment saved')


    def save_excel(self, data):
        # Example of the expected shape (sheet name -> row values):
        # data = {
        #     '基本详情': ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
        # }
        os_path_1 = os.getcwd() + '/数据/'
        if not os.path.exists(os_path_1):
            os.mkdir(os_path_1)
        # os_path = os_path_1 + self.os_path_name + '.xls'
        os_path = os_path_1 + '评论数据.xls'
        if not os.path.exists(os_path):
            # Create a new workbook (that is, a new Excel file)
            workbook = xlwt.Workbook(encoding='utf-8')
            # Create a new sheet
            worksheet1 = workbook.add_sheet("评论数据", cell_overwrite_ok=True)
            borders = xlwt.Borders()  # Create Borders
            # Use thin solid borders on all four sides
            borders.left = xlwt.Borders.THIN
            borders.right = xlwt.Borders.THIN
            borders.top = xlwt.Borders.THIN
            borders.bottom = xlwt.Borders.THIN
            borders.left_colour = 0x40
            borders.right_colour = 0x40
            borders.top_colour = 0x40
            borders.bottom_colour = 0x40
            style = xlwt.XFStyle()  # Create Style
            style.borders = borders  # Add Borders to Style
            # Center the cell contents
            al = xlwt.Alignment()
            al.horz = 0x02  # horizontal center
            al.vert = 0x01  # vertical center
            style.alignment = al
            # Unused: merge row 0 across columns 0 to 13 for a title cell
            # worksheet1.write_merge(0, 0, 0, 13, '基本详情', style)
            # Header row: comment time, comment content
            excel_data_1 = ('评论时间', '评论内容')
            for i in range(0, len(excel_data_1)):
                worksheet1.col(i).width = 2560 * 3
                #               row, col, content,        style
                worksheet1.write(0, i, excel_data_1[i], style)
            workbook.save(os_path)
        # Append the new row if the workbook exists
        if os.path.exists(os_path):
            # Open the workbook
            workbook = xlrd.open_workbook(os_path)
            # Get the names of all sheets in the workbook
            sheets = workbook.sheet_names()
            for i in range(len(sheets)):
                for name in data.keys():
                    worksheet = workbook.sheet_by_name(sheets[i])
                    # Match the sheet name against the key in data
                    if worksheet.name == name:
                        # Number of rows already in the sheet
                        rows_old = worksheet.nrows
                        # Copy the xlrd workbook into a writable xlwt workbook
                        new_workbook = copy(workbook)
                        # Get the i-th sheet of the writable copy
                        new_worksheet = new_workbook.get_sheet(i)
                        for num in range(0, len(data[name])):
                            new_worksheet.write(rows_old, num, data[name][num])
                        new_workbook.save(os_path)

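    def show_img(self):
        """
        Generate the map-shaped word cloud
        """
        data = ''.join(self.yun_list)
        # The image is used as a mask that gives the word cloud its shape
        bg = np.array(Image.open("qq.jpg"))
        mask = bg
        wc = WordCloud(width=500,  # canvas width (ignored when a mask is given)
                       height=500,  # canvas height (ignored when a mask is given)
                       mask=mask,  # mask image
                       background_color='white',  # background color
                       font_path=r'C:/Windows/Fonts/simfang.ttf',  # font (Chinese text needs a Chinese font installed locally)
                       max_font_size=400,  # largest font size
                       random_state=50,  # seed, so the layout and colors are reproducible
                       )
        wc.generate(data)
        # Display the word cloud with matplotlib
        plt.imshow(wc)
        plt.axis("off")
        # Save the figure as a local image
        plt.savefig('B站视频-词云图.png')
        plt.show()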


if __name__ == '__main__':
    b = BZSpider()
    b.parse_pl_url_response()
    b.show_img()

 

Scraped content:

 

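IV. Training Summary

With the help of my classmates and teacher, I completed this training successfully and gained a lot: I learned how to use the requests request library and the jsonpath data-parsing library.

During the training I found that my understanding of wordcloud was not deep enough. I also came to understand what object-oriented programming means and gained a better understanding of anti-scraping mechanisms.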