Qidian Monthly Ticket Ranking Data Analysis

Reposted

小屁孩 2024-01-30 10:57:36


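Contents

Scouting the page
Fetching the page text
Extracting information with XPath
Cracking the font-based anti-scraping
Extracting and saving the information
Fetching all pages
Full code (confetti)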

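Scouting the page

First, open the Qidian monthly ticket ranking page, https://www.qidian.com/rank/yuepiao , and look around. Before writing any code we need to decide what to collect. Here we extract each novel's title, author, genre, status, synopsis, latest chapter, update time, and monthly ticket count.

With the target fields decided, right-click and choose Inspect (or press F12) to open the developer tools.
Click the element picker and select a novel title to locate it in the markup.
All the information we need turns out to sit inside that same element.

Fetching the page text

First call requests.get to fetch the page source and write it to a file: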

import requests

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36'
}
response = requests.get('https://www.qidian.com/rank/yuepiao', headers=headers)
# Save the raw page source so it can be searched in a text editor
with open("M:/a.txt", 'w') as f:
    f.write(response.text)

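Press Ctrl+F to search the saved text: every field we want is present in the page source.
Next, wrap the page-fetching code in a function: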

def getHtml(url):
    headers = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36'
    }
    response = requests.get(url, headers=headers)
    if response.status_code == 200:
        return response.text
    return None

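Extracting information with XPath

Next, find the tag that holds each field. A page lists 20 novels, and each novel's information sits under its own li tag. Drilling further into the elements and comparing one novel with another shows which tags carry useful information.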


Once the useful tags are identified, we can pull out the fields with XPath:

from lxml import etree

html = getHtml('https://www.qidian.com/rank/yuepiao')
# etree.HTML returns an element tree that supports xpath() directly
html = etree.HTML(html)
# Extract the titles
name = html.xpath('//li//div[@class="book-mid-info"]//h4//a[@data-eid="qd_C40"]//text()')
print(len(name), name)
# Extract the authors
author = html.xpath('//li//div[@class="book-mid-info"]//p[@class="author"]//a[@data-eid="qd_C41"]//text()')
print(len(author), author)
# Extract the genres
types = html.xpath('//li//div[@class="book-mid-info"]//p[@class="author"]//a[@data-eid="qd_C42"]//text()')
print(len(types), types)
# Extract the current status of each novel
status = html.xpath('//li//div[@class="book-mid-info"]//p[@class="author"]//span/text()')
print(len(status), status)
# Extract the synopses
intro = html.xpath('//li//div[@class="book-mid-info"]//p[@class="intro"]//text()')
intro = [i.strip() for i in intro]  # strip whitespace on both sides
print(len(intro), intro)
# Extract the latest chapters
update = html.xpath('//li//div[@class="book-mid-info"]//p[@class="update"]//a[@data-eid="qd_C43"]//text()')
update = [i.strip() for i in update]  # strip whitespace on both sides
print(len(update), update)
# Extract the most recent update times
date = html.xpath('//li//div[@class="book-mid-info"]//p[@class="update"]//span//text()')
print(len(date), date)

The printed results show that we now have the data we wanted!
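To try this kind of path-based extraction without hitting the site, here is a small offline sketch. It is not the lxml code above: it uses the standard library's xml.etree.ElementTree, which supports only a subset of XPath, and both the SAMPLE markup and the parse_books name are made up, mirroring only the class names seen so far. It walks each li and reads the fields relative to it, so the fields of one book stay together.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed fragment that only mirrors the class names used above.
SAMPLE = """
<ul>
  <li>
    <div class="book-mid-info">
      <h4><a>Book A</a></h4>
      <p class="author"><a class="name">Author A</a><span>Ongoing</span></p>
    </div>
  </li>
  <li>
    <div class="book-mid-info">
      <h4><a>Book B</a></h4>
      <p class="author"><a class="name">Author B</a><span>Completed</span></p>
    </div>
  </li>
</ul>
"""


def parse_books(markup):
    """Return one dict per <li>, reading every field relative to that <li>."""
    root = ET.fromstring(markup)
    books = []
    for li in root.findall(".//li"):
        info = li.find("div[@class='book-mid-info']")
        books.append({
            "name": info.find("h4/a").text,
            "author": info.find("p[@class='author']/a[@class='name']").text,
            "status": info.find("p[@class='author']/span").text,
        })
    return books


books = parse_books(SAMPLE)
for book in books:
    print(book)
```

Real pages are rarely well-formed XML, which is one reason the article parses with lxml's HTML parser instead.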

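Cracking the font-based anti-scraping

Next comes the monthly ticket count. But when we inspect it in the developer tools, the markup shows gibberish.

The page only displays little boxes there, so we cannot tell what the values are. Let's look in the page source file we saved earlier.

There we see sequences of the form &#....;. This usually means font-based anti-scraping is in use. ~~I did not expect a novel ranking page to have anti-scraping measures.~~
Scroll up a little and, huh, it turns up almost immediately.
The @font-face rule contains several URLs, and these are the font files. Copy https://qidian.gtimg.com/qd_anti_spider/jUlcIiMg.woff into a new tab to download it (you can also use the .ttf version, https://qidian.gtimg.com/qd_anti_spider/jUlcIiMg.ttf ; I downloaded both).
With the font downloaded, open the .ttf file in the online font editor at http://fontstore.baidu.com/static/editor/index.html .
The font turns out to contain just the digits 0-9. We then process the font file with the Python library fontTools, which can be installed with pip install fontTools.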

from fontTools.ttLib import TTFont

font = TTFont('M:/jUlcIiMg.woff')
font.saveXML('M:/font.xml')

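The code above converts the .woff / .ttf file into an .xml file, which we then open in the browser.
The contents match what the font editor showed, so this is the right font! We use fontTools' getBestCmap() method to get the mapping from code points to glyph names.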

from fontTools.ttLib import TTFont

font = TTFont('M:/jUlcIiMg.woff')
font.saveXML('M:/font.xml')
print(font.getBestCmap())

Output:

{100293: 'eight', 100295: 'four', 100296: 'three', 100297: 'one', 100298: 'period', 100299: 'two', 100300: 'nine', 100301: 'five', 100302: 'zero', 100303: 'six', 100304: 'seven'}

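Wondering why this looks different from what you saw? The font editor and the XML file show the code points in hexadecimal (prefixed with 0x), whereas fontTools prints them in decimal. Check with a calculator if you don't believe it.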
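Instead of a calculator, the conversion can be checked in Python. This is a small sketch using only the standard library; the value 100293 is the first key of the getBestCmap() output shown in the article, and the variable names are just for illustration.

```python
import html

code_point = 100293            # decimal key printed by getBestCmap()
print(hex(code_point))         # hexadecimal form, as a font editor shows it
print(int("187C5", 16))        # converting the hexadecimal form back to decimal

# The decimal and hexadecimal spellings of the HTML entity decode to the same character.
decimal_entity = html.unescape("&#100293;")
hex_entity = html.unescape("&#x187C5;")
same_char = decimal_entity == hex_entity == chr(code_point)
print(same_char)
```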
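With the mapping in hand, we convert it by hand: the English number words become Arabic digits, and the useless entry 100298: 'period' is dropped. Because the page source writes these characters in the form &#××××××;, we also rewrite the keys in that form to make replacement easy: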

from fontTools.ttLib import TTFont

font = TTFont('M:/jUlcIiMg.woff')
font.saveXML('M:/font.xml')
print(font.getBestCmap())
# Dictionary from English number words to digits
camp = {'zero': 0, 'one': 1, 'two': 2, 'three': 3, 'four': 4, 'five': 5, 'six': 6, 'seven': 7, 'eight': 8,
        'nine': 9}
cp = {}
for k, v in font.getBestCmap().items():
    try:  # skip non-digit glyphs such as 100298: 'period'
        cp['&#' + str(k) + ';'] = camp[v]
    except KeyError:
        pass
print(cp)

Output:

{'&#100293;': 8, '&#100295;': 4, '&#100296;': 3, '&#100297;': 1, '&#100299;': 2, '&#100300;': 9, '&#100301;': 5, '&#100302;': 0, '&#100303;': 6, '&#100304;': 7}

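We now have the font mapping, so we can use regex substitution to replace these characters in the fetched page source with ordinary Arabic digits according to the mapping: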

import re

import requests
from fontTools.ttLib import TTFont


def getHtml(url):
    headers = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36'
    }
    response = requests.get(url, headers=headers)
    if response.status_code == 200:
        return response.text
    return None


font = TTFont('M:/jUlcIiMg.woff')
font.saveXML('M:/font.xml')
print(font.getBestCmap())

camp = {'zero': 0, 'one': 1, 'two': 2, 'three': 3, 'four': 4, 'five': 5, 'six': 6, 'seven': 7, 'eight': 8,
        'nine': 9}
cp = {}
for k, v in font.getBestCmap().items():
    try:
        cp['&#' + str(k) + ';'] = camp[v]
    except KeyError as e:
        pass
print(cp)
# Fetch the page source and save it to a text file
html = getHtml('https://www.qidian.com/rank/yuepiao')
with open('M:/html.txt', 'w') as f:
    f.write(html)

# Replace the obfuscated digits in the page source with normal digits and save the result
for key in cp.keys():
    html = re.sub(key, str(cp[key]), html)
with open('M:/html_change.txt', 'w') as f:
    f.write(html)

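After it runs, check the two text files to see whether the replacement worked.

Hmm, why did nothing get replaced? ~~Did I get re.sub wrong?~~ No: the character codes here do not match a single key in the mapping we just built.
Scrolling up to the @font-face rule shows that the font has changed! We used jUlcIiMg.woff, but this page references OMkqwDTS.woff. Apparently every request gets a different font, so we cannot simply download one woff file once and reuse it.
Instead, every time we fetch the page source we extract the font URL with a regex, download the font, parse it, and then do the replacement. To that end we change the fetching function as follows: after fetching the page, it extracts the font URL and saves the font as font.woff.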

import re

import requests

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36'
}

def getHtml(url):
    response = requests.get(url, headers=headers)
    if response.status_code != 200:
        return None
    # Raw string so the backslashes reach the regex engine unchanged
    woff = re.search(r"format\('eot'\); src: url\('(.+?)'\) format\('woff'\)", response.text, re.S)
    if woff is None:  # the page did not contain the expected @font-face rule
        return None
    fontfile = requests.get(woff.group(1), headers=headers)
    if fontfile.status_code != 200:
        return None
    # Save the font that belongs to this particular response
    with open('M:/font.woff', 'wb') as f:
        f.write(fontfile.content)
    return response.text
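The font-URL regex can be tried offline before pointing it at the live page. The sketch below is only an illustration: SAMPLE_CSS is a hypothetical @font-face excerpt shaped to match the pattern (the real page's CSS may differ, and the .eot URL is a placeholder), and find_woff_url is a made-up helper name. It also shows what happens when the pattern does not match.

```python
import re

# Hypothetical excerpt shaped to match the pattern; the real CSS may differ.
SAMPLE_CSS = (
    "@font-face { font-family: jUlcIiMg; "
    "src: url('https://example.invalid/font.eot') format('eot'); "
    "src: url('https://qidian.gtimg.com/qd_anti_spider/jUlcIiMg.woff') format('woff'), "
    "url('https://qidian.gtimg.com/qd_anti_spider/jUlcIiMg.ttf') format('truetype'); }"
)

# Same pattern as in getHtml(), written as a raw string and compiled once.
WOFF_PATTERN = re.compile(
    r"format\('eot'\); src: url\('(.+?)'\) format\('woff'\)", re.S
)


def find_woff_url(page_text):
    """Return the .woff URL, or None when the pattern does not match."""
    match = WOFF_PATTERN.search(page_text)
    return match.group(1) if match else None


print(find_woff_url(SAMPLE_CSS))
print(find_woff_url("<html>no font here</html>"))
```

Returning None rather than calling .group(1) on a failed match avoids an AttributeError if the site ever changes its stylesheet.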

Test again:

import re

import requests
from fontTools.ttLib import TTFont

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36'
}

def getHtml(url):
    response = requests.get(url, headers=headers)
    if response.status_code != 200:
        return None
    woff = re.search(r"format\('eot'\); src: url\('(.+?)'\) format\('woff'\)", response.text, re.S)
    if woff is None:  # the page did not contain the expected @font-face rule
        return None
    fontfile = requests.get(woff.group(1), headers=headers)
    if fontfile.status_code != 200:
        return None
    with open('M:/font.woff', 'wb') as f:
        f.write(fontfile.content)
    return response.text


# Fetch the page first: getHtml() also downloads the matching font.woff
html = getHtml('https://www.qidian.com/rank/yuepiao')
with open('M:/html.txt', 'w') as f:
    f.write(html)

# Load the font only after it has been downloaded for this response
font = TTFont('M:/font.woff')
print(font.getBestCmap())

camp = {'zero': 0, 'one': 1, 'two': 2, 'three': 3, 'four': 4, 'five': 5, 'six': 6, 'seven': 7, 'eight': 8,
        'nine': 9}
cp = {}
for k, v in font.getBestCmap().items():
    try:
        cp['&#' + str(k) + ';'] = camp[v]
    except KeyError:
        pass
print(cp)

for key in cp.keys():
    html = re.sub(key, str(cp[key]), html)
with open('M:/html_change.txt', 'w') as f:
    f.write(html)

The font is fetched successfully.

And the replacement works!!!

Let's wrap the font handling in a function to make the code tidier.

def fontProc(text):
    # Load the font saved by getHtml() for the current page
    font = TTFont('M:/font.woff')
    camp = {'zero': 0, 'one': 1, 'two': 2, 'three': 3, 'four': 4, 'five': 5, 'six': 6, 'seven': 7, 'eight': 8,
            'nine': 9}
    cp = {}
    for k, v in font.getBestCmap().items():
        try:  # skip glyphs that are not digits
            cp['&#' + str(k) + ';'] = camp[str(v)]
        except KeyError:
            pass
    # Replace every obfuscated entity with its digit
    for key in cp.keys():
        text = re.sub(key, str(cp[key]), text)
    return text
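As an alternative to running one re.sub per key, the replacement can be done in a single pass with a callback. This is a sketch rather than the article's method: decode_entities and DIGITS are hypothetical names, and SAMPLE_CMAP is hard-coded from the getBestCmap() output printed earlier so the example runs without a font file.

```python
import re

# Glyph names mapped to digit characters, mirroring the dictionary in fontProc().
DIGITS = {'zero': '0', 'one': '1', 'two': '2', 'three': '3', 'four': '4',
          'five': '5', 'six': '6', 'seven': '7', 'eight': '8', 'nine': '9'}

# Hard-coded copy of the cmap printed earlier; a real run would use font.getBestCmap().
SAMPLE_CMAP = {100293: 'eight', 100295: 'four', 100296: 'three', 100297: 'one',
               100298: 'period', 100299: 'two', 100300: 'nine', 100301: 'five',
               100302: 'zero', 100303: 'six', 100304: 'seven'}


def decode_entities(text, cmap):
    """Replace every &#N; whose glyph is a digit; leave all other entities alone."""
    def swap(match):
        glyph = cmap.get(int(match.group(1)))
        return DIGITS.get(glyph, match.group(0))
    return re.sub(r'&#(\d+);', swap, text)


decoded = decode_entities('<span>&#100297;&#100299;&#100296;&#100295;</span> &#38;', SAMPLE_CMAP)
print(decoded)
```

Entities that are not in the font, such as &#38; here, pass through unchanged, so ordinary HTML escapes in the page are not disturbed.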

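Extracting and saving the information

With the font replaced, we can extract the monthly ticket counts with XPath. Our extraction function now looks like this: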

def getBook(html):
    html = etree.HTML(html)  # the parsed tree supports xpath() directly
    name = html.xpath('//li//div[@class="book-mid-info"]//h4//a//text()')
    author = html.xpath('//li//div[@class="book-mid-info"]//p[@class="author"]//a[@class="name"]//text()')
    types = html.xpath('//li//div[@class="book-mid-info"]//p[@class="author"]//a[@data-eid="qd_C42"]//text()')
    status = html.xpath('//li//div[@class="book-mid-info"]//p[@class="author"]//span//text()')
    intro = html.xpath('//li//div[@class="book-mid-info"]//p[@class="intro"]//text()')
    intro = [i.strip() for i in intro]
    update = html.xpath('//li//div[@class="book-mid-info"]//p[@class="update"]//a[@data-eid="qd_C43"]//text()')
    update = [i.strip() for i in update]

    date = html.xpath('//li//div[@class="book-mid-info"]//p[@class="update"]//span//text()')
    tickets = html.xpath('//li//div[@class="book-right-info"]//div[@class="total"]//p//span//span//text()')
    book = zip(name, author, types, status, intro, update, date, tickets)
    return book
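One thing to keep in mind with this approach: zip stops at the shortest list, so if any XPath query returns fewer (or more) items than the others, books are silently dropped or paired with the wrong fields. Below is a minimal sketch of a guard, with made-up data and a hypothetical helper name zip_checked; it is not part of the original code.

```python
def zip_checked(**columns):
    """Zip named lists into row dicts, raising if their lengths differ."""
    lengths = {key: len(values) for key, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"column lengths differ: {lengths}")
    keys = list(columns)
    return [dict(zip(keys, row)) for row in zip(*columns.values())]


names = ["Book A", "Book B", "Book C"]
tickets = ["120", "95"]            # one value is missing

# Plain zip silently drops the last book.
print(list(zip(names, tickets)))

# The checked version reports the mismatch instead.
try:
    zip_checked(name=names, tickets=tickets)
except ValueError as err:
    print(err)
```

Printing the list lengths, as the earlier extraction script does with len(...), serves the same purpose while experimenting.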

For convenience, we write a function that saves the information:

def saveInfo(url):
    html = getHtml(url)
    html = fontProc(html)
    book = getBook(html)
    # Open the output file once and append every book on this page
    with open('M:/novels.txt', 'a+') as f:
        for name, author, types, status, intro, update, date, tickets in book:
            f.write('Title: ' + name + '\n')
            f.write('Author: ' + author + '   Genre: ' + types + '   Status: ' + status + '\n')
            f.write('Synopsis: ' + intro + '\n')
            f.write(update + '   Updated: ' + date + '\n')
            f.write('Monthly tickets: ' + tickets + '\n')
            f.write('\n\n')

Let's run it:

saveInfo('https://www.qidian.com/rank/yuepiao')

The information we wanted is captured perfectly.

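Fetching all pages

With the analysis and code above we can extract every field, but so far only for a single page. Now let's scrape all the pages.
Clicking page 2 changes the URL to https://www.qidian.com/rank/yuepiao?page=2 ,
and clicking page 3 changes it to https://www.qidian.com/rank/yuepiao?page=3 .
The pattern is clear: the page parameter is simply the page number. Since there are only five pages in total, we write: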

for page in range(1, 5 + 1):
    url = 'https://www.qidian.com/rank/yuepiao?page=%d'%page
    saveInfo(url)

Running the five-page crawl reveals a problem with the character \xa0. This is the no-break space (&nbsp; in HTML), a character from the extended part of Latin-1.
Replacing it with an ordinary space is enough.

In the getBook() function, change:

update = [i.strip() for i in update]

to:

update = [i.strip().replace('\xa0', ' ') for i in update]

After the change, run it again. ~~Argh, not again.~~

In the same way, in getBook() change:

intro = [i.strip() for i in intro]

to:

intro = [i.strip().replace('\u2022', ' ') for i in intro]

Run it once more. ~~#@21lkad@#!4012~~

Replace again:

intro = [i.strip().replace('\u2022', ' ').replace('\u2003', ' ') for i in intro]

Run it again, and it finally works! All the information we need has been captured.
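The article does not show the error message, so the following is an assumption: failures like these typically come from writing text with a legacy default codec such as GBK on Chinese Windows, which cannot represent characters like \xa0. If that is the cause, opening the output file with encoding="utf-8" is another way around it, without replacing characters one by one. The sketch below uses a temporary file and a made-up sample string.

```python
import os
import tempfile

# Made-up sample containing the three characters replaced above.
sample = "Chapter\xa012 \u2022 intro\u2003text"

# Assumption: the default codec was GBK, which cannot encode '\xa0'.
try:
    sample.encode("gbk")
    gbk_ok = True
except UnicodeEncodeError as err:
    gbk_ok = False
    print("gbk cannot encode:", repr(err.object[err.start:err.end]))

# UTF-8 can represent every Unicode character, so no replacement is needed.
path = os.path.join(tempfile.mkdtemp(), "novels.txt")
with open(path, "a", encoding="utf-8") as f:
    f.write(sample + "\n")
with open(path, encoding="utf-8") as f:
    round_trip = f.read()
print(round_trip == sample + "\n")
```

The character replacements in getBook() remain useful if the file must stay in a legacy encoding or if the special spaces are unwanted in the output anyway.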

Full code (confetti)

import requests
from lxml import etree
from fontTools.ttLib import TTFont
import re

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36'
}
woffDir = './font.woff'
novelsDir = './novels.txt'


def getHtml(url):
    response = requests.get(url, headers=headers)
    if response.status_code != 200:
        return None
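    # Raw string so the backslashes reach the regex engine unchanged
    woff = re.search(r"format\('eot'\); src: url\('(.+?)'\) format\('woff'\)", response.text, re.S)
    if woff is None:  # no matching @font-face rule in the page
        return None
    fontfile = requests.get(woff.group(1), headers=headers)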
    if fontfile.status_code != 200:
        return None
    f = open(woffDir, 'wb')
    f.write(fontfile.content)
    f.close()
    response.encoding = response.apparent_encoding
    return response.text


def fontProc(text):
    font = TTFont(woffDir)
    camp = {'zero': 0, 'one': 1, 'two': 2, 'three': 3, 'four': 4, 'five': 5, 'six': 6, 'seven': 7, 'eight': 8,
            'nine': 9}
    cp = {}
    for k, v in font.getBestCmap().items():
        try:  # skip glyphs that are not digits
            cp['&#' + str(k) + ';'] = camp[str(v)]
        except KeyError as e:
            pass
    for key in cp.keys():
        text = re.sub(key, str(cp[key]), text)
    return text


def getBook(html):
    html = etree.HTML(html)  # the parsed tree supports xpath() directly
    name = html.xpath('//li//div[@class="book-mid-info"]//h4//a//text()')
    author = html.xpath('//li//div[@class="book-mid-info"]//p[@class="author"]//a[@class="name"]//text()')
    types = html.xpath('//li//div[@class="book-mid-info"]//p[@class="author"]//a[@data-eid="qd_C42"]//text()')
    status = html.xpath('//li//div[@class="book-mid-info"]//p[@class="author"]//span//text()')
    intro = html.xpath('//li//div[@class="book-mid-info"]//p[@class="intro"]//text()')
    intro = [i.strip().replace('\u2022', ' ').replace('\u2003', ' ') for i in intro]
    update = html.xpath('//li//div[@class="book-mid-info"]//p[@class="update"]//a[@data-eid="qd_C43"]//text()')
    update = [i.strip().replace('\xa0', ' ') for i in update]
    date = html.xpath('//li//div[@class="book-mid-info"]//p[@class="update"]//span//text()')
    tickets = html.xpath('//li//div[@class="book-right-info"]//div[@class="total"]//p//span//span//text()')
    book = zip(name, author, types, status, intro, update, date, tickets)
    return book


def saveInfo(url):
    html = getHtml(url)
    html = fontProc(html)
    book = getBook(html)
    # Open the output file once and append every book on this page
    with open(novelsDir, 'a+') as f:
        for name, author, types, status, intro, update, date, tickets in book:
            f.write('Title: ' + name + '\n')
            f.write('Author: ' + author + '   Genre: ' + types + '   Status: ' + status + '\n')
            f.write('Synopsis: ' + intro + '\n')
            f.write(update + '   Updated: ' + date + '\n')
            f.write('Monthly tickets: ' + tickets + '\n')
            f.write('\n\n')


for page in range(1, 5 + 1):
    url = 'https://www.qidian.com/rank/yuepiao?page=%d' % page
    saveInfo(url)

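This article is reposted content. We respect the original author's copyright in the article. If there are content errors or infringement concerns, the original author is welcome to contact us to have the content corrected or the article removed.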