Python crawler 408: scraping Baidu Wenku VIP documents with Python

Reposted

mob6454cc79ab13 2023-12-06 23:33:37

Tags: python crawler 408, Baidu Wenku crawler, requests, regular expressions, xpath. Category: Python, backend development

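Introduction

Thanks for stopping by to read this post. The demo uses Python 3.7.2 and needs two libraries installed beforehand: requests, and lxml with its etree module (the original author had heard that newer lxml releases drop etree; current releases still ship it).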
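Before going further, it can help to confirm that both libraries are importable. The snippet below is a minimal sketch that uses only the standard library; the helper name missing_modules is made up for this illustration and is not part of the original post.

```python
import importlib.util
import sys


def missing_modules(names):
    """Return the module names from `names` that cannot be found."""
    missing = []
    for name in names:
        try:
            found = importlib.util.find_spec(name) is not None
        except ModuleNotFoundError:
            # Raised for a dotted name whose parent package is absent.
            found = False
        if not found:
            missing.append(name)
    return missing


if __name__ == "__main__":
    print("Python", sys.version.split()[0])
    print("Missing:", missing_modules(["requests", "lxml.etree"]) or "nothing")
```

If anything is reported missing, installing it with pip (pip install requests lxml) should be enough in most environments.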

1. Page analysis

Start by opening the Baidu Wenku home page: https://wenku.baidu.com

Click through to any document.

(Document used in this experiment: https://wenku.baidu.com/view/e77975cdb8f3f90f76c66137ee06eff9aef849c0.html)

Right-click an empty area of the page and choose View Page Source. What opens is the HTML source of the page.

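Reading through the source, the text that is visible in the document viewer is nowhere to be found, which suggests the content is loaded asynchronously. Line 363 of the source, however, looks unusual.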

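That line contains two keywords worth noting:

pageLoadUrl: this looks like a label saying that the URL which follows is used to load a page.

wkbjcloudbos.bdimg.com: this looks like the domain of those page-loading URLs.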
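To make the idea concrete, here is a small offline sketch that pulls the URL following each pageLoadUrl key out of one line of source. SAMPLE_LINE is a hypothetical excerpt, not copied from the real page: its escaping (\x22 for quotes, \\\/ for slashes) is an assumption inferred from the replacements used in the script in section 2, and the path example/0.json is invented.

```python
import re

# Hypothetical excerpt shaped like the line described above; a real page will differ.
SAMPLE_LINE = (
    r"WkInfo.htmlUrls = '{\x22json\x22:[{\x22pageIndex\x22:1,"
    r"\x22pageLoadUrl\x22:\x22https:\\\/\\\/wkbjcloudbos.bdimg.com\\\/example\\\/0.json\x22}]}';"
)


def find_page_urls(line):
    """Return the URLs that follow each pageLoadUrl key in `line`."""
    # Remove the escaped quotes, then turn the escaped slashes into plain ones.
    cleaned = line.replace(r"\x22", "").replace(r"\\\/", "/")
    # After cleaning, each entry ends with pageLoadUrl:<url>}.
    return re.findall(r"pageLoadUrl:(.+?)}", cleaned)


print(find_page_urls(SAMPLE_LINE))
```

With the sample above, the result is a one-element list holding the URL on the suspected domain.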

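To verify this, go back to the document that is already open.

Press F12 to open the developer tools, switch to the Network tab, and press F5 to reload the page.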

In the URL filter box, type the domain we suspected earlier:

wkbjcloudbos.bdimg.com

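Three suspicious requests show up, with the type displayed as octet-stre… . A Baidu search describes this as a file that is downloaded as a data stream.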
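The same filtering can be sketched in code. The example below works on a hypothetical list of (URL, content type) pairs, since the real request log is only shown in the browser; it assumes that the truncated type in the Network tab stands for application/octet-stream, and every URL path in it is invented.

```python
from urllib.parse import urlparse

# Hypothetical request log; only the host name comes from the article.
REQUESTS = [
    ("https://wenku.baidu.com/view/example.html", "text/html"),
    ("https://wkbjcloudbos.bdimg.com/example/0.json", "application/octet-stream"),
    ("https://wkbjcloudbos.bdimg.com/example/0.png", "image/png"),
]


def suspicious(records, host="wkbjcloudbos.bdimg.com",
               content_type="application/octet-stream"):
    """Return the URLs served by `host` with the given content type."""
    return [
        url
        for url, ctype in records
        if urlparse(url).hostname == host and ctype == content_type
    ]


print(suspicious(REQUESTS))
```

Comparing the parsed host name rather than searching the whole URL avoids false hits when the domain merely appears inside a query string.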

Let's check: right-click the first request and choose Open in New Tab.

Click OK.

The browser reports a syntax error: JSON.parse: unexpected character at line 1 column 1 of the JSON data. Click Raw Data.
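An error at line 1, column 1 means the parser rejected the very first character, so the body is not bare JSON. The sketch below reproduces that position with Python's json module. The callback-wrapped body is purely hypothetical; the article does not show what the real response starts with.

```python
import json


def json_error_position(text):
    """Return (line, column) of the first JSON syntax error, or None if `text` parses."""
    try:
        json.loads(text)
    except json.JSONDecodeError as err:
        return err.lineno, err.colno
    return None


# Hypothetical body that is not bare JSON, versus the same data as plain JSON.
print(json_error_position('callback({"c": "x"})'))
print(json_error_position('{"c": "x"}'))
```

This is one reason the script in section 2 extracts the text with a regular expression instead of parsing the whole body as JSON.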

The raw data contains Unicode-escaped text. Copy a piece of it and decode it in Python.

(Copied text:

\u5e74\u666e\u901a\u9ad8\u7b49\u5b66\u6821\u62db\u751f\u5168\u56fd\u7edf\u4e00\u8003\u8bd5)
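One way to decode the copied text, sketched here with the standard json module: wrap it in double quotes so that it becomes a JSON string literal, then parse it. The helper name decode_escapes is made up for this example.

```python
import json

# The escaped text copied from the raw response, kept as a raw string.
COPIED = r"\u5e74\u666e\u901a\u9ad8\u7b49\u5b66\u6821\u62db\u751f\u5168\u56fd\u7edf\u4e00\u8003\u8bd5"


def decode_escapes(escaped):
    """Decode backslash-u escapes by parsing the text as a JSON string literal."""
    return json.loads('"' + escaped + '"')


print(decode_escapes(COPIED))
```

This prints 年普通高等学校招生全国统一考试 ("National Unified Examination for Admissions to General Universities and Colleges"), which is the phrase to look for in the document.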

Go back to the Wenku document and check whether that text appears in it.

It does appear in the document.

(Repeated checks with other fragments all matched as well; they are left out here to save space.)

2. Code demo

import json
import re

import requests
from lxml import etree


def get_text(url):
    """Print the text of a Baidu Wenku document, page by page."""
    page_headers = {
        "User-Agent": "Mozilla/5.0 (Linux; U; en-us; KFAPWI Build/JDQ39) AppleWebKit/535.19 (KHTML, like Gecko) Silk/3.13 Safari/535.19 Silk-Accelerated=true",
        "Referer": "https://wk.baidu.com/?pcf=2",
        # "br" is left out: requests decodes Brotli only when a brotli package is installed.
        "Accept-Encoding": "gzip, deflate",
    }
    # Fetch the document page with the headers above.
    response = requests.get(url, headers=page_headers, timeout=10)
    # Decode the response body as GB2312.
    response.encoding = "gb2312"
    # Parse the HTML and collect the text of every <script type="text/javascript"> tag.
    html_etree = etree.HTML(response.text)
    scripts = html_etree.xpath('//script[@type="text/javascript"]/text()')
    # The page URLs sit between WkInfo.htmlUrls and WkInfo.verify_user_info.
    match = re.search(
        r"WkInfo\.htmlUrls(.*?)WkInfo\.verify_user_info", "\n".join(scripts), re.S
    )
    if match is None:
        print("WkInfo.htmlUrls not found; the page layout may have changed.")
        return
    # Drop the escaped quotes (\x22) and unescape the slashes (\\\/ becomes /).
    need_infos = match.group(1).replace(r"\x22", "").replace(r"\\\/", "/")
    # Each page entry ends with pageLoadUrl:<url>}.
    url_list = re.findall(r"pageLoadUrl:(.+?)}", need_infos)
    # In each page file, text fragments look like ,{"c":<JSON string>,
    text_find_reg = re.compile(r',{"c":(.*?),')
    for info_url in url_list:
        # Turn any backslashes left in the URL into forward slashes.
        info_url = info_url.replace("\\", "/")
        page_response = requests.get(info_url, headers=page_headers, timeout=10)
        for text in text_find_reg.findall(page_response.text):
            try:
                # json.loads decodes the \uXXXX escapes without running eval on remote data.
                print(json.loads(text), end="")
            except ValueError:
                # Skip fragments that are not a complete JSON value.
                pass


if __name__ == "__main__":
    get_text(input("Enter the document URL: "))
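The text-extraction step can be tried offline. The sketch below applies the same ,{"c":...,  pattern to a hypothetical page body; every field in SAMPLE_BODY other than "c" is invented, and the real files may be laid out differently.

```python
import json
import re

# Hypothetical page body; only the "c" key comes from the pattern used in the script.
SAMPLE_BODY = (
    r'{"page":1,"body":[{"t":"pic"},{"c":"\u5e74\u666e","p":1},'
    r'{"c":{"ix":0,"iy":0},"p":2},{"c":"\u901a","p":3}]}'
)

TEXT_PATTERN = re.compile(r',{"c":(.*?),')


def extract_text(body):
    """Return the decoded fragments captured by TEXT_PATTERN that are valid JSON."""
    fragments = []
    for raw in TEXT_PATTERN.findall(body):
        try:
            fragments.append(json.loads(raw))
        except ValueError:
            # For example '{"ix":0', an object cut off at its first comma.
            continue
    return fragments


print(extract_text(SAMPLE_BODY))
```

Note that the pattern only matches a "c" entry that is preceded by a comma and that stops at the first comma it meets, so text containing a comma is cut short and then skipped. That limitation is inherited from the script above.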
