Python Web Scraping for Beginners: Exercise 1

Original article by 忆_恒心, 2021-11-12 11:06:44

Tags: Python web scraping exercise 1, python, html, xml, anti-scraping. Category: Code Life

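© Copyright belongs to the author. This is an original work by 忆_恒心 on the 51CTO blog; contact the author for permission before reposting, or legal action may be pursued.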

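Preface:

Web scraping in Python is quite convenient. To get up to speed quickly, I recommend practicing this part repeatedly; scraping is also the reason I am learning Python.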

Commonly used modules to install:

pip install send2trash
pip install requests
pip install beautifulsoup4
pip install selenium
pip install openpyxl
pip install PyPDF2
pip install python-docx
pip install imapclient
pip install twilio
pip install pillow
pip install pyobjc-core (macOS only)
pip install pyobjc (macOS only)
pip install python3-xlib (Linux only)
pip install pyautogui

Scraper code:

Exercise 1

#! python3
# downloadXkcd.py - Downloads every single XKCD comic.

import requests, os, bs4

url = 'https://xkcd.com' # starting url
os.makedirs('xkcd', exist_ok=True) # store comics in ./xkcd
while not url.endswith('#'):
    # Download the page.
    print('Downloading page %s...' % url)
    res = requests.get(url)
    res.raise_for_status()

    soup = bs4.BeautifulSoup(res.text, 'html.parser')

    # Find the URL of the comic image.
    comicElem = soup.select('#comic img')
    if comicElem == []:
        print('Could not find comic image.')
    else:
        comicUrl = 'https:' + comicElem[0].get('src')
        # Download the image.
        print('Downloading image %s...' % (comicUrl))
        res = requests.get(comicUrl)
        res.raise_for_status()

        # Save the image to ./xkcd.
        with open(os.path.join('xkcd', os.path.basename(comicUrl)), 'wb') as imageFile:
            for chunk in res.iter_content(100000):
                imageFile.write(chunk)

    # Get the Prev button's url.
    prevLink = soup.select('a[rel="prev"]')[0]
    url = 'https://xkcd.com' + prevLink.get('href')

print('Done.')

Result:

[Screenshot: result of running the script]

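Problem encountered:

The connection attempt failed because the connected party did not properly respond after a period of time, or the connected host failed to respond.

[Screenshot: the connection timeout error]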

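Solutions:

1. Open Command Prompt as administrator and run:

netsh winsock reset

Then restart the computer.

2. Check whether the site being scraped is hosted overseas. The site in this example is, and an unstable mobile data connection can also cause timeouts.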
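
Besides the two fixes above, occasional timeouts can also be absorbed in the script itself by retrying. The following is a minimal sketch, not part of the original script: `fetch_with_retry` and its parameters are hypothetical names, and it assumes the callable passed in raises an `OSError` subclass on network failure, which holds for `requests` because `requests.exceptions.RequestException` derives from `OSError`.

```python
import time


def fetch_with_retry(fetch, url, attempts=3, delay=2.0, sleep=time.sleep):
    """Call fetch(url) until it succeeds or `attempts` tries are used up.

    `fetch` is any callable that takes a URL and raises an OSError subclass
    on network failure (timeouts and refused connections from requests
    qualify, since RequestException derives from OSError).
    """
    if attempts < 1:
        raise ValueError('attempts must be at least 1')
    for attempt in range(1, attempts + 1):
        try:
            return fetch(url)
        except OSError as err:
            if attempt == attempts:
                raise  # out of retries: let the caller see the last error
            print('Attempt %d failed (%s), retrying...' % (attempt, err))
            sleep(delay * attempt)  # wait a little longer after each failure
```

In the xkcd loop, the download call could then be written as `res = fetch_with_retry(lambda u: requests.get(u, timeout=10), url)`. The 10-second timeout is an arbitrary choice for illustration.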

2. Exercise 2: scraping a novel

Preparation:

pip install tqdm

pip install lxml

pip install beautifulsoup4

Implementation:

import requests
import time
from tqdm import tqdm
from bs4 import BeautifulSoup

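# Fetch one chapter page and return its text as a list of paragraphs
def get_content(target):
    req = requests.get(url = target)
    req.encoding = 'utf-8'
    html = req.text
    bf = BeautifulSoup(html, 'lxml')
    texts = bf.find('div', id='content')
    if texts is None:  # layout changed or the request was blocked
        return []
    content = texts.text.strip().split('\xa0'*4)
    return content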

# Find the chapter list, get each chapter's name and link, then fetch the content behind each link
if __name__ == '__main__':
    server = 'https://www.xsbiquge.com'
    book_name = '诡秘之主.txt'
    target = 'https://www.xsbiquge.com/15_15338/'
    req = requests.get(url = target)
    req.encoding = 'utf-8'
    html = req.text
    chapter_bs = BeautifulSoup(html, 'lxml')
    chapters = chapter_bs.find('div', id='list')  # locate the div with id "list" so the <a> tags inside it can be found
    chapters = chapters.find_all('a')
    for chapter in tqdm(chapters):
        chapter_name = chapter.string
        url = server + chapter.get('href')
        content = get_content(url)  # fetch the text of this chapter
        with open(book_name, 'a', encoding='utf-8') as f:
            f.write(chapter_name)
            f.write('\n')
            f.write('\n'.join(content))
            f.write('\n')

Result:

[Screenshot: result of running the script]

Summary:

File reading and writing, static pages, a warm-up exercise for beginners.
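
One fragile spot in the script above is `server + chapter.get('href')`, which only works when every `href` is a root-relative path. The sketch below shows an alternative using the standard library's `urllib.parse.urljoin`. The `href` values are made-up examples rather than links taken from the real site, and `chapter_url` is a hypothetical helper name.

```python
from urllib.parse import urljoin

# Table-of-contents page used in the script above.
TARGET = 'https://www.xsbiquge.com/15_15338/'


def chapter_url(href, base=TARGET):
    """Resolve a chapter link against the page it was found on."""
    return urljoin(base, href)


# Hypothetical href values, one per common link style.
examples = [
    '/15_15338/1.html',                       # root-relative
    '2.html',                                 # relative to the current page
    'https://www.xsbiquge.com/other/3.html',  # already absolute
]
for href in examples:
    print(chapter_url(href))
```

All three styles resolve to a full URL, so the loop would not need to know which style the site uses.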

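3. Manga scraping exercise

The most basic anti-scraping measure of all: pressing F12 directly does not work.

Workaround: put view-source: in front of the URL.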

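Analyzing the code:

Difficulties:

Dynamic rendering with JS: the page does not give the image information directly, so the addresses usually have to be extracted with regular expressions.
Downloading images runs into the site's restriction that only in-site requests are served.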
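
To illustrate the first difficulty, here is a minimal sketch of pulling image paths out of an inline script with a regular expression. The page source, the variable name `pages`, and the helper `extract_image_paths` are all assumptions made for illustration; a real site will use its own names and formats, and the pattern has to be adapted after inspecting the actual source.

```python
import json
import re

# Hypothetical page source: the image list lives in a script variable,
# not in <img> tags. Real sites use their own variable names and formats.
SAMPLE_HTML = '''
<html><body>
<div id="reader"></div>
<script>
var pages = ["/img/ch1/001.jpg", "/img/ch1/002.jpg"];
</script>
</body></html>
'''

# Capture the JSON-style array assigned to `pages`.
PAGES_PATTERN = re.compile(r'var\s+pages\s*=\s*(\[.*?\])\s*;', re.DOTALL)


def extract_image_paths(html):
    """Return the image paths embedded in the page's script, or [] if absent."""
    match = PAGES_PATTERN.search(html)
    if match is None:
        return []
    return json.loads(match.group(1))


print(extract_image_paths(SAMPLE_HTML))
```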

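One typical anti-scraping technique here relies on the Referer header.

The Referer can be thought of as the page you came from: you open the chapter URL first, then the image link. When the image is requested, the Referer holds the chapter URL.

For example, the 动漫之家 site works like this: a user who requests the image from within the site gets to see it, and a user coming from anywhere else does not.

Whether a user counts as coming from within the site is decided by a simple check on the Referer.

This is a very typical anti-scraping measure!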

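Solution:

It is simple: whatever the site asks for, we give it. Here that means sending a Referer header set to the chapter URL along with the image request.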
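
A minimal sketch of supplying the Referer when downloading an image follows. It is an illustration under assumptions, not the code from the original post: `download_image` is a hypothetical helper, and its `get` argument is assumed to behave like `requests.get` (it accepts a `headers` keyword and returns a response with `raise_for_status()` and `.content`).

```python
def download_image(get, image_url, chapter_url, path):
    """Download image_url while presenting chapter_url as the Referer.

    `get` is assumed to behave like requests.get: it accepts a `headers`
    keyword and returns a response with raise_for_status() and .content.
    """
    # Tell the server the request comes from the chapter page.
    headers = {'Referer': chapter_url}
    res = get(image_url, headers=headers)
    res.raise_for_status()
    with open(path, 'wb') as image_file:
        image_file.write(res.content)
    return path
```

With requests installed, a call would look like `download_image(requests.get, image_url, chapter_url, '001.jpg')`, where both URLs come from the chapter page being scraped.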

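Reflections:
The hard part of this exercise is dealing with dynamically loaded content. Dynamic loading is very common, and it makes pages harder to analyze when scraping. I will see later whether I can find other examples to learn from.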
