python爬虫抓取页面 python爬取整个网站

转载

mob64ca14150f43 2023-09-14 16:54:38

文章标签 python爬虫抓取页面爬取数据 python爬虫获取整个网页数据获取大量数据 文章分类 Python 后端开发

本案例是基于PyCharm开发的，也可以使用idea。
在项目内新建一个python文件Test.py Test.py

# 导入urllib下的request模块
import urllib.request
# 导入正则匹配包
import re

# -*- encoding:utf-8 -*-
"""
@作者：小思
@文件名：Test.py
@时间：2018/11/13  14:42
@文档说明:测试爬虫（以爬取https://www.ittime.com.cn/news/zixun.shtml上的网页数据为例）
"""

# 步骤
# 1.确定要爬取数据的网址
# 2.获取该网址的源码
# 3.使用正则表达式去匹配网址的源码（匹配所需要的数据类型）
# 4.将爬取的数据保存至本地或者数据库

# 确定要爬取数据的网址
url="https://www.ittime.com.cn/news/zixun.shtml"
# 该网址的源码(以该网页的原编码方式进行编码，特殊字符编译不能编码就设置ignore)
webSourceCode=urllib.request.urlopen(url).read().decode("utf-8","ignore")

# 匹配数据的正则表达式
# 所有的图片
imgRe=re.compile(r'src="(.*?\.jpg)"')
# 所有的标题
titleRe=re.compile(r'<h2><a href=".*?" target="_blank">(.*?)</a></h2>')
# 所有的简介
contentRe=re.compile(r'<p>(.*?)</p>')
# 所有的作者
authorRe=re.compile(r'<span class="pull-left from_ori">(.*?)<span class="year">(.*?)</span></span>')
# 匹配网页对应的标题数据
titles=titleRe.findall(webSourceCode)
images=imgRe.findall(webSourceCode)
content=contentRe.findall(webSourceCode)
authors=authorRe.findall(webSourceCode)
print("标题==============================================================")
for title in titles:
    print(title)
print("图片==============================================================")
for image in images:
    print("https://www.ittime.com.cn"+image)
print("内容简介==============================================================")
for c in content:
    print(c)
print("作者==============================================================")
for author in authors:
    print(author[0])
print("时间==============================================================")
for time in authors:
    print(author[1])

运行Test.py，控制台输出信息。

python爬虫抓取页面 python爬取整个网站_python爬虫

python爬虫抓取页面 python爬取整个网站_爬取数据_02

python爬虫抓取页面 python爬取整个网站_获取整个网页数据_03

python爬虫抓取页面 python爬取整个网站_获取大量数据_04

python爬虫抓取页面 python爬取整个网站_python爬虫_05

如果分页的信息也全部需要，则写一个集合来保存这些需要读取数据的网址，将Test.py封装成方法。

在循环里依次调用

Test.py

# 导入urllib下的request模块
import urllib.request
# 导入正则匹配包
import re

# -*- encoding:utf-8 -*-
"""
@作者：小思
@文件名：Test.py
@时间：2018/11/13  14:42
@文档说明:测试爬虫（以爬取https://www.ittime.com.cn/news/zixun.shtml上的图片为例）
"""

# 步骤
# 1.确定要爬取数据的网址
# 2.获取该网址的源码
# 3.使用正则表达式去匹配网址的源码（匹配所需要的数据类型）
# 4.将爬取的数据保存至本地或者数据库
def getResouces(url):
    # 该网址的源码(以该网页的原编码方式进行编码，特殊字符编译不能编码就设置ignore)
    webSourceCode=urllib.request.urlopen(url).read().decode("utf-8","ignore")

    # 匹配数据的正则表达式
    # 所有的图片
    imgRe=re.compile(r'src="(.*?\.jpg)"')
    # 所有的标题
    titleRe=re.compile(r'<h2><a href=".*?" target="_blank">(.*?)</a></h2>')
    # 所有的简介
    contentRe=re.compile(r'<p>(.*?)</p>')
    # 所有的作者
    authorRe=re.compile(r'<span class="pull-left from_ori">(.*?)<span class="year">(.*?)</span></span>')
    # 匹配网页对应的标题数据
    titles=titleRe.findall(webSourceCode)
    images=imgRe.findall(webSourceCode)
    content=contentRe.findall(webSourceCode)
    authors=authorRe.findall(webSourceCode)
    print("标题==============================================================")
    for title in titles:
        print(title)
    print("图片==============================================================")
    for image in images:
        print("https://www.ittime.com.cn"+image)
    print("内容简介==============================================================")
    for c in content:
        print(c)
    print("作者==============================================================")
    for author in authors:
        print(author[0])
    print("时间==============================================================")
    for time in authors:
        print(author[1])

# 读取前十页的数据
for i in range(2,10):
    getResouces(f"https://www.ittime.com.cn/news/zixun_{i}.shtml")

注意！！！
①无论是java后台还是python后台需要大量的数据，都可以使用这种方式，它读取速度非常快，可以保存到本地，或者数据库。读取的时候要保持有网络哦~

②使用python爬取网页的数据并不困难，重要的是对你所需的数据的源代码的分析，要善于寻找规律，并且写出正确的正则表达式

本文章为转载内容，我们尊重原作者对文章享有的著作权。如有内容错误或侵权问题，欢迎原作者联系我们进行内容更正或删除文章。