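The Scrapy Crawler Framework
1. Overview
Scrapy is an open-source framework written for crawling websites and extracting structured data from them. Its uses go well beyond web crawling: it is also applied to data mining, data monitoring, and automated testing. Scrapy is an asynchronous framework built on Twisted; its architecture is clear and highly extensible, so it adapts to a wide range of requirements.
Scrapy's workflow involves the following main components:
§ Scrapy Engine: controls the data flow through the whole system and triggers events; it is the core of the framework.
§ Scheduler: receives requests from the engine, adds them to a queue, and hands them back when the engine asks for the next request. It can be thought of as taking the next request URL from a URL queue while filtering out duplicate URLs.
§ Downloader: downloads web resources from the network.
§ Spiders: extract the required information from the specified pages.
§ Item Pipeline: processes the scraped data, for example cleaning, validating, and saving it.
§ Downloader Middlewares: sit between the Scrapy engine and the downloader and process the requests and responses passing between them.
§ Spider Middlewares: sit between the spiders and the engine and process the responses going into the spiders and the requests coming out of them.
§ Scheduler Middlewares: sit between the engine and the scheduler and process the requests and responses passing between them. This component appears mainly in older descriptions of Scrapy; the current official architecture overview lists only the downloader and spider middlewares.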
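The scheduler's two jobs described in the list above (queueing requests and filtering duplicates) can be pictured with a small stand-alone sketch. This is a toy model written only for illustration: the ToyScheduler class and its method names are hypothetical and are not Scrapy's actual scheduler implementation.

```python
from collections import deque


class ToyScheduler:
    """Toy FIFO scheduler that drops URLs it has already seen (illustrative only)."""

    def __init__(self):
        self._queue = deque()   # pending request URLs
        self._seen = set()      # every URL ever enqueued

    def enqueue(self, url):
        """Queue a URL; return False if it is a duplicate."""
        if url in self._seen:
            return False
        self._seen.add(url)
        self._queue.append(url)
        return True

    def next_request(self):
        """Return the next URL, or None when the queue is empty."""
        return self._queue.popleft() if self._queue else None


scheduler = ToyScheduler()
for url in ["http://quotes.toscrape.com/page/1/",
            "http://quotes.toscrape.com/page/2/",
            "http://quotes.toscrape.com/page/1/"]:   # duplicate on purpose
    scheduler.enqueue(url)

handed_out = []
while (url := scheduler.next_request()) is not None:
    handed_out.append(url)
print(handed_out)
```

Only two URLs are handed out, because the repeated page 1 address is rejected when it is enqueued a second time.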
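2. Setting up Scrapy
My environment is macOS with PyCharm as the IDE. In a terminal, run the command "pip install scrapy":
liuxiaowei@MacBookAir spiders % pip install scrapy
Note
Installing Scrapy also installs its dependencies, including the lxml and pyOpenSSL modules, into the Python environment.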
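3. Basic usage of Scrapy
3.1 Creating a Scrapy project
Create a folder for the project at any path you like. For example, open a command-line window in "/Users/liuxiaowei/PycharmProjects/爬虫练习/Scrapy爬虫框架" and run "scrapy startproject scrapyDemo" to create a project named "scrapyDemo", as shown below: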
(venv) liuxiaowei@MacBookAir Scrapy爬虫框架 % scrapy startproject scrapyDemo
New Scrapy project 'scrapyDemo', using template directory '/Users/liuxiaowei/PycharmProjects/爬虫练习/venv/lib/python3.9/site-packages/scrapy/templates/project', created in:
/Users/liuxiaowei/PycharmProjects/爬虫练习/Scrapy爬虫框架/scrapyDemo
You can start your first spider with:
cd scrapyDemo
scrapy genspider example example.com
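Open the newly created scrapyDemo project. The project tree in the left-hand panel shows the directory structure in the figure below:
The files in the directory structure are as follows:
§ spiders (folder): holds the spider files in which the crawling rules are written.
§ __init__.py: the package initialization file.
§ items.py: defines the data structures (Items) that hold the processed data.
§ middlewares.py: defines the middlewares used while crawling, including the SpiderMiddleware and the DownloaderMiddleware.
§ pipelines.py: implements data cleaning, validation, and storage.
§ settings.py: the configuration file for the whole framework, covering crawler information, request headers, middlewares, and so on.
§ scrapy.cfg: the project deployment file, which records the path of the project's settings module and related information.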
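To make the role of pipelines.py more concrete, here is a minimal sketch of a cleaning pipeline. The class name CleanTextPipeline and the cleaning rules are hypothetical; the only part taken from Scrapy itself is the process_item(self, item, spider) method that a pipeline is expected to provide. The sketch is exercised with a plain dict so that it runs without a crawl.

```python
class CleanTextPipeline:
    """Hypothetical pipeline: trims whitespace and drops empty tags."""

    def process_item(self, item, spider):
        # Scrapy calls this method for every item the spider yields
        item['text'] = item['text'].strip()
        item['tags'] = [t.strip() for t in item.get('tags', []) if t.strip()]
        return item   # hand the item on to the next pipeline


# Stand-alone check with a plain dict in place of a scraped item
cleaned = CleanTextPipeline().process_item(
    {'text': '  A sample quote  ', 'tags': ['demo ', '']}, spider=None)
print(cleaned)
```

In a real project a pipeline like this would also have to be registered under the ITEM_PIPELINES setting in settings.py before Scrapy runs it.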
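3.2 Creating a spider
To create a spider, first create a spider module file inside the spiders folder. A spider is a class that crawls data from one or more websites. It must inherit from scrapy.Spider, which provides start_requests() to issue the initial requests and parse() to process the returned responses. The commonly used attributes and methods of scrapy.Spider are:
§ name: a string that names the spider. Scrapy looks spiders up by name, so the name must be unique, although nothing prevents you from creating several instances of the same spider. A spider that crawls a single website is usually named after that site.
§ allowed_domains: the list of domains the spider is allowed to crawl. When OffsiteMiddleware is enabled, URLs whose domain is not in this list are not crawled.
§ start_urls: the initial list of URLs. When no specific URLs are given, the spider starts crawling from this list.
§ custom_settings: a dictionary of settings specific to this spider. Its values override the project-wide settings, and because the settings are updated before the spider is instantiated, it must be defined as a class attribute.
§ settings: a Settings object through which the project's global settings can be read.
§ logger: a Python logger created with the spider's name.
§ start_requests(): generates the initial requests and must return an iterable. By default it builds a Request for each URL in start_urls using the GET method. To request a page with POST instead, override this method and use FormRequest().
§ parse(): the default callback Scrapy uses to process a response when the request does not specify one. It processes the response and returns an iterable of extracted data (Items) and follow-up Requests.
§ closed(): called when the spider closes. It is a shortcut for connecting a handler to the spider_closed signal and is a convenient place to release resources or do other cleanup.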
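3.2.1 Crawling a page and saving it as an HTML file
Taking the page shown in the figure below as an example, this section crawls the page and saves its source as an HTML file in the project folder.
Create a spider file named "crawl.py" in the spiders folder. In that file, define a QuotesSpider class that inherits from scrapy.Spider, override start_requests() to send the network requests, and override parse() to write the returned HTML to a file. Sample code: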
#_*_coding:utf-8_*_
# Author: liuxiaowei
# Created: 2/17/22 11:18 AM
# File: crawl.py
# IDE: PyCharm
# Import the framework
import scrapy


class QuotesSpider(scrapy.Spider):
    # Name of the spider
    name = 'quotes_1'

    def start_requests(self):
        # URLs to crawl
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        # Send one request per URL
        for url in urls:
            # Send the request
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # Extract the page number from the URL
        page = response.url.split('/')[-2]
        # Build the file name from the page number
        filename = 'quotes-%s.html' % page
        # Open the file in binary write mode; it is created if it does not exist
        with open(filename, 'wb') as f:
            # Write the downloaded HTML to the file
            f.write(response.body)
        # Log the name of the saved file
        self.log('Saved file %s' % filename)
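The file name in parse() depends on how the URL splits on "/". The stand-alone sketch below repeats that one step outside Scrapy; the helper name page_filename is hypothetical and exists only for this illustration.

```python
def page_filename(url):
    """Build the output file name the same way parse() does."""
    # For '.../page/1/' the split yields [..., 'page', '1', ''], so [-2] is '1'
    page = url.split('/')[-2]
    return 'quotes-%s.html' % page


for url in ('http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/'):
    print(url, '->', page_filename(url))
```

The approach relies on the trailing slash: for a URL such as 'http://quotes.toscrape.com/page/1' the second-to-last segment would be 'page' rather than the page number.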
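To run a spider in a Scrapy project, enter "scrapy crawl quotes_1" in a command window, where "quotes_1" is the name you gave the spider. Since I use PyCharm, I enter the command in the Terminal panel at the bottom of the IDE. The output after the run is shown below:
liuxiaowei@MacBookAir spiders % scrapy crawl quotes_1 # command that runs the spider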
2022-02-17 11:23:47 [scrapy.utils.log] INFO: Scrapy 2.5.1 started (bot: scrapyDemo)
2022-02-17 11:23:47 [scrapy.utils.log] INFO: Versions: lxml 4.7.1.0, libxml2 2.9.12, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 21.7.0, Python 3.9.9 (v3.9.9:ccb0e6a345, Nov 15 2021, 13:29:20) - [Clang 6.0 (clang-600.0.57)], pyOpenSSL 21.0.0 (OpenSSL 1.1.1m 14 Dec 2021), cryptography 36.0.1, Platform macOS-10.16-x86_64-i386-64bit
2022-02-17 11:23:47 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2022-02-17 11:23:47 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'scrapyDemo',
'NEWSPIDER_MODULE': 'scrapyDemo.spiders',
'ROBOTSTXT_OBEY': True,
................. # middle of the output omitted
2022-02-17 11:23:49 [quotes_1] DEBUG: Saved file quotes-1.html
2022-02-17 11:23:49 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/2/> (referer: None)
2022-02-17 11:23:49 [quotes_1] DEBUG: Saved file quotes-2.html
2022-02-17 11:23:49 [scrapy.core.engine] INFO: Closing spider (finished)
2022-02-17 11:23:49 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
............ # middle of the output omitted
2022-02-17 11:23:49 [scrapy.core.engine] INFO: Spider closed (finished)
3.2.2 Sending a POST request with FormRequest()
Sample code:
#_*_coding:utf-8_*_
# Author: liuxiaowei
# Created: 2/17/22 12:43 PM
# File: POST请求.py
# IDE: PyCharm
# Import the framework
import scrapy
# Import the json module
import json


class QuotesSpider(scrapy.Spider):
    name = "quotes_2"
    # Form parameters as a dictionary
    data = {'1':'能力是有限的, 而努力是无限的。',
            '2':'星光不问赶路人, 时光不负有心人。'}

    def start_requests(self):
        return [scrapy.FormRequest('http://httpbin.org/post', formdata=self.data, callback=self.parse)]

    # Handle the response
    def parse(self, response):
        # Convert the response body to a dictionary
        response_dict = json.loads(response.text)
        # Print the converted response data
        print(response_dict)
The output is as follows:
liuxiaowei@MacBookAir spiders % scrapy crawl quotes_2 # command that runs the spider
2022-02-17 12:53:01 [scrapy.utils.log] INFO: Scrapy 2.5.1 started (bot: scrapyDemo)
2022-02-17 12:53:01 [scrapy.utils.log] INFO: Versions: lxml 4.7.1.0, libxml2 2.9.12, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 21.7.0, Python 3.9.9 (v3.9.9:ccb0e6a345, Nov 15 2021, 13:29:20) - [Clang 6.0 (clang-600.0.57)], pyOpenSSL 21.0.0 (OpenSSL 1.1.1m 14 Dec 2021), cryptography 36.0.1, Platform macOS-10.16-x86_64-i386-64bit
2022-02-17 12:53:01 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2022-02-17 12:53:01 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'scrapyDemo',
'NEWSPIDER_MODULE': 'scrapyDemo.spiders',
'ROBOTSTXT_OBEY': True,
'SPIDER_MODULES': ['scrapyDemo.spiders']}
2022-02-17 12:53:01 [scrapy.extensions.telnet] INFO: Telnet Password: 6965cfb5ccb132d6
2022-02-17 12:53:01 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.logstats.LogStats']
2022-02-17 12:53:01 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2022-02-17 12:53:01 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2022-02-17 12:53:01 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2022-02-17 12:53:01 [scrapy.core.engine] INFO: Spider opened
2022-02-17 12:53:01 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-02-17 12:53:01 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2022-02-17 12:53:01 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://httpbin.org/robots.txt> (referer: None)
2022-02-17 12:53:02 [scrapy.core.engine] DEBUG: Crawled (200) <POST http://httpbin.org/post> (referer: None)
{'args': {}, 'data': '', 'files': {}, 'form': {'1': '能力是有限的, 而努力是无限的。', '2': '星光不问赶路人, 时光不负有心人。'}, 'headers': {'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8', 'Accept-Encoding': 'gzip, deflate', 'Accept-Language': 'en', 'Content-Length': '286', 'Content-Type': 'application/x-www-form-urlencoded', 'Host': 'httpbin.org', 'User-Agent': 'Scrapy/2.5.1 (+https://scrapy.org)', 'X-Amzn-Trace-Id': 'Root=1-620dd4ae-3eaa8de12c3f3606567f0039'}, 'json': None, 'origin': '122.143.185.159', 'url': 'http://httpbin.org/post'}
2022-02-17 12:53:02 [scrapy.core.engine] INFO: Closing spider (finished)
2022-02-17 12:53:02 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 772,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 1,
'downloader/request_method_count/POST': 1,
'downloader/response_bytes': 1214,
'downloader/response_count': 2,
'downloader/response_status_count/200': 2,
'elapsed_time_seconds': 1.026007,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2022, 2, 17, 4, 53, 2, 396943),
'log_count/DEBUG': 2,
'log_count/INFO': 10,
'memusage/max': 54693888,
'memusage/startup': 54693888,
'response_received_count': 2,
'robotstxt/request_count': 1,
'robotstxt/response_count': 1,
'robotstxt/response_status_count/200': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2022, 2, 17, 4, 53, 1, 370936)}
2022-02-17 12:53:02 [scrapy.core.engine] INFO: Spider closed (finished)
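Note
Besides entering "scrapy crawl quotes_2" in a command window, a spider can also be started from within a program through an API that Scrapy provides: the CrawlerProcess class. Pass the project's settings when creating the CrawlerProcess object, pass the spider name to its crawl() method, and then call start() to launch the crawl. The code is as follows:
# Import the CrawlerProcess class
from scrapy.crawler import CrawlerProcess
# Import the helper that loads the project settings
from scrapy.utils.project import get_project_settings

# Program entry point
if __name__ == "__main__":
    # Create the CrawlerProcess object, passing in the project settings
    process = CrawlerProcess(get_project_settings())
    # Name of the spider to start
    process.crawl('quotes_2')
    # Start crawling
    process.start()
The output is as follows: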
/Users/liuxiaowei/PycharmProjects/爬虫练习/venv/bin/python /Users/liuxiaowei/PycharmProjects/爬虫练习/Scrapy爬虫框架/scrapyDemo/scrapyDemo/spiders/POST请求.py
2022-02-17 13:02:16 [scrapy.utils.log] INFO: Scrapy 2.5.1 started (bot: scrapyDemo)
2022-02-17 13:02:16 [scrapy.utils.log] INFO: Versions: lxml 4.7.1.0, libxml2 2.9.12, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 21.7.0, Python 3.9.9 (v3.9.9:ccb0e6a345, Nov 15 2021, 13:29:20) - [Clang 6.0 (clang-600.0.57)], pyOpenSSL 21.0.0 (OpenSSL 1.1.1m 14 Dec 2021), cryptography 36.0.1, Platform macOS-10.16-x86_64-i386-64bit
2022-02-17 13:02:16 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2022-02-17 13:02:16 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'scrapyDemo',
'NEWSPIDER_MODULE': 'scrapyDemo.spiders',
'ROBOTSTXT_OBEY': True,
'SPIDER_MODULES': ['scrapyDemo.spiders']}
2022-02-17 13:02:16 [scrapy.extensions.telnet] INFO: Telnet Password: 7aa61c26ffb3372a
2022-02-17 13:02:16 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.logstats.LogStats']
2022-02-17 13:02:16 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2022-02-17 13:02:16 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2022-02-17 13:02:16 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2022-02-17 13:02:16 [scrapy.core.engine] INFO: Spider opened
2022-02-17 13:02:16 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-02-17 13:02:16 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2022-02-17 13:02:17 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://httpbin.org/robots.txt> (referer: None)
2022-02-17 13:02:17 [scrapy.core.engine] DEBUG: Crawled (200) <POST http://httpbin.org/post> (referer: None)
{'args': {}, 'data': '', 'files': {}, 'form': {'1': '能力是有限的, 而努力是无限的。', '2': '星光不问赶路人, 时光不负有心人。'}, 'headers': {'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8', 'Accept-Encoding': 'gzip, deflate', 'Accept-Language': 'en', 'Content-Length': '286', 'Content-Type': 'application/x-www-form-urlencoded', 'Host': 'httpbin.org', 'User-Agent': 'Scrapy/2.5.1 (+https://scrapy.org)', 'X-Amzn-Trace-Id': 'Root=1-620dd6d9-1e241f7e7f705c1172c103b5'}, 'json': None, 'origin': '122.143.185.159', 'url': 'http://httpbin.org/post'}
2022-02-17 13:02:17 [scrapy.core.engine] INFO: Closing spider (finished)
2022-02-17 13:02:17 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 772,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 1,
'downloader/request_method_count/POST': 1,
'downloader/response_bytes': 1214,
'downloader/response_count': 2,
'downloader/response_status_count/200': 2,
'elapsed_time_seconds': 1.030657,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2022, 2, 17, 5, 2, 17, 995487),
'log_count/DEBUG': 2,
'log_count/INFO': 10,
'memusage/max': 48164864,
'memusage/startup': 48164864,
'response_received_count': 2,
'robotstxt/request_count': 1,
'robotstxt/response_count': 1,
'robotstxt/response_status_count/200': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2022, 2, 17, 5, 2, 16, 964830)}
2022-02-17 13:02:17 [scrapy.core.engine] INFO: Spider closed (finished)
Process finished with exit code 0
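Caution
If running a Scrapy spider fails with "SyntaxError: invalid syntax", the likely cause is that Python 3.7 made "async" a reserved keyword while the installed Twisted still uses it as an identifier. One workaround is to open Python3.7/Lib/site-packages/twisted/conch/manhole.py and rename every "async" identifier in that file to a name that is not a keyword, such as "async_". Upgrading Twisted to a newer release should also resolve the problem without editing library files.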
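3.3 Extracting data
Scrapy can select parts of an HTML document with CSS or XPath expressions and extract the corresponding data. CSS (Cascading Style Sheets) is the language that controls the layout, fonts, colors, backgrounds, and other presentation of HTML pages; Scrapy uses its selector syntax to locate elements. XPath is a language for finding information in an XML document by element and attribute.
3.3.1 Extracting data with CSS
When extracting data with CSS, you can select elements by HTML tag name. For example, to get the title tag of the sample page used earlier:
response.css('title').extract()
Sample code: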
#_*_coding:utf-8_*_
# Author: liuxiaowei
# Created: 2/17/22 2:18 PM
# File: css提取数据.py
# IDE: PyCharm
# Import the framework
import scrapy


class QuotesSpider(scrapy.Spider):
    # Name of the spider
    name = 'quotes_3'

    def start_requests(self):
        # URLs to crawl
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        # Send one request per URL
        for url in urls:
            # Send the request
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # Get the title tag
        title = response.css('title').extract()
        # Print the title
        print(title)
The output is as follows:
liuxiaowei@MacBookAir spiders % scrapy crawl quotes_3
2022-02-17 14:25:03 [scrapy.utils.log] INFO: Scrapy 2.5.1 started (bot: scrapyDemo)
2022-02-17 14:25:03 [scrapy.utils.log] INFO: Versions: lxml 4.7.1.0, libxml2 2.9.12, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 21.7.0, Python 3.9.9 (v3.9.9:ccb0e6a345, Nov 15 2021, 13:29:20) - [Clang 6.0 (clang-600.0.57)], pyOpenSSL 21.0.0 (OpenSSL 1.1.1m 14 Dec 2021), cryptography 36.0.1, Platform macOS-10.16-x86_64-i386-64bit
2022-02-17 14:25:03 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2022-02-17 14:25:04 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/1/> (referer: None)
['<title>Quotes to Scrape</title>']
2022-02-17 14:25:04 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/2/> (referer: None)
['<title>Quotes to Scrape</title>']
2022-02-17 14:25:05 [scrapy.core.engine] INFO: Spider closed (finished)
Note
A CSS query returns a list of the nodes that match the expression, so to extract just the text inside the tag you can use:
response.css('title::text').extract_first()
or
response.css('title::text')[0].extract()
The output is as follows:
2022-02-17 14:32:10 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/2/> (referer: None)
['<title>Quotes to Scrape</title>']
Quotes to Scrape
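3.3.2 Extracting data with XPath
When extracting data with XPath, the expression must follow XPath syntax. For example, to get the text of the same title tag:
response.xpath('//title/text()').extract_first()
The following example uses XPath to extract several pieces of information from the test page above: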
#_*_coding:utf-8_*_
# Author: liuxiaowei
# Created: 2/17/22 2:37 PM
# File: crawl_Xpath.py
# IDE: PyCharm
# Import the framework
import scrapy


class QuotesSpider(scrapy.Spider):
    # Name of the spider
    name = 'quotes'

    def start_requests(self):
        # URLs to crawl
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        # Send one request per URL
        for url in urls:
            # Send the request
            yield scrapy.Request(url=url, callback=self.parse)

    # Handle the response
    def parse(self, response):
        # Select every quote block
        for quote in response.xpath(".//*[@class='quote']"):
            # Text of the quote
            text = quote.xpath('.//*[@class="text"]/text()').extract_first()
            # Author
            author = quote.xpath('.//*[@class="author"]/text()').extract_first()
            # Tags
            tags = quote.xpath('.//*[@class="tag"]/text()').extract()
            # Print the information as a dictionary
            print(dict(text=text, author=author, tags=tags))
The output is as follows:
liuxiaowei@MacBookAir spiders % scrapy crawl quotes
2022-02-17 14:38:57 [scrapy.utils.log] INFO: Scrapy 2.5.1 started (bot: scrapyDemo)
2022-02-17 14:38:57 [scrapy.utils.log] INFO: Versions: lxml 4.7.1.0, libxml2 2.9.12, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 21.7.0, Python 3.9.9 (v3.9.9:ccb0e6a345, Nov 15 2021, 13:29:20) - [Clang 6.0 (clang-600.0.57)], pyOpenSSL 21.0.0 (OpenSSL 1.1.1m 14 Dec 2021), cryptography 36.0.1, Platform macOS-10.16-x86_64-i386-64bit
2022-02-17 14:38:57 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2022-02-17 14:38:58 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://quotes.toscrape.com/robots.txt> (referer: None)
2022-02-17 14:38:59 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/1/> (referer: None)
{'text': '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”', 'author': 'Albert Einstein', 'tags': ['change', 'deep-thoughts', 'thinking', 'world']}
{'text': '“It is our choices, Harry, that show what we truly are, far more than our abilities.”', 'author': 'J.K. Rowling', 'tags': ['abilities', 'choices']}
{'text': '“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”', 'author': 'Albert Einstein', 'tags': ['inspirational', 'life', 'live', 'miracle', 'miracles']}
{'text': '“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”', 'author': 'Jane Austen', 'tags': ['aliteracy', 'books', 'classic', 'humor']}
{'text': "“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”", 'author': 'Marilyn Monroe', 'tags': ['be-yourself', 'inspirational']}
{'text': '“Try not to become a man of success. Rather become a man of value.”', 'author': 'Albert Einstein', 'tags': ['adulthood', 'success', 'value']}
{'text': '“It is better to be hated for what you are than to be loved for what you are not.”', 'author': 'André Gide', 'tags': ['life', 'love']}
{'text': "“I have not failed. I've just found 10,000 ways that won't work.”", 'author': 'Thomas A. Edison', 'tags': ['edison', 'failure', 'inspirational', 'paraphrased']}
{'text': "“A woman is like a tea bag; you never know how strong it is until it's in hot water.”", 'author': 'Eleanor Roosevelt', 'tags': ['misattributed-eleanor-roosevelt']}
{'text': '“A day without sunshine is like, you know, night.”', 'author': 'Steve Martin', 'tags': ['humor', 'obvious', 'simile']}
2022-02-17 14:38:59 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/2/> (referer: None)
{'text': "“This life is what you make it. No matter what, you're going to mess up sometimes, it's a universal truth. But the good part is you get to decide how you're going to mess it up. Girls will be your friends - they'll act like it anyway. But just remember, some come, some go. The ones that stay with you through everything - they're your true best friends. Don't let go of them. Also remember, sisters make the best friends in the world. As for lovers, well, they'll come and go too. And baby, I hate to say it, most of them - actually pretty much all of them are going to break your heart, but you can't give up because if you give up, you'll never find your soulmate. You'll never find that half who makes you whole and that goes for everything. Just because you fail once, doesn't mean you're gonna fail at everything. Keep trying, hold on, and always, always, always believe in yourself, because if you don't, then who will, sweetie? So keep your head high, keep your chin up, and most importantly, keep smiling, because life's a beautiful thing and there's so much to smile about.”", 'author': 'Marilyn Monroe', 'tags': ['friends', 'heartbreak', 'inspirational', 'life', 'love', 'sisters']}
{'text': '“It takes a great deal of bravery to stand up to our enemies, but just as much to stand up to our friends.”', 'author': 'J.K. Rowling', 'tags': ['courage', 'friends']}
{'text': "“If you can't explain it to a six year old, you don't understand it yourself.”", 'author': 'Albert Einstein', 'tags': ['simplicity', 'understand']}
{'text': "“You may not be her first, her last, or her only. She loved before she may love again. But if she loves you now, what else matters? She's not perfect—you aren't either, and the two of you may never be perfect together but if she can make you laugh, cause you to think twice, and admit to being human and making mistakes, hold onto her and give her the most you can. She may not be thinking about you every second of the day, but she will give you a part of her that she knows you can break—her heart. So don't hurt her, don't change her, don't analyze and don't expect more than she can give. Smile when she makes you happy, let her know when she makes you mad, and miss her when she's not there.”", 'author': 'Bob Marley', 'tags': ['love']}
{'text': '“I like nonsense, it wakes up the brain cells. Fantasy is a necessary ingredient in living.”', 'author': 'Dr. Seuss', 'tags': ['fantasy']}
{'text': '“I may not have gone where I intended to go, but I think I have ended up where I needed to be.”', 'author': 'Douglas Adams', 'tags': ['life', 'navigation']}
{'text': "“The opposite of love is not hate, it's indifference. The opposite of art is not ugliness, it's indifference. The opposite of faith is not heresy, it's indifference. And the opposite of life is not death, it's indifference.”", 'author': 'Elie Wiesel', 'tags': ['activism', 'apathy', 'hate', 'indifference', 'inspirational', 'love', 'opposite', 'philosophy']}
{'text': '“It is not a lack of love, but a lack of friendship that makes unhappy marriages.”', 'author': 'Friedrich Nietzsche', 'tags': ['friendship', 'lack-of-friendship', 'lack-of-love', 'love', 'marriage', 'unhappy-marriage']}
{'text': '“Good friends, good books, and a sleepy conscience: this is the ideal life.”', 'author': 'Mark Twain', 'tags': ['books', 'contentment', 'friends', 'friendship', 'life']}
{'text': '“Life is what happens to us while we are making other plans.”', 'author': 'Allen Saunders', 'tags': ['fate', 'life', 'misattributed-john-lennon', 'planning', 'plans']}
2022-02-17 14:38:59 [scrapy.core.engine] INFO: Closing spider (finished)
2022-02-17 14:38:59 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
2022-02-17 14:38:59 [scrapy.core.engine] INFO: Spider closed (finished)
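Note
Scrapy selectors also provide a .re() method for extracting data with regular expressions. It can be chained directly, as in response.xpath().re(), with the regular expression passed as the argument to .re().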
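3.3.3 Extracting data across pages
To collect information from the whole site rather than from a single page, the spider has to follow the pagination links. For example, the following code collects the author names from every page of the test site used in the previous section: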
#_*_coding:utf-8_*_
# Author: liuxiaowei
# Created: 2/17/22 2:57 PM
# File: 翻页提取数据.py
# IDE: PyCharm
# Import the framework
import scrapy


class QuotesSpider(scrapy.Spider):
    # Name of the spider
    name = 'quotes_4'

    def start_requests(self):
        # URLs to crawl
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        # Send one request per URL
        for url in urls:
            # Send the request
            yield scrapy.Request(url=url, callback=self.parse)

    # Handle the response
    def parse(self, response):
        # Each quote sits in a div.quote element
        # Select every quote block
        for quote in response.xpath('.//*[@class="quote"]'):
            # Author
            author = quote.xpath('.//*[@class="author"]/text()').extract_first()
            # Print the author name
            print(author)
        # Follow the "next page" link, if there is one
        for href in response.css('li.next a::attr(href)'):
            yield response.follow(href, self.parse)
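The pagination step works even though the link in the page is relative, because response.follow() resolves it against the URL of the current page. The standard-library sketch below shows that resolution rule on its own; the href value '/page/2/' is an assumption about what the "next" link on page 1 contains.

```python
from urllib.parse import urljoin

current_page = 'http://quotes.toscrape.com/page/1/'
next_href = '/page/2/'   # assumed value of the "li.next a" href on page 1

# Relative links are resolved against the URL of the page they came from
next_url = urljoin(current_page, next_href)
print(next_url)
```

Since both start URLs lead to the same chain of "next" links, the same pages are requested more than once; the scheduler's duplicate filtering keeps them from being downloaded twice.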
The program output is as follows:
# Author names on page 1
2022-02-17 15:00:49 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/1/> (referer: None)
Albert Einstein
J.K. Rowling
Albert Einstein
Jane Austen
Marilyn Monroe
Albert Einstein
André Gide
Thomas A. Edison
Eleanor Roosevelt
Steve Martin
# Some of the author names on page 10
2022-02-17 15:03:54 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/10/> (referer: http://quotes.toscrape.com/page/9/)
J.K. Rowling
Jimi Hendrix
J.M. Barrie
E.E. Cummings
Khaled Hosseini
Harper Lee
Madeleine L'Engle
Mark Twain
Dr. Seuss
George R.R. Martin
2022-02-17 15:03:54 [scrapy.core.engine] INFO: Closing spider (finished)
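3.3.4 Creating Items
Crawling web pages is the process of extracting structured data from an unstructured source. For example, the parse() method of QuotesSpider already extracts the text, author, and tags of each quote. To wrap that information as structured data, use the Item class that Scrapy provides. An Item object is a simple container for scraped data. It offers a dictionary-like API together with a convenient syntax for declaring its available fields: an Item is declared with ordinary class-definition syntax and Field objects. When the scrapyDemo project was created, an items.py file was generated in the project directory for defining the Item classes that store the data; each such class must inherit from scrapy.Item. Sample code: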
import scrapy


class ScrapydemoItem(scrapy.Item):
    # define the fields for your item here like:
    # Text of the quote
    text = scrapy.Field()
    # Author
    author = scrapy.Field()
    # Tags
    tags = scrapy.Field()
The spider below wraps the extracted data in ScrapydemoItem objects:
#_*_coding:utf-8_*_
# Author: liuxiaowei
# Created: 2/17/22 3:22 PM
# File: 包装结构化数据.py
# IDE: PyCharm
import scrapy  # Import the framework
from scrapyDemo.items import ScrapydemoItem  # Import the ScrapydemoItem class


class QuotesSpider(scrapy.Spider):
    name = "quotes_5"  # Name of the spider

    def start_requests(self):
        # URLs to crawl
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        # Send one request per URL
        for url in urls:
            # Send the request
            yield scrapy.Request(url=url, callback=self.parse)

    # Handle the response
    def parse(self, response):
        # Select every quote block
        for quote in response.xpath(".//*[@class='quote']"):
            # Text of the quote
            text = quote.xpath(".//*[@class='text']/text()").extract_first()
            # Author
            author = quote.xpath(".//*[@class='author']/text()").extract_first()
            # Tags
            tags = quote.xpath(".//*[@class='tag']/text()").extract()
            # Create the Item object
            item = ScrapydemoItem(text=text, author=author, tags=tags)
            yield item  # Output the item
The program output is as follows:
liuxiaowei@MacBookAir spiders % scrapy crawl quotes_5
2022-02-17 15:30:04 [scrapy.utils.log] INFO: Scrapy 2.5.1 started (bot: scrapyDemo)
2022-02-17 15:30:04 [scrapy.utils.log] INFO: Versions: lxml 4.7.1.0, libxml2 2.9.12, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 21.7.0, Python 3.9.9 (v3.9.9:ccb0e6a345, Nov 15 2021, 13:29:20) - [Clang 6.0 (clang-600.0.57)], pyOpenSSL 21.0.0 (OpenSSL 1.1.1m 14 Dec 2021), cryptography 36.0.1, Platform macOS-10.16-x86_64-i386-64bit
2022-02-17 15:30:04 [scrapy.core.engine] INFO: Spider opened
2022-02-17 15:30:04 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-02-17 15:30:04 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2022-02-17 15:30:05 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://quotes.toscrape.com/robots.txt> (referer: None)
2022-02-17 15:30:05 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/1/> (referer: None)
2022-02-17 15:30:05 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/>
{'author': 'Albert Einstein',
'tags': ['change', 'deep-thoughts', 'thinking', 'world'],
'text': '“The world as we have created it is a process of our thinking. It '
'cannot be changed without changing our thinking.”'}
2022-02-17 15:30:05 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/>
{'author': 'J.K. Rowling',
'tags': ['abilities', 'choices'],
'text': '“It is our choices, Harry, that show what we truly are, far more '
'than our abilities.”'}
2022-02-17 15:30:05 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/>
{'author': 'Albert Einstein',
'tags': ['inspirational', 'life', 'live', 'miracle', 'miracles'],
'text': '“There are only two ways to live your life. One is as though nothing '
'is a miracle. The other is as though everything is a miracle.”'}
2022-02-17 15:30:05 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/>
{'author': 'Jane Austen',
'tags': ['aliteracy', 'books', 'classic', 'humor'],
'text': '“The person, be it gentleman or lady, who has not pleasure in a good '
'novel, must be intolerably stupid.”'}
2022-02-17 15:30:05 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/>
{'author': 'Marilyn Monroe',
'tags': ['be-yourself', 'inspirational'],
'text': "“Imperfection is beauty, madness is genius and it's better to be "
'absolutely ridiculous than absolutely boring.”'}
2022-02-17 15:30:05 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/>
{'author': 'Albert Einstein',
'tags': ['adulthood', 'success', 'value'],
'text': '“Try not to become a man of success. Rather become a man of value.”'}
2022-02-17 15:30:05 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/>
{'author': 'André Gide',
'tags': ['life', 'love'],
'text': '“It is better to be hated for what you are than to be loved for what '
'you are not.”'}
2022-02-17 15:30:05 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/>
{'author': 'Thomas A. Edison',
'tags': ['edison', 'failure', 'inspirational', 'paraphrased'],
'text': "“I have not failed. I've just found 10,000 ways that won't work.”"}
2022-02-17 15:30:05 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/>
{'author': 'Eleanor Roosevelt',
'tags': ['misattributed-eleanor-roosevelt'],
'text': '“A woman is like a tea bag; you never know how strong it is until '
"it's in hot water.”"}
2022-02-17 15:30:05 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/>
{'author': 'Steve Martin',
'tags': ['humor', 'obvious', 'simile'],
'text': '“A day without sunshine is like, you know, night.”'}
2022-02-17 15:30:05 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/2/> (referer: None)
2022-02-17 15:30:05 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/2/>
{'author': 'Marilyn Monroe',
 'tags': ['friends', 'heartbreak', 'inspirational', 'life', 'love', 'sisters'],
 'text': "“This life is what you make it. No matter what, you're going to mess "
         "up sometimes, it's a universal truth. But the good part is you get "
         "to decide how you're going to mess it up. Girls will be your friends "
         "- they'll act like it anyway. But just remember, some come, some go. "
         "The ones that stay with you through everything - they're your true "
         "best friends. Don't let go of them. Also remember, sisters make the "
         "best friends in the world. As for lovers, well, they'll come and go "
         'too. And baby, I hate to say it, most of them - actually pretty much '
         "all of them are going to break your heart, but you can't give up "
         "because if you give up, you'll never find your soulmate. You'll "
         'never find that half who makes you whole and that goes for '
         "everything. Just because you fail once, doesn't mean you're gonna "
         'fail at everything. Keep trying, hold on, and always, always, always '
         "believe in yourself, because if you don't, then who will, sweetie? "
         'So keep your head high, keep your chin up, and most importantly, '
         "keep smiling, because life's a beautiful thing and there's so much "
         'to smile about.”'}
2022-02-17 15:30:05 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/2/>
{'author': 'J.K. Rowling',
 'tags': ['courage', 'friends'],
 'text': '“It takes a great deal of bravery to stand up to our enemies, but '
         'just as much to stand up to our friends.”'}
2022-02-17 15:30:05 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/2/>
{'author': 'Albert Einstein',
 'tags': ['simplicity', 'understand'],
 'text': "“If you can't explain it to a six year old, you don't understand it "
         'yourself.”'}
2022-02-17 15:30:05 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/2/>
{'author': 'Bob Marley',
 'tags': ['love'],
 'text': '“You may not be her first, her last, or her only. She loved before '
         'she may love again. But if she loves you now, what else matters? '
         "She's not perfect—you aren't either, and the two of you may never be "
         'perfect together but if she can make you laugh, cause you to think '
         'twice, and admit to being human and making mistakes, hold onto her '
         'and give her the most you can. She may not be thinking about you '
         'every second of the day, but she will give you a part of her that '
         "she knows you can break—her heart. So don't hurt her, don't change "
         "her, don't analyze and don't expect more than she can give. Smile "
         'when she makes you happy, let her know when she makes you mad, and '
         "miss her when she's not there.”"}
2022-02-17 15:30:05 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/2/>
{'author': 'Dr. Seuss',
 'tags': ['fantasy'],
 'text': '“I like nonsense, it wakes up the brain cells. Fantasy is a '
         'necessary ingredient in living.”'}
2022-02-17 15:30:05 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/2/>
{'author': 'Douglas Adams',
 'tags': ['life', 'navigation'],
 'text': '“I may not have gone where I intended to go, but I think I have '
         'ended up where I needed to be.”'}
2022-02-17 15:30:05 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/2/>
{'author': 'Elie Wiesel',
 'tags': ['activism',
          'apathy',
          'hate',
          'indifference',
          'inspirational',
          'love',
          'opposite',
          'philosophy'],
 'text': "“The opposite of love is not hate, it's indifference. The opposite "
         "of art is not ugliness, it's indifference. The opposite of faith is "
         "not heresy, it's indifference. And the opposite of life is not "
         "death, it's indifference.”"}
2022-02-17 15:30:05 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/2/>
{'author': 'Friedrich Nietzsche',
 'tags': ['friendship',
          'lack-of-friendship',
          'lack-of-love',
          'love',
          'marriage',
          'unhappy-marriage'],
 'text': '“It is not a lack of love, but a lack of friendship that makes '
         'unhappy marriages.”'}
2022-02-17 15:30:05 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/2/>
{'author': 'Mark Twain',
 'tags': ['books', 'contentment', 'friends', 'friendship', 'life'],
 'text': '“Good friends, good books, and a sleepy conscience: this is the '
         'ideal life.”'}
2022-02-17 15:30:05 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/2/>
{'author': 'Allen Saunders',
 'tags': ['fate', 'life', 'misattributed-john-lennon', 'planning', 'plans'],
 'text': '“Life is what happens to us while we are making other plans.”'}
2022-02-17 15:30:05 [scrapy.core.engine] INFO: Closing spider (finished)