• Scrapy is an application framework written for crawling websites and extracting structured data. It can be used in a wide range of programs, including data mining, information processing, and storing historical data.

Scrapy Architecture

  • Scrapy's overall architecture is made up of five components, the Scrapy Engine, the Scheduler, the Downloader, the Spiders, and the Item Pipeline, plus two kinds of middleware.
  • Scrapy Engine: the core of the whole system. It controls how data flows between all of the components and triggers events when the corresponding actions occur.
  • Scheduler: manages the queueing of Requests and removes duplicate requests. It receives requests from the Scrapy engine and adds them to the request queue so that they can be handed back to the engine when they are needed later.
  • Downloader: fetches page data and feeds it, through the Scrapy engine, to the spiders.
  • Spiders: classes written by Scrapy users to parse responses and extract items or follow-up URLs. Each spider is responsible for one specific website (or a group of websites).
  • Item Pipeline: processes the items extracted by the spiders. Typical processing includes cleaning, validation, and persistence (a minimal pipeline sketch follows this list).
  • Downloader middleware: a hook between the engine and the downloader. It processes the requests the engine sends to the downloader and the responses the downloader returns, and it extends the downloader's functionality through custom code.
  • Spider middleware: a hook between the engine and the spiders. It processes the spiders' input (responses) and output (items and requests), and it extends the spiders' functionality through custom code.
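
As a concrete illustration of the Item Pipeline component, the following is a minimal, hypothetical pipeline sketch that cleans, validates, and persists items. The class name, the field names (title, price), and the output file name are assumptions made for illustration, not part of the example project.

import json

from scrapy.exceptions import DropItem


class CleanValidatePersistPipeline:
    """Hypothetical pipeline: clean, validate, then persist each item."""

    def open_spider(self, spider):
        # Open the output file once, when the spider starts.
        self.file = open('items.jsonl', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # Clean: normalise whitespace in the title (assumed here to be a single string).
        title = item.get('title')
        if isinstance(title, str):
            item['title'] = title.strip()
        # Validate: discard items that carry no price at all.
        if not item.get('price'):
            raise DropItem('missing price')
        # Persist: append one JSON object per line.
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        return item

    def close_spider(self, spider):
        self.file.close()

A pipeline like this only takes effect after it is listed in the project's ITEM_PIPELINES setting in settings.py.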

The data flow in Scrapy is controlled by the Scrapy engine. The overall process is as follows.

  1) The Scrapy engine opens a website, locates the spider that handles that website, and asks the spider for the first URLs to crawl.
  2) The Scrapy engine obtains the first URLs to crawl from the spider and sends them to the scheduler as Requests.
  3) The Scrapy engine asks the scheduler for the next URL to crawl.
  4) The scheduler returns the next URL to crawl to the Scrapy engine, and the engine forwards it to the downloader through the downloader middleware.
  5) The downloader downloads the given page and, once the download finishes, generates a result for that page and sends it back to the Scrapy engine through the downloader middleware (a middleware sketch follows this list).
  6) The Scrapy engine receives the download result from the downloader and passes it to the spider, through the spider middleware, for processing.
  7) The spider processes the result and returns the scraped items, together with the new URLs to follow, to the Scrapy engine.
  8) The Scrapy engine sends the scraped items to the item pipeline and the new requests generated by the spider to the scheduler.
  9) The process repeats from step 2) until there are no more requests in the scheduler, at which point the Scrapy engine closes the website.
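
To make the two downloader-middleware hops in steps 4) and 5) concrete, here is a minimal, hypothetical middleware sketch. The class name and the header it sets are illustrative assumptions, and the class would only take effect after being registered in the project's DOWNLOADER_MIDDLEWARES setting.

class LoggingDownloaderMiddleware:
    """Hypothetical middleware: not part of Scrapy itself."""

    def process_request(self, request, spider):
        # Called as the engine hands a request to the downloader (step 4).
        request.headers.setdefault('User-Agent', 'tutorial-bot')
        return None  # None lets the request continue through the normal flow

    def process_response(self, request, response, spider):
        # Called as the downloader's result travels back to the engine (step 5).
        spider.logger.debug('%s -> HTTP %s', request.url, response.status)
        return response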

Installing Scrapy

Environment: CentOS 7, Python 3.6.6

Install Python 3.6.6
  • As the root user, run the following commands to install the build dependencies
[root@localhost ~]# yum -y groupinstall "Development tools"
[root@localhost ~]#  yum -y install zlib-devel bzip2-devel openssl-devel ncurses-devel sqlite-devel readline-devel tk-devel gdbm-devel db4-devel libpcap-devel xz-devel
  • Download Python 3.6.6
[root@localhost ~]# wget https://www.python.org/ftp/python/3.6.6/Python-3.6.6.tgz
  • Extract the archive to the /opt/ directory
[root@localhost tarPackage]# tar -zxvf Python-3.6.6.tgz -C /opt/
  • Enter the extracted source directory, then configure, compile, and install
[root@localhost ~]# cd /opt/Python-3.6.6/
[root@localhost Python-3.6.6]# ./configure --prefix=/usr/local/python3
[root@localhost Python-3.6.6]# make && make install
  • Create symlinks for python3 and pip3. Note: be sure to run these commands from the root user's home directory (~)
[root@localhost ~]# ln -s /usr/local/python3/bin/python3 /usr/bin/python3
[root@localhost ~]# ln -s /usr/local/python3/bin/pip3 /usr/bin/pip3
  • Verify that Python 3 was installed successfully
[root@localhost ~]# python3
Python 3.6.6 (default, Jul 20 2020, 07:34:28) 
[GCC 4.8.5 20150623 (Red Hat 4.8.5-39)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>>
  • Verify that pip3 was installed successfully
[root@localhost ~]# pip3

Usage:   
  pip3 <command> [options]

Commands:
  install                     Install packages.
  download                    Download packages.
  uninstall                   Uninstall packages.
  freeze                      Output installed packages in requirements format.
  list                        List installed packages.
  show                        Show information about installed packages.
  check                       Verify installed packages have compatible dependencies.
  config                      Manage local and global configuration.
  search                      Search PyPI for packages.
  wheel                       Build wheels from your requirements.
  hash                        Compute hashes of package archives.
  completion                  A helper command used for command completion.
  help                        Show help for commands.

General Options:
  -h, --help                  Show help.
  --isolated                  Run pip in an isolated mode, ignoring environment variables and user configuration.
  -v, --verbose               Give more output. Option is additive, and can be used up to 3 times.
  -V, --version               Show version and exit.
  -q, --quiet                 Give less output. Option is additive, and can be used up to 3 times (corresponding
                              to WARNING, ERROR, and CRITICAL logging levels).
  --log <path>                Path to a verbose appending log.
  --proxy <proxy>             Specify a proxy in the form [user:passwd@]proxy.server:port.
  --retries <retries>         Maximum number of retries each connection should attempt (default 5 times).
  --timeout <sec>             Set the socket timeout (default 15 seconds).
  --exists-action <action>    Default action when a path already exists: (s)witch, (i)gnore, (w)ipe, (b)ackup,
                              (a)bort).
  --trusted-host <hostname>   Mark this host as trusted, even though it does not have valid or any HTTPS.
  --cert <path>               Path to alternate CA bundle.
  --client-cert <path>        Path to SSL client certificate, a single file containing the private key and the
                              certificate in PEM format.
  --cache-dir <dir>           Store the cache data in <dir>.
  --no-cache-dir              Disable the cache.
  --disable-pip-version-check
                              Don't periodically check PyPI to determine whether a new version of pip is available
                              for download. Implied with --no-index.
  --no-color                  Suppress colored output

  • Install Scrapy
[root@localhost ~]# pip3 install scrapy
  • Verify that the installation succeeded: importing scrapy should raise no error.
[root@localhost ~]# python3
Python 3.6.6 (default, Jul 20 2020, 07:34:28) 
[GCC 4.8.5 20150623 (Red Hat 4.8.5-39)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import scrapy
>>>
  • Create a symlink for the scrapy command
[root@localhost ~]# ln -s /usr/local/python3/bin/scrapy /usr/bin/scrapy
  • Verify Scrapy
[root@localhost ~]# scrapy -v
Scrapy 2.2.1 - no active project

Usage:
  scrapy <command> [options] [args]

Available commands:
  bench         Run quick benchmark test
  commands      
  fetch         Fetch a URL using the Scrapy downloader
  genspider     Generate new spider using pre-defined templates
  runspider     Run a self-contained spider (without creating a project)
  settings      Get settings values
  shell         Interactive scraping console
  startproject  Create new project
  version       Print Scrapy version
  view          Open URL in browser, as seen by Scrapy

  [ more ]      More commands available when run from project directory

Use "scrapy <command> -h" to see more info about a command
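
Among the commands listed above, scrapy shell is a convenient way to try selectors out interactively before writing a spider. A brief sketch (no output shown; the URL is the one used in the example that follows):

scrapy shell "http://search.dangdang.com/?key=python&act=click"
>>> response.status                                   # the downloaded page is exposed as `response`
>>> response.xpath('//ul/li/a/text()').extract()[:5]  # try the XPath used later in the spider
>>> exit()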

A Scrapy Application Example

  • If you need to obtain information from a website that offers no API or any other mechanism for fetching its data programmatically, Scrapy can be used to do the job.
  • The task here is to collect the URL, title, description, and price of every book related to "Python核心编程" and "Python基础教程" sold on dangdang.com.
    1) Create the Scrapy project

Create a directory to hold the code

[root@localhost opt]# mkdir -p project/tutorial

Enter the newly created directory

[root@localhost opt]# cd project/tutorial

Create the project

[root@localhost tutorial]# scrapy startproject tutorial
New Scrapy project 'tutorial', using template directory '/usr/local/python3/lib/python3.6/site-packages/scrapy/templates/project', created in:
    /opt/project/tutorial/tutorial

You can start your first spider with:
    cd tutorial
    scrapy genspider example example.com

Use the tree command to view the structure of the tutorial project

[root@localhost tutorial]# tree tutorial/
tutorial/
+-- scrapy.cfg
+-- tutorial
    +-- __init__.py
    +-- items.py
    +-- middlewares.py
    +-- pipelines.py
    +-- settings.py
    +-- spiders
        +-- __init__.py

2 directories, 7 files

2) Define the Item

  • In Scrapy, an Item is the container that holds scraped data. It is used much like a Python dictionary, but it adds protection against undefined-field errors caused by typos (a short interactive check follows the file listing below). In general, an Item is created from the scrapy.item.Item class, and its attributes are defined with scrapy.item.Field objects.
  • Edit the items.py file in the tutorial/ directory
[root@localhost tutorial]# vim items.py 

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class DangItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    link = scrapy.Field()
    desc = scrapy.Field()
    price = scrapy.Field()
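
  • The field protection mentioned above can be checked interactively. A short, hypothetical python3 session, run from the project directory so that tutorial.items is importable:

from tutorial.items import DangItem

item = DangItem(title='Python核心编程', price='59.00')
print(item['title'])       # dict-style access works for declared fields
item['desc'] = 'sample'    # 'desc' is declared in DangItem, so assignment is allowed
try:
    item['author'] = 'x'   # 'author' was never declared as a Field
except KeyError as err:
    print('undeclared field rejected:', err)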

3) Write the Spider

  • A Spider is a user-written class for scraping data from a single website (or a group of websites). It contains the initial URLs to download, a method for following the links found in the pages, a method for parsing page content, and a method for extracting the resulting Items.
  • A Spider must subclass scrapy.Spider and define the following three attributes:
    name – the Spider's name. It must be unique; different Spiders may not be given the same name.
    start_urls – the list of URLs the Spider crawls when it starts. The first pages fetched come from this list; subsequent URLs are extracted from the data obtained from these initial URLs.
    parse() – the method that parses the downloaded data. When it is called, the Response object generated for each initial URL is passed to it as its only argument. The method is responsible for parsing the response, extracting the data into Items, and generating Request objects for URLs that need further processing.
  • Write the Spider code
[root@localhost spiders]# vim dang_spider.py
import scrapy

class DangSpider(scrapy.Spider):
    name = "dangdang"
    allowed_domains = ["dangdang.com"]
    start_urls = [
        "http://search.dangdang.com/?key=python&act=click",
        "http://search.dangdang.com/?key=java&act=click"
    ]

    def parse(self, response):
        filename = response.url.split("/")[-2]
        with open(filename, 'wb') as f:
            f.write(response.body)
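
  • For both of these start URLs, response.url.split("/")[-2] evaluates to search.dangdang.com, so each response body is written to a file with that name (whichever response arrives second overwrites the first); this is the search.dangdang.com file that appears in the directory listing below.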

4) Crawl

Enter the project's root directory and start the Spider.

Enter the tutorial/ directory

[root@localhost tutorial]# ll
total 364
-rw-r--r--. 1 root root      0 Jul 20 08:52 __init__.py
-rw-r--r--. 1 root root    361 Jul 20 09:26 items.py
-rw-r--r--. 1 root root   3652 Jul 20 09:09 middlewares.py
-rw-r--r--. 1 root root    362 Jul 20 09:09 pipelines.py
drwxr-xr-x. 2 root root     68 Jul 20 09:42 __pycache__
-rw-r--r--. 1 root root 354475 Jul 20 09:58 search.dangdang.com
-rw-r--r--. 1 root root   3079 Jul 20 09:09 settings.py
drwxr-xr-x. 3 root root     66 Jul 20 09:46 spiders

Start the Spider

[root@localhost tutorial]# scrapy crawl dangdang
2020-07-20 10:07:26 [scrapy.utils.log] INFO: Scrapy 2.2.1 started (bot: tutorial)
2020-07-20 10:07:26 [scrapy.utils.log] INFO: Versions: lxml 4.5.2.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.6.6 (default, Jul 20 2020, 07:34:28) - [GCC 4.8.5 20150623 (Red Hat 4.8.5-39)], pyOpenSSL 19.1.0 (OpenSSL 1.1.1g  21 Apr 2020), cryptography 2.9.2, Platform Linux-3.10.0-862.el7.x86_64-x86_64-with-centos-7.5.1804-Core
2020-07-20 10:07:26 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
2020-07-20 10:07:26 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'tutorial',
 'NEWSPIDER_MODULE': 'tutorial.spiders',
 'ROBOTSTXT_OBEY': True,
 'SPIDER_MODULES': ['tutorial.spiders']}
2020-07-20 10:07:26 [scrapy.extensions.telnet] INFO: Telnet Password: 70e29b096d1bc1a3
2020-07-20 10:07:26 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats']
2020-07-20 10:07:26 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-07-20 10:07:26 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2020-07-20 10:07:26 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2020-07-20 10:07:26 [scrapy.core.engine] INFO: Spider opened
2020-07-20 10:07:26 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-07-20 10:07:26 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-07-20 10:07:46 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://search.dangdang.com/robots.txt> (failed 1 times): DNS lookup failed: no results for hostname lookup: search.dangdang.com.
2020-07-20 10:08:06 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://search.dangdang.com/robots.txt> (failed 2 times): DNS lookup failed: no results for hostname lookup: search.dangdang.com.
2020-07-20 10:08:07 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://search.dangdang.com/robots.txt> (referer: None)
2020-07-20 10:08:08 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://search.dangdang.com/?key=java&act=click> (referer: None)
2020-07-20 10:08:09 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://search.dangdang.com/?key=python&act=click> (referer: None)
2020-07-20 10:08:09 [scrapy.core.engine] INFO: Closing spider (finished)
2020-07-20 10:08:09 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 2,
 'downloader/exception_type_count/twisted.internet.error.DNSLookupError': 2,
 'downloader/request_bytes': 1165,
 'downloader/request_count': 5,
 'downloader/request_method_count/GET': 5,
 'downloader/response_bytes': 158214,
 'downloader/response_count': 3,
 'downloader/response_status_count/200': 3,
 'elapsed_time_seconds': 42.498154,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2020, 7, 20, 2, 8, 9, 366555),
 'log_count/DEBUG': 5,
 'log_count/INFO': 10,
 'memusage/max': 46374912,
 'memusage/startup': 46374912,
 'response_received_count': 3,
 'retry/count': 2,
 'retry/reason_count/twisted.internet.error.DNSLookupError': 2,
 'robotstxt/request_count': 1,
 'robotstxt/response_count': 1,
 'robotstxt/response_status_count/200': 1,
 'scheduler/dequeued': 2,
 'scheduler/dequeued/memory': 2,
 'scheduler/enqueued': 2,
 'scheduler/enqueued/memory': 2,
 'start_time': datetime.datetime(2020, 7, 20, 2, 7, 26, 868401)}
2020-07-20 10:08:09 [scrapy.core.engine] INFO: Spider closed (finished)

5) Extract Items

  • Edit the spiders/dang_spider.py file
[root@localhost tutorial]# vim spiders/dang_spider.py 
import scrapy
from scrapy.selector import Selector
from tutorial.items import DangItem

class DangSpider(scrapy.Spider):
    name = "dangdang"
    allowed_domains = ["dangdang.com"]
    start_urls = ['http://search.dangdang.com/?key=python&act=click','http://search.dangdang.com/?key=java&act=click']

    def parse(self, response):
        sel = Selector(response)
        sites = sel.xpath('//ul/li')
        items = []
        for site in sites:
            item = DangItem()
            item['title'] = site.xpath('a/text()').extract()
            item['link'] = site.xpath('a/@href').extract()
            item['desc'] = site.xpath('text()').extract()
            item['price'] = site.xpath('text()').extract()
            items.append(item)
        return items
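
  • The same parsing logic can also be written with the response's built-in selector shortcut, yielding items one at a time instead of collecting them in a list; a minimal sketch using the same XPath expressions (behaviour unchanged):

    def parse(self, response):
        # response.xpath() is a shortcut for Selector(response).xpath()
        for site in response.xpath('//ul/li'):
            item = DangItem()
            item['title'] = site.xpath('a/text()').extract()
            item['link'] = site.xpath('a/@href').extract()
            item['desc'] = site.xpath('text()').extract()
            item['price'] = site.xpath('text()').extract()
            yield item
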
  • Add the -o option to save the results in JSON format
[root@localhost tutorial]# scrapy crawl dangdang -o items.json
  • View the scraped content
[root@localhost tutorial]# cat items.json
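
  • The -o exporter writes the items as a single JSON array, so the result can also be post-processed with the standard library; a minimal sketch (each field value is a list of strings because the spider uses .extract()):

import json

# items.json is the file produced by `scrapy crawl dangdang -o items.json`
with open('items.json', encoding='utf-8') as f:
    items = json.load(f)   # a list of dicts, one per scraped item

for item in items[:5]:
    print(item.get('title'), item.get('link'))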