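I'm no expert in web scraping. When I need information from a page, I handle simple cases with a quick program and use chrome-mini to dig deeper for complex ones. A couple of days ago I attended a meetup on scraping techniques and found the field really interesting, so I came back and set up a Scrapy environment to start learning.
Since I'm a complete beginner, I hit a snag at every step of the setup. I'm recording it here in the hope that it helps others who are new to this.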

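I. Install Python 3.7.9

CentOS 7 ships with Python 2.7.5. I first tried installing Scrapy 2 directly under that version. The installation reported no errors, but running a program failed with a complaint that Python 2 is too old, so I had to install Python 3.

1. Install the development tools group in one go; building software from source requires it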
    yum -y groupinstall "Development tools"

2. Install the dependency packages
    yum -y install zlib-devel bzip2-devel openssl-devel ncurses-devel sqlite-devel readline-devel tk-devel gdbm-devel db4-devel libpcap-devel xz-devel libffi-devel

3. Download the Python 3.7.9 source package
    wget "https://www.python.org/ftp/python/3.7.9/Python-3.7.9.tgz"

4. Extract it
    tar -zxvf Python-3.7.9.tgz

5. Enter the source directory
    cd Python-3.7.9

6. Configure the build
    ./configure --prefix=/usr/local/python3

7. Compile and install
    make && make install

8. Create symlinks
    ln -s /usr/local/python3/bin/python3 /usr/bin/python3
    ln -s /usr/local/python3/bin/pip3 /usr/bin/pip3

9. Verify

[root@seeker-01 ~]# python
     Python 2.7.5 (default, Aug  7 2019, 00:51:29) 
     [GCC 4.8.5 20150623 (Red Hat 4.8.5-39)] on linux2
     Type "help", "copyright", "credits" or "license" for more information.
     >>>
[root@seeker-01 ~]# python3
     Python 3.7.9 (default, Aug 28 2020, 13:28:49) 
     [GCC 4.8.5 20150623 (Red Hat 4.8.5-39)] on linux
     Type "help", "copyright", "credits" or "license" for more information.
     >>>
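
The -devel packages from step 2 matter because a source build of Python can still succeed while leaving out optional standard-library modules whose headers were not found. Here is a minimal sketch for checking this after the install. The module list is my own assumption about which modules correspond to those packages, and find_missing_modules is a hypothetical helper name, not something from Python or Scrapy.

```python
import importlib

# Assumed list: optional stdlib modules that depend on the -devel packages above.
OPTIONAL_MODULES = ["ssl", "zlib", "bz2", "lzma", "sqlite3", "readline", "curses", "ctypes"]


def find_missing_modules(names):
    """Return the module names from `names` that cannot be imported."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    missing = find_missing_modules(OPTIONAL_MODULES)
    print("missing modules:", missing or "none")
```

Save it to a file and run it with python3. If ssl shows up as missing, pip3 will most likely be unable to download packages over HTTPS, so it is worth installing the matching -devel package and rebuilding before moving on.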

II. Install Scrapy 2

1. Install Scrapy 2
    pip3 install scrapy

2. Verify Scrapy 2 in the python3 shell

[root@seeker-01 ~]# python3
     Python 3.7.9 (default, Aug 28 2020, 13:28:49) 
     [GCC 4.8.5 20150623 (Red Hat 4.8.5-39)] on linux
     Type "help", "copyright", "credits" or "license" for more information.
     >>> import scrapy
     >>> scrapy.version_info
     (2, 3, 0)
     >>>

3. Create a symlink for scrapy
    ln -s /usr/local/python3/bin/scrapy  /usr/bin/scrapy

4. Verify Scrapy 2

[root@seeker-01 ~]# scrapy
     Scrapy 2.3.0 - no active project

     Usage:
       scrapy <command> [options] [args]

     Available commands:
       bench         Run quick benchmark test
       commands      
       fetch         Fetch a URL using the Scrapy downloader
       genspider     Generate new spider using pre-defined templates
       runspider     Run a self-contained spider (without creating a project)
       settings      Get settings values
       shell         Interactive scraping console
       startproject  Create new project
       version       Print Scrapy version
       view          Open URL in browser, as seen by Scrapy

       [ more ]      More commands available when run from project directory

     Use "scrapy <command> -h" to see more info about a command
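
Because python3, pip3 and scrapy are all reached through symlinks in /usr/bin, it can help to confirm where each command really points. The following is a rough sketch using only the standard library; resolve_command is a hypothetical helper name I made up for this post.

```python
import os
import shutil


def resolve_command(name, path=None):
    """Return the real file behind `name` on PATH (symlinks followed), or None if not found."""
    found = shutil.which(name, path=path)
    return os.path.realpath(found) if found else None


if __name__ == "__main__":
    for cmd in ("python3", "pip3", "scrapy"):
        print(cmd, "->", resolve_command(cmd))
```

With the setup above, each command should resolve to a file under /usr/local/python3/bin. A result of None means the symlink is missing or the target is not executable.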

III. Create a Scrapy project

1. Command to create a project
    scrapy startproject <project name> (my project is called mytest01)

[root@seeker-01 ~]# scrapy startproject mytest01
     New Scrapy project 'mytest01', using template directory '/usr/local/python3/lib/python3.7/site-packages/scrapy/templates/project', created in:
         /root/mytest01

     You can start your first spider with:
         cd mytest01
         scrapy genspider example example.com

    When the command finishes, a mytest01 directory has been created under /root.
    /root/mytest01/mytest01/spiders is where the spider .py files go; the file I created is named test01.py.

2. Write a spider
    In the /root/mytest01/mytest01/spiders directory, create a file named test01.py with the following content:

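import scrapy


class mingyan(scrapy.Spider):  # must inherit from scrapy.Spider
    name = "mytest01"  # spider name, used by "scrapy crawl"

    def start_requests(self):  # the requests for the links below are generated here
        urls = [  # links to crawl
            'http://lab.scrapyd.cn/page/1/',
            'http://lab.scrapyd.cn/page/2/',
        ]
        for url in urls:
            # how is a downloaded page handled? it is handed to the parse method
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        """
        start_requests has fetched the page, so how do we extract what we want? That is defined in this method.
        Nothing is extracted here yet; the page is simply saved. Extraction comes later,
        using xpath, regular expressions, or css. This example just shows how Scrapy runs:
        1. define the links;
        2. crawl (download) the pages via the links;
        3. define rules, then extract the data;
        That is the whole flow. Simple, isn't it?
        """
        page = response.url.split("/")[-2]     # take the page number from the link, e.g. /page/1/ gives 1
        filename = 'mingyan-%s.html' % page    # build the file name; for page 1 it is mingyan-1.html
        with open(filename, 'wb') as f:        # ordinary Python file handling
            f.write(response.body)             # response.body is the page that was just downloaded
        self.log('Saved file: %s' % filename)  # write a log entry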
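
The file name logic in parse can be tried on its own, without running Scrapy. This small sketch repeats the same two expressions on the two start URLs; page_filename is a hypothetical helper name used only for illustration.

```python
def page_filename(url):
    """Build the output file name the same way parse() does."""
    # The URL ends with '/', so the last piece after splitting is an empty
    # string and the second-to-last piece is the page number.
    page = url.split("/")[-2]
    return "mingyan-%s.html" % page


for url in ("http://lab.scrapyd.cn/page/1/", "http://lab.scrapyd.cn/page/2/"):
    print(url, "->", page_filename(url))
```

Note that this relies on the trailing slash. For a URL written as http://lab.scrapyd.cn/page/1 the second-to-last piece would be "page" rather than "1".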

3. Run the spider
    scrapy crawl mytest01

2020-08-28 14:27:15 [scrapy.utils.log] INFO: Scrapy 2.3.0 started (bot: mingyan2)
2020-08-28 14:27:15 [scrapy.utils.log] INFO: Versions: lxml 4.5.2.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.7.9 (default, Aug 28 2020, 13:28:49) - [GCC 4.8.5 20150623 (Red Hat 4.8.5-39)], pyOpenSSL 19.1.0 (OpenSSL 1.1.1g  21 Apr 2020), cryptography 3.1, Platform Linux-3.10.0-1062.12.1.el7.x86_64-x86_64-with-centos-7.6.1810-Core
2020-08-28 14:27:15 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
2020-08-28 14:27:15 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'MyTest01',
 'NEWSPIDER_MODULE': 'MyTest01.spiders',
 'ROBOTSTXT_OBEY': True,
 'SPIDER_MODULES': ['MyTest01.spiders']}
2020-08-28 14:27:15 [scrapy.extensions.telnet] INFO: Telnet Password: 168ea9bb811b4d38
2020-08-28 14:27:15 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats']
2020-08-28 14:27:15 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-08-28 14:27:15 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2020-08-28 14:27:15 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2020-08-28 14:27:15 [scrapy.core.engine] INFO: Spider opened
2020-08-28 14:27:15 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-08-28 14:27:15 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-08-28 14:27:15 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://lab.scrapyd.cn/robots.txt> (referer: None)
2020-08-28 14:27:15 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://lab.scrapyd.cn/page/1/> (referer: None)
2020-08-28 14:27:15 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://lab.scrapyd.cn/page/2/> (referer: None)
2020-08-28 14:27:15 [mingyan2] DEBUG: Saved file: mingyan-1.html
2020-08-28 14:27:15 [mingyan2] DEBUG: Saved file: mingyan-2.html
2020-08-28 14:27:15 [scrapy.core.engine] INFO: Closing spider (finished)
2020-08-28 14:27:15 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 666,
 'downloader/request_count': 3,
 'downloader/request_method_count/GET': 3,
 'downloader/response_bytes': 6436,
 'downloader/response_count': 3,
 'downloader/response_status_count/200': 2,
 'downloader/response_status_count/404': 1,
 'elapsed_time_seconds': 0.344661,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2020, 8, 28, 6, 27, 15, 851756),
 'log_count/DEBUG': 5,
 'log_count/INFO': 10,
 'memusage/max': 47456256,
 'memusage/startup': 47456256,
 'response_received_count': 3,
 'robotstxt/request_count': 1,
 'robotstxt/response_count': 1,
 'robotstxt/response_status_count/404': 1,
 'scheduler/dequeued': 2,
 'scheduler/dequeued/memory': 2,
 'scheduler/enqueued': 2,
 'scheduler/enqueued/memory': 2,
 'start_time': datetime.datetime(2020, 8, 28, 6, 27, 15, 507095)}
2020-08-28 14:27:15 [scrapy.core.engine] INFO: Spider closed (finished)

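Two files now appear in the directory where the command was run (in my case /root/mytest01/mytest01/spiders):
mingyan-1.html
mingyan-2.html
This shows that the environment has been set up successfully.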
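
To check the result without listing the directory by hand, something like the following could be used. It is only a sketch: saved_pages is a hypothetical helper, and the expected names follow the 'mingyan-%s.html' pattern from the parse method.

```python
import os


def saved_pages(directory, pages=(1, 2)):
    """Map each expected file name to True if it exists in `directory` and is non-empty."""
    result = {}
    for page in pages:
        name = "mingyan-%s.html" % page
        path = os.path.join(directory, name)
        result[name] = os.path.isfile(path) and os.path.getsize(path) > 0
    return result


if __name__ == "__main__":
    # "." is the current directory; run this from where "scrapy crawl" was run.
    print(saved_pages("."))
```

Both entries being True means both pages were downloaded and written to disk.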