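I am no expert in web scraping. When I need information from a page, I handle simple cases with a quick script and fall back on chrome-mini to dig deeper for the complex ones. A couple of days ago I went to a meetup on scraping techniques and found the field fascinating, so I came back and set up a Scrapy environment to start learning.
Since I am a beginner, I hit a snag at nearly every step of the setup. I am writing the process down here in the hope that it helps others who are new to this.
I. Install Python 3.7.9
CentOS 7 ships with Python 2.7.5. I first tried installing Scrapy 2 directly under that version. The installation finished without errors, but running a program failed with a complaint that the Python 2 version was too old (Scrapy 2.x no longer supports Python 2), so I had to install Python 3.
1. Install the development tools group in one go; building software from source needs it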
yum -y groupinstall "Development tools"
2. Install the dependency packages
yum -y install zlib-devel bzip2-devel openssl-devel ncurses-devel sqlite-devel readline-devel tk-devel gdbm-devel db4-devel libpcap-devel xz-devel libffi-devel
3. Download the Python 3.7.9 source tarball
wget "https://www.python.org/ftp/python/3.7.9/Python-3.7.9.tgz"
4. Extract the archive
tar -zxvf Python-3.7.9.tgz
5. Enter the source directory
cd Python-3.7.9
6. Configure the build (the install prefix is /usr/local/python3)
./configure --prefix=/usr/local/python3
7. Compile and install
make && make install
8. Create symlinks so that python3 and pip3 are on the PATH
ln -s /usr/local/python3/bin/python3 /usr/bin/python3
ln -s /usr/local/python3/bin/pip3 /usr/bin/pip3
9. Verify (the system python stays at 2.7.5, while python3 is the new build)
[root@seeker-01 ~]# python
Python 2.7.5 (default, Aug 7 2019, 00:51:29)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-39)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>>
[root@seeker-01 ~]# python3
Python 3.7.9 (default, Aug 28 2020, 13:28:49)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-39)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>>
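A source build can finish even when some optional standard-library modules were skipped because a -devel package was missing, and the problem then only surfaces later (for example when pip3 needs ssl). As a rough sketch, the check below tries to import the modules that correspond to several of the packages from step 2. The module list and the helper name missing_modules are my own choices for illustration and are not part of the original steps.

```python
import importlib
import sys

# Standard-library modules that depend on -devel packages from step 2
# (zlib-devel, bzip2-devel, openssl-devel, sqlite-devel, readline-devel,
# xz-devel, libffi-devel). Illustrative, not an exhaustive list.
OPTIONAL_MODULES = ["zlib", "bz2", "ssl", "sqlite3", "readline", "lzma", "ctypes"]


def missing_modules(names):
    """Return the module names from `names` that fail to import."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    print("Python", sys.version.split()[0])
    print("Missing modules:", missing_modules(OPTIONAL_MODULES) or "none")
```

Saved to a file and run with python3, this should report no missing modules on a build where all the dependency packages were present. If something is listed, installing the matching -devel package and rebuilding is the usual remedy.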
II. Install Scrapy 2
1. Install Scrapy 2 with pip3
pip3 install scrapy
2. Verify Scrapy 2 in the python3 shell
[root@seeker-01 ~]# python3
Python 3.7.9 (default, Aug 28 2020, 13:28:49)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-39)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import scrapy
>>> scrapy.version_info
(2, 3, 0)
>>>
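The same check can be scripted, which is handy when more than one interpreter is installed on the machine. This is a minimal sketch: the helper version_at_least and the minimum of (2, 0, 0) are assumptions of mine for illustration, and the import is guarded so the script still runs where Scrapy is absent.

```python
def version_at_least(version_info, minimum):
    """Return True if a version tuple is greater than or equal to `minimum`."""
    return tuple(version_info) >= tuple(minimum)


try:
    import scrapy
except ImportError:
    # Scrapy was installed for a different interpreter, or not at all.
    print("scrapy is not importable from this interpreter")
else:
    ok = version_at_least(scrapy.version_info, (2, 0, 0))
    print("Scrapy", scrapy.version_info, "is 2.x or newer:", ok)
```

Python compares tuples element by element, so (2, 3, 0) counts as newer than (2, 0, 0) and (1, 8, 0) does not.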
3. Create a symlink for the scrapy command
ln -s /usr/local/python3/bin/scrapy /usr/bin/scrapy
4. Verify Scrapy 2 from the command line
[root@seeker-01 ~]# scrapy
Scrapy 2.3.0 - no active project

Usage:
  scrapy <command> [options] [args]

Available commands:
  bench         Run quick benchmark test
  commands
  fetch         Fetch a URL using the Scrapy downloader
  genspider     Generate new spider using pre-defined templates
  runspider     Run a self-contained spider (without creating a project)
  settings      Get settings values
  shell         Interactive scraping console
  startproject  Create new project
  version       Print Scrapy version
  view          Open URL in browser, as seen by Scrapy

  [ more ]      More commands available when run from project directory

Use "scrapy <command> -h" to see more info about a command
III. Create a Scrapy project
1. Project creation command
scrapy startproject <project name> (my project is called mytest01)
[root@seeker-01 ~]# scrapy startproject mytest01
New Scrapy project 'mytest01', using template directory '/usr/local/python3/lib/python3.7/site-packages/scrapy/templates/project', created in:
    /root/mytest01

You can start your first spider with:
    cd mytest01
    scrapy genspider example example.com
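When the command finishes, a mytest01 directory has been created under /root.
/root/mytest01/mytest01/spiders is where the spider .py files go; the file I created there is named test01.py.
2. Write a spider
Create a file named test01.py in /root/mytest01/mytest01/spiders with the following content: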
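Besides the spiders directory, startproject generates a handful of other files. To make the layout concrete, here is a small sketch that checks whether the files I would expect from the default Scrapy 2.3 project template are present. The helper names are hypothetical, and the file list reflects the default template as I understand it, so treat it as an assumption if your Scrapy version differs.

```python
from pathlib import Path


def expected_files(project_name):
    """Relative paths that `scrapy startproject` is expected to create."""
    package = Path(project_name)
    return [
        Path("scrapy.cfg"),                   # project-level config file
        package / "__init__.py",
        package / "items.py",                 # item definitions
        package / "middlewares.py",
        package / "pipelines.py",
        package / "settings.py",              # project settings
        package / "spiders" / "__init__.py",  # spiders live in this package
    ]


def missing_files(project_root, project_name):
    """Return the expected files that do not exist under `project_root`."""
    root = Path(project_root)
    return [str(p) for p in expected_files(project_name) if not (root / p).is_file()]


if __name__ == "__main__":
    # /root/mytest01 is the project directory from the output above.
    print(missing_files("/root/mytest01", "mytest01"))
```

An empty list means the project skeleton is complete.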
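import scrapy


class mingyan(scrapy.Spider):  # a spider must inherit from scrapy.Spider
    name = "mytest01"  # spider name, used by "scrapy crawl"

    def start_requests(self):  # Scrapy calls this to obtain the initial requests
        urls = [  # links to crawl
            'http://lab.scrapyd.cn/page/1/',
            'http://lab.scrapyd.cn/page/2/',
        ]
        for url in urls:
            # each downloaded page is handed to the parse method
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        """
        start_requests has already fetched the pages, so how do we extract
        the content we want? That is defined in this method.
        Nothing is extracted here yet: the page is simply saved to disk.
        Extracting data with XPath, regular expressions, or CSS selectors
        comes later. This example only shows the flow of a Scrapy run:
        1. define the links;
        2. crawl (download) the pages behind the links;
        3. define rules, then extract the data.
        That is the whole flow. Simple, isn't it?
        """
        page = response.url.split("/")[-2]  # page number from the URL, e.g. /page/1/ gives 1
        filename = 'mingyan-%s.html' % page  # build the file name; page 1 becomes mingyan-1.html
        with open(filename, 'wb') as f:  # opened in binary mode because response.body is bytes
            f.write(response.body)  # response.body holds the page that was just downloaded
        self.log('保存文件: %s' % filename)  # log a message ('保存文件' means 'saved file')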
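The file-name logic in parse is easy to try outside Scrapy. The sketch below reproduces the split("/")[-2] expression on plain strings and shows one caveat: it relies on the URL ending with a slash. The second helper, page_from_url_robust, is a hypothetical variant of my own and is not part of the spider above.

```python
def page_from_url(url):
    """Same expression as in parse(): second-to-last piece after splitting on '/'."""
    return url.split("/")[-2]


def page_from_url_robust(url):
    """Hypothetical variant: last path segment, with or without a trailing slash."""
    return url.rstrip("/").rsplit("/", 1)[-1]


url = "http://lab.scrapyd.cn/page/1/"
print(url.split("/"))  # ['http:', '', 'lab.scrapyd.cn', 'page', '1', '']
print("mingyan-%s.html" % page_from_url(url))  # mingyan-1.html

# Without the trailing slash the original expression picks the wrong segment.
print(page_from_url("http://lab.scrapyd.cn/page/1"))         # page
print(page_from_url_robust("http://lab.scrapyd.cn/page/1"))  # 1
```

For the two URLs used in this spider the original expression is fine, since both end with a slash.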
3. Run the spider (from inside the project directory, using the spider name defined in the class)
scrapy crawl mytest01
2020-08-28 14:27:15 [scrapy.utils.log] INFO: Scrapy 2.3.0 started (bot: mytest01)
2020-08-28 14:27:15 [scrapy.utils.log] INFO: Versions: lxml 4.5.2.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.7.9 (default, Aug 28 2020, 13:28:49) - [GCC 4.8.5 20150623 (Red Hat 4.8.5-39)], pyOpenSSL 19.1.0 (OpenSSL 1.1.1g 21 Apr 2020), cryptography 3.1, Platform Linux-3.10.0-1062.12.1.el7.x86_64-x86_64-with-centos-7.6.1810-Core
2020-08-28 14:27:15 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
2020-08-28 14:27:15 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'mytest01',
'NEWSPIDER_MODULE': 'mytest01.spiders',
'ROBOTSTXT_OBEY': True,
'SPIDER_MODULES': ['mytest01.spiders']}
2020-08-28 14:27:15 [scrapy.extensions.telnet] INFO: Telnet Password: 168ea9bb811b4d38
2020-08-28 14:27:15 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.logstats.LogStats']
2020-08-28 14:27:15 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-08-28 14:27:15 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2020-08-28 14:27:15 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2020-08-28 14:27:15 [scrapy.core.engine] INFO: Spider opened
2020-08-28 14:27:15 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-08-28 14:27:15 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-08-28 14:27:15 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://lab.scrapyd.cn/robots.txt> (referer: None)
2020-08-28 14:27:15 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://lab.scrapyd.cn/page/1/> (referer: None)
2020-08-28 14:27:15 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://lab.scrapyd.cn/page/2/> (referer: None)
2020-08-28 14:27:15 [mytest01] DEBUG: 保存文件: mingyan-1.html
2020-08-28 14:27:15 [mytest01] DEBUG: 保存文件: mingyan-2.html
2020-08-28 14:27:15 [scrapy.core.engine] INFO: Closing spider (finished)
2020-08-28 14:27:15 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 666,
'downloader/request_count': 3,
'downloader/request_method_count/GET': 3,
'downloader/response_bytes': 6436,
'downloader/response_count': 3,
'downloader/response_status_count/200': 2,
'downloader/response_status_count/404': 1,
'elapsed_time_seconds': 0.344661,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2020, 8, 28, 6, 27, 15, 851756),
'log_count/DEBUG': 5,
'log_count/INFO': 10,
'memusage/max': 47456256,
'memusage/startup': 47456256,
'response_received_count': 3,
'robotstxt/request_count': 1,
'robotstxt/response_count': 1,
'robotstxt/response_status_count/404': 1,
'scheduler/dequeued': 2,
'scheduler/dequeued/memory': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'start_time': datetime.datetime(2020, 8, 28, 6, 27, 15, 507095)}
2020-08-28 14:27:15 [scrapy.core.engine] INFO: Spider closed (finished)
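The stats dump at the end of the log is worth a second look. As a small sketch, the snippet below copies a few values from it and works out two things: how many responses were actual pages rather than robots.txt, and that elapsed_time_seconds is simply finish_time minus start_time. The helper names are my own. The timestamps in the stats read 06:27 while the log lines read 14:27, which is consistent with the stats being recorded in UTC and the log in local time (UTC+8 here).

```python
import datetime

# Values copied from the stats dump above.
stats = {
    "downloader/response_count": 3,
    "downloader/response_status_count/200": 2,
    "downloader/response_status_count/404": 1,
    "robotstxt/response_count": 1,
    "elapsed_time_seconds": 0.344661,
    "start_time": datetime.datetime(2020, 8, 28, 6, 27, 15, 507095),
    "finish_time": datetime.datetime(2020, 8, 28, 6, 27, 15, 851756),
}


def page_responses(stats):
    """Responses that were real pages, i.e. excluding the robots.txt request."""
    return stats["downloader/response_count"] - stats.get("robotstxt/response_count", 0)


def elapsed_seconds(stats):
    """Run time derived from the two timestamps."""
    return (stats["finish_time"] - stats["start_time"]).total_seconds()


print(page_responses(stats))   # 2
print(elapsed_seconds(stats))  # 0.344661
```

So the three requests are one robots.txt lookup (the 404, triggered by ROBOTSTXT_OBEY being True) plus the two pages from start_requests.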
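Two files now appear in the directory the command was run from (in my case /root/mytest01/mytest01/spiders), because the spider opens them with a relative path:
mingyan-1.html
mingyan-2.html
This shows that the environment has been set up successfully.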
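The spider code writes its output as mingyan-<page>.html into the current working directory, so a final check can be scripted as well. This is only a sketch: the helper saved_pages and its default pattern are assumptions of mine, based on the file name used in parse.

```python
from pathlib import Path


def saved_pages(directory, pattern="mingyan-*.html"):
    """Return the sorted names of non-empty files in `directory` matching `pattern`."""
    return sorted(
        p.name
        for p in Path(directory).glob(pattern)
        if p.is_file() and p.stat().st_size > 0
    )


if __name__ == "__main__":
    # Run this from the directory where "scrapy crawl mytest01" was executed.
    print(saved_pages("."))
```

After a successful crawl of the two pages, the list should contain both saved files.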