Installing Scrapy on CentOS

小诺N, 2013-02-27 14:45:33

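© Copyright belongs to the author: this is an original work by 51CTO blog author 小诺N. Please contact the author for permission to reprint; otherwise legal liability will be pursued.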

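Scrapy is an open-source, single-machine web crawler written in Python and built on the Twisted framework. It is effectively a toolkit that covers most web-scraping needs, handling both the download side and the extraction side of a crawler.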

Environment:

CentOS 5.4
Python 2.7.3

Installation steps:

1. Download and build Python 2.7: http://www.python.org/ftp/python/2.7.3/Python-2.7.3.tgz

[root@zxy-websgs ~]# wget http://www.python.org/ftp/python/2.7.3/Python-2.7.3.tgz -P /opt
[root@zxy-websgs ~]# cd /opt
[root@zxy-websgs opt]# tar xvf Python-2.7.3.tgz
[root@zxy-websgs opt]# cd Python-2.7.3
[root@zxy-websgs Python-2.7.3]# ./configure
[root@zxy-websgs Python-2.7.3]# make && make install

Verify the Python 2.7 installation:

[root@zxy-websgs Python-2.7.3]# python2.7
Python 2.7.3 (default, Feb 28 2013, 03:08:43)
[GCC 4.1.2 20080704 (Red Hat 4.1.2-50)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> exit()
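
The same check can be scripted, which helps when the system `python` may still point at an older interpreter. The following is only a minimal sketch: the name `is_new_enough` and the `MINIMUM` constant are made up for illustration, and the (2, 7) threshold is an assumption taken from the version built in this guide.

```python
import sys

# Assumption: 2.7 is the minimum, because that is the version built above.
MINIMUM = (2, 7)


def is_new_enough(version_info, minimum=MINIMUM):
    """Return True if (major, minor) of version_info is at least `minimum`."""
    return tuple(version_info[:2]) >= tuple(minimum)


# Warn on stderr if the interpreter running this script is too old.
if not is_new_enough(sys.version_info):
    sys.stderr.write("Python %d.%d is too old\n" % tuple(sys.version_info[:2]))
```

Run it with `python2.7 check.py` (the file name is arbitrary) to confirm that the newly built interpreter is the one being used.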

2. Install setuptools

[root@zxy-websgs ~]# wget http://pypi.python.org/packages/source/s/setuptools/setuptools-0.6c11.tar.gz -P /opt/
[root@zxy-websgs ~]# cd /opt
[root@zxy-websgs opt]# tar zxvf setuptools-0.6c11.tar.gz
[root@zxy-websgs opt]# cd setuptools-0.6c11
[root@zxy-websgs setuptools-0.6c11]# python2.7 setup.py install

setuptools:http://pypi.python.org/packages/source/s/setuptools/setuptools-0.6c11.tar.gz

3. Install Twisted

[root@zxy-websgs setuptools-0.6c11]# easy_install Twisted  
......  
Installed /usr/local/lib/python2.7/site-packages/Twisted-12.3.0-py2.7-linux-x86_64.egg  
......  
Installed /usr/local/lib/python2.7/site-packages/zope.interface-4.0.4-py2.7-linux-x86_64.egg

Twisted requires zope.interface. Both can also be downloaded from the addresses below:

zope.interface:http://pypi.python.org/packages/source/z/zope.interface/zope.interface-4.0.1.tar.gz

twisted:http://twistedmatrix.com/Releases/Twisted/12.1/Twisted-12.1.0.tar.bz2

5. Install w3lib

[root@zxy-websgs setuptools-0.6c11]# easy_install -U w3lib
Searching for w3lib
Reading http://pypi.python.org/simple/w3lib/
Reading http://github.com/scrapy/w3lib
Best match: w3lib 1.2
Downloading http://pypi.python.org/packages/source/w/w3lib/w3lib-1.2.tar.gz#md5=f929d5973a9fda59587b09a72f185a9e
Processing w3lib-1.2.tar.gz
Running w3lib-1.2/setup.py -q bdist_egg --dist-dir /tmp/easy_install-wm_1BB/w3lib-1.2/egg-dist-tmp-2DQHY_
zip_safe flag not set; analyzing archive contents...
Adding w3lib 1.2 to easy-install.pth file
Installed /usr/local/lib/python2.7/site-packages/w3lib-1.2-py2.7.egg
Processing dependencies for w3lib
Finished processing dependencies for w3lib

w3lib:http://pypi.python.org/packages/source/w/w3lib/w3lib-1.2.tar.gz
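
The download URL in the easy_install output carries an `#md5=` fragment. If you fetch an archive by hand instead, a check along the following lines can confirm the file is intact. This is a sketch only: the function names `expected_md5`, `file_md5` and `archive_is_intact` are hypothetical, and it assumes the checksum is given in the `#md5=<hex>` form seen above.

```python
import hashlib


def expected_md5(url):
    """Return the hex digest that follows '#md5=' in a URL, or None."""
    marker = "#md5="
    if marker not in url:
        return None
    return url.split(marker, 1)[1].strip().lower()


def file_md5(path, chunk_size=65536):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def archive_is_intact(path, url):
    """True only if the URL carries a checksum and the file matches it."""
    expected = expected_md5(url)
    return expected is not None and file_md5(path) == expected
```

For example, `archive_is_intact("/opt/w3lib-1.2.tar.gz", url)` would be called with the full download URL from the output above, including its fragment.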

6. Install libxml2, or install lxml with easy_install

[root@zxy-websgs lxml-3.1.0]# easy_install lxml

Verify the lxml installation:

[root@zxy-websgs lxml-3.1.0]# python2.7
Python 2.7.3 (default, Feb 28 2013, 03:08:43)
[GCC 4.1.2 20080704 (Red Hat 4.1.2-50)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import lxml 
>>> exit()

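Alternatively, you can install libxml2. The official site recommends version 2.6.28 or later, but I could not find that version there. I first installed version 2.6.9, and running scrapy then failed with the following error: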

Traceback (most recent call last): 
  File "/usr/local/bin/scrapy", line 5, in <module> 
    pkg_resources.run_script('Scrapy==0.14.4', 'scrapy') 
  File "build/bdist.linux-x86_64/egg/pkg_resources.py", line 489, in run_script 
  File "build/bdist.linux-x86_64/egg/pkg_resources.py", line 1207, in run_script 
  File "/usr/local/lib/python2.7/site-packages/Scrapy-0.14.4-py2.7.egg/EGG-INFO/scripts/scrapy", line 4, in <module> 
    execute() 
  File "/usr/local/lib/python2.7/site-packages/Scrapy-0.14.4-py2.7.egg/scrapy/cmdline.py", line 112, in execute 
    cmds = _get_commands_dict(inproject) 
  File "/usr/local/lib/python2.7/site-packages/Scrapy-0.14.4-py2.7.egg/scrapy/cmdline.py", line 37, in _get_commands_dict 
    cmds = _get_commands_from_module('scrapy.commands', inproject) 
  File "/usr/local/lib/python2.7/site-packages/Scrapy-0.14.4-py2.7.egg/scrapy/cmdline.py", line 30, in _get_commands_from_module 
    for cmd in _iter_command_classes(module): 
  File "/usr/local/lib/python2.7/site-packages/Scrapy-0.14.4-py2.7.egg/scrapy/cmdline.py", line 21, in _iter_command_classes 
    for module in walk_modules(module_name): 
  File "/usr/local/lib/python2.7/site-packages/Scrapy-0.14.4-py2.7.egg/scrapy/utils/misc.py", line 65, in walk_modules 
    submod = __import__(fullpath, {}, {}, ['']) 
  File "/usr/local/lib/python2.7/site-packages/Scrapy-0.14.4-py2.7.egg/scrapy/commands/shell.py", line 8, in <module> 
    from scrapy.shell import Shell 
  File "/usr/local/lib/python2.7/site-packages/Scrapy-0.14.4-py2.7.egg/scrapy/shell.py", line 14, in <module> 
    from scrapy.selector import XPathSelector, XmlXPathSelector, HtmlXPathSelector 
  File "/usr/local/lib/python2.7/site-packages/Scrapy-0.14.4-py2.7.egg/scrapy/selector/__init__.py", line 30, in <module> 
    from scrapy.selector.libxml2sel import * 
  File "/usr/local/lib/python2.7/site-packages/Scrapy-0.14.4-py2.7.egg/scrapy/selector/libxml2sel.py", line 12, in <module> 
    from .factories import xmlDoc_from_html, xmlDoc_from_xml 
  File "/usr/local/lib/python2.7/site-packages/Scrapy-0.14.4-py2.7.egg/scrapy/selector/factories.py", line 14, in <module> 
    libxml2.HTML_PARSE_NOERROR + \ 
AttributeError: 'module' object has no attribute 'HTML_PARSE_RECOVER'

The error went away after upgrading to version 2.6.21.

libxml2 2.6.21: ftp://xmlsoft.org/libxml2/python/libxml2-python-2.6.21.tar.gz
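
The traceback fails on a missing constant, so one way to tell whether a given libxml2 binding is new enough is to probe for the constants directly rather than trust the version number. The following is a rough sketch: `parse_version` and `missing_attributes` are hypothetical helper names, and the `NEEDED` list holds only the two constants visible in the traceback, so Scrapy may well require others.

```python
def parse_version(text):
    """Turn '2.6.21' into (2, 6, 21) so that versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def missing_attributes(module, names):
    """Return the names from `names` that `module` does not define."""
    return [name for name in names if not hasattr(module, name)]


# Constants that appear in the traceback above (not an exhaustive list).
NEEDED = ["HTML_PARSE_RECOVER", "HTML_PARSE_NOERROR"]

# Usage, assuming the libxml2 Python binding is installed:
#   import libxml2
#   print(missing_attributes(libxml2, NEEDED))
```

Comparing parsed tuples instead of raw strings matters here: as plain strings, "2.6.9" sorts after "2.6.28", which would hide exactly the kind of outdated version that caused the error.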

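7. Install pyOpenSSL (optional; mainly so that Scrapy can support HTTPS)

Running easy_install pyOpenSSL fetched pyOpenSSL-0.13, which failed to install, so I downloaded version 0.11 manually and installed that instead.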

[root@zxy-websgs opt]# wget http://launchpadlibrarian.net/58498441/pyOpenSSL-0.11.tar.gz -P /opt
[root@zxy-websgs opt]# tar zxvf pyOpenSSL-0.11.tar.gz
[root@zxy-websgs opt]# cd pyOpenSSL-0.11
[root@zxy-websgs pyOpenSSL-0.11]# python2.7 setup.py install

pyOpenSSL:http://launchpadlibrarian.net/58498441/pyOpenSSL-0.11.tar.gz
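
Before moving on to Scrapy itself, it can be worth confirming that every dependency installed so far is importable from python2.7. The following is a minimal sketch: `find_missing` is a hypothetical helper, and the module lists reflect my reading of the steps above (pyOpenSSL is imported under the name `OpenSSL`).

```python
import importlib

# Import names for the packages installed in the earlier steps.
REQUIRED = ["twisted", "zope.interface", "w3lib", "lxml"]
# pyOpenSSL is optional and is imported as "OpenSSL".
OPTIONAL = ["OpenSSL"]


def find_missing(module_names):
    """Return the names from module_names that cannot be imported."""
    missing = []
    for name in module_names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing

# Usage:
#   print("missing required: %s" % find_missing(REQUIRED))
#   print("missing optional: %s" % find_missing(OPTIONAL))
```

Only `ImportError` is caught, so a package that imports but is broken in some other way will still raise its own error.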

8. Install Scrapy

[root@zxy-websgs pyOpenSSL-0.11]# easy_install -U Scrapy

Verify the installation:

[root@zxy-websgs pyOpenSSL-0.11]# scrapy 
Scrapy 0.16.4 - no active project 
 
Usage: 
  scrapy <command> [options] [args] 
 
Available commands: 
  fetch         Fetch a URL using the Scrapy downloader 
  runspider     Run a self-contained spider (without creating a project) 
  settings      Get settings values 
  shell         Interactive scraping console 
  startproject  Create new project 
  version       Print Scrapy version 
  view          Open URL in browser, as seen by Scrapy 
 
  [ more ]      More commands available when run from project directory 
 
Use "scrapy <command> -h" to see more info about a command
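
If the check needs to run from a script, for instance on several machines, the version can be read from the first line of the banner. The following is a sketch under assumptions: it expects a banner of the form "Scrapy 0.16.4 - no active project" as shown above, and both function names are hypothetical.

```python
import re
import subprocess

# Matches a leading "Scrapy <dotted version>", as in the banner above.
BANNER_RE = re.compile(r"^Scrapy (\d+(?:\.\d+)*)")


def scrapy_version_from_banner(text):
    """Return the version as a tuple of ints, or None if no banner line is found."""
    for line in text.splitlines():
        match = BANNER_RE.match(line.strip())
        if match:
            return tuple(int(part) for part in match.group(1).split("."))
    return None


def read_scrapy_banner(command=("scrapy",)):
    """Run the scrapy command (assumed to be on PATH) and return its output."""
    proc = subprocess.Popen(list(command), stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT)
    output, _ = proc.communicate()
    return output.decode("utf-8", "replace")

# Usage:
#   print(scrapy_version_from_banner(read_scrapy_banner()))
```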