Many readers already know that the Scrapy framework can be used to crawl data. Using the cnblogs homepage as an example, this article walks through the whole process of scraping data with a spider.

This article covers:

  • Setting up the crawling environment
  • The spider code in detail
  • Frequently asked questions

I. Environment Setup

1. Install Scrapy

pip install scrapy
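
You can confirm that the installation succeeded by printing the installed version:

scrapy version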

2. Install Docker (Windows)

  • Enable CPU virtualization in the BIOS

  • Enable Hyper-V
    Steps: Control Panel > Programs > Turn Windows features on or off

  • Turn off the firewall
  • Get the installer
    Official download page: https://www.docker.com/products/docker-desktop
    Readers can also follow the WeChat public account 哥说网事 and send 138 to quickly get a cloud-drive download link and extraction code
  • The installation itself is no different from any ordinary Windows program, so it is left to the reader.
  • Start Docker
    Launch Docker Desktop from the Start menu. A Docker icon will appear in the system tray; after a few minutes, hovering over it shows "Docker Desktop is running" (you can also check from the command line, as shown below)
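
To confirm that the Docker engine is reachable, you can also run the standard version check from a command prompt:

docker version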

3. Install Splash

  • Run the command below in a terminal and wait a while until the image has been pulled successfully.
docker pull scrapinghub/splash
  • Start Splash
docker run -p 8050:8050 scrapinghub/splash

You can verify that it is working by opening http://localhost:8050/ in a browser.
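
If you prefer to verify from code, the short sketch below asks Splash to render the cnblogs homepage through its render.html HTTP endpoint; it assumes Splash is listening on localhost:8050 as started above and that the requests package is installed.

import requests

# Ask Splash to fetch and render a page, returning the resulting HTML.
resp = requests.get(
    'http://localhost:8050/render.html',
    params={'url': 'https://www.cnblogs.com/', 'wait': 2},  # let JavaScript run for 2 seconds
    timeout=30,
)
print(resp.status_code)   # 200 means Splash rendered the page successfully
print(len(resp.text))     # size of the rendered HTML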

4. Install Scrapy-Splash

pip install scrapy-splash

Scrapy-Splash is a Python library; once it is installed, the Splash service can be used from within Scrapy.
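
As a minimal sketch of what that looks like (the full, configured spider is developed in section II), a spider yields SplashRequest objects instead of plain scrapy.Request objects, and Splash renders each page before your callback sees it. Note that this still requires the Splash-related settings added to settings.py in section II.

import scrapy
from scrapy_splash import SplashRequest

class DemoSpider(scrapy.Spider):
    name = 'demo'

    def start_requests(self):
        # Splash loads the page, waits 2 seconds for JavaScript, then returns the rendered HTML
        yield SplashRequest('https://www.cnblogs.com/', self.parse, args={'wait': 2})

    def parse(self, response):
        self.log(response.url)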

II. Writing Your Own Spider

For an introduction to the Scrapy framework and its basic usage, see the tutorial at http://www.scrapyd.cn/doc/140.html, which is clear and easy to follow.

1. Create a new project

First go to your workspace directory, e.g. D:\dev\projects\, and run the following command:

scrapy startproject myspider
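
The command generates a project skeleton roughly like the following (the standard Scrapy template):

myspider/
    scrapy.cfg            # deployment configuration
    myspider/
        __init__.py
        items.py          # item definitions
        middlewares.py    # spider and downloader middlewares
        pipelines.py      # item pipelines
        settings.py       # project settings, edited in step 4 below
        spiders/
            __init__.py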

2. Enter the myspider directory

cd myspider

3. Generate the cnblogs spider from the default template

scrapy genspider cnblogs cnblogs.com
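
This creates myspider/spiders/cnblogs.py with a skeleton roughly like the following; we will replace its contents in step 5.

import scrapy


class CnblogsSpider(scrapy.Spider):
    name = 'cnblogs'
    allowed_domains = ['cnblogs.com']
    start_urls = ['http://cnblogs.com/']

    def parse(self, response):
        pass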

4. Open settings.py and configure Splash

Refer to the official documentation: https://github.com/scrapy-plugins/scrapy-splash

The configuration in my project looks like this:

# -*- coding: utf-8 -*-

# Scrapy settings for myspider project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'myspider'

SPIDER_MODULES = ['myspider.spiders']
NEWSPIDER_MODULE = 'myspider.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'myspider (+http://www.yourdomain.com)'
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.109 Safari/537.36'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
SPIDER_MIDDLEWARES = {
   # 'myspider.middlewares.MyspiderSpiderMiddleware': 543,
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    #'myspider.middlewares.MyspiderDownloaderMiddleware': 543,
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
# ITEM_PIPELINES = {
#    'myspider.pipelines.MyspiderPipeline': 300,
# }

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 0
HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
SPLASH_URL = 'http://localhost:8050'      # the Splash service running in the local Docker container
DUPEFILTER_CLASS = "scrapy_splash.SplashAwareDupeFilter"       # duplicate filter that understands Splash requests
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'  # HTTP cache storage that understands Splash requests

5. Write the spider logic

Open cnblogs.py and change the code as follows:

# -*- coding: utf-8 -*-
# cnblogs.py
import scrapy
from scrapy_splash import SplashRequest  # Splash-aware replacement for scrapy.Request
from myspider.spiders.db import Mysql
from myspider.spiders.db import DBTableItem

class CnblogsSpider(scrapy.Spider):
    name = 'cnblogs'
    start_urls = ['https://www.cnblogs.com/']  # the cnblogs homepage

    def start_requests(self):  # entry point: ask Splash to render the page below
        script = '''
                    function main(splash, args)
                      splash:go(args.url)                                  -- load the page
                      local scroll_to = splash:jsfunc("window.scrollTo")
                      scroll_to(0, 2800)                                   -- scroll down so lazy-loaded content appears
                      splash:set_viewport_full()
                      splash:wait(5)                                       -- give JavaScript 5 seconds to finish
                      return {html=splash:html()}                          -- return the rendered HTML to Scrapy
                    end
                '''

        url = "https://www.cnblogs.com/"  # the cnblogs homepage
        yield SplashRequest(url, self.parse, endpoint='execute', args={'lua_source': script, 'url': url})


    def parse(self, response):  # parse the rendered page; we use XPath here
        # article URLs
        url_list = response.xpath('//div[@class="post_item_body"]/h3/a/@href').extract()

        # article titles
        title_list = response.xpath('//div[@class="post_item_body"]/h3/a/text()').extract()

        # article summaries
        desc_list = []
        desc_list_tmp = response.xpath('//p[@class="post_item_summary"]/text()').extract()
        for data in desc_list_tmp:
            if not data.isspace():  # skip summaries that are only whitespace
                desc_list.append(data)

        # authors
        author_list = response.xpath('//div[@class="post_item_foot"]/a/text()').extract()
       
        item_list = []

        for url, title, description, author in zip(url_list, title_list, desc_list, author_list):
            item = DBTableItem(url, title, description, author)
            item_list.append(item)

        for data in item_list:
            self.log(data.description)

        mysql = Mysql()
        mysql.add2cnblogs(item_list)
        mysql.end()
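
With the Splash container from section I still running, the spider can now be started from the project root:

scrapy crawl cnblogs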

The database access code is shown below. To avoid storing duplicate articles, the url column carries a unique index, and the INSERT statement uses ON DUPLICATE KEY UPDATE so that existing rows are updated instead of duplicated. A possible table definition is sketched after the code.

import pymysql

class DBTableItem(object):
    # a plain value object representing one row of the cnblogs table
    def __init__(self, url, title, description, author):
        self.title = title  
        self.author = author
        self.url = url
        self.description = description
        self.category = ""
        
# helper class for MySQL access
class Mysql(object):
    def __init__(self):
        self.connect = pymysql.Connect(
            host='192.168.1.109',  # MySQL host IP
            port=3306,  # port
            user='root',  # user name
            passwd='qwertyuiop',  # database password
            db='myspider',  # database name
            charset='utf8mb4',  # character set
        )
        self.cursor = self.connect.cursor()

    def add2cnblogs(self, item_list):
        for data in item_list:
            # SQL for the cnblogs table; note the %s placeholders
            sql = "insert into cnblogs (title, author, url, description) values (%s,%s,%s,%s) ON DUPLICATE KEY UPDATE title=%s, author=%s, url=%s, description=%s"
            # parameters are passed separately: eight %s placeholders need eight matching values
            row_count = self.cursor.execute(sql, [data.title, data.author, data.url, data.description, data.title, data.author, data.url, data.description])
        # commit once after all rows have been processed
        self.connect.commit()

    def end(self):
        self.cursor.close()
        self.connect.close()
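
For reference, the cnblogs table that this code writes to could be defined roughly as follows. This is only a sketch: the column names match the INSERT statement above, but the types and sizes are my assumptions.

CREATE TABLE cnblogs (
    id          INT AUTO_INCREMENT PRIMARY KEY,
    title       VARCHAR(255),
    author      VARCHAR(100),
    url         VARCHAR(191) NOT NULL,   -- 191 chars keeps the unique index within MySQL's key-length limit under utf8mb4
    description TEXT,
    UNIQUE KEY uk_url (url)              -- the unique index that makes ON DUPLICATE KEY UPDATE take effect
) DEFAULT CHARSET = utf8mb4;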

III. FAQ

1. What does Splash do?

Splash is a JavaScript rendering service: a lightweight web browser with an HTTP API, through which JavaScript-heavy pages can be loaded and rendered.

2. Why install Docker?

Because on Windows, Splash runs on top of Docker.