Many readers already know that the Scrapy framework can be used to crawl data. Using the cnblogs homepage as an example, this article walks through the whole process of scraping data with a spider.

This article covers:

  • Setting up the crawling environment
  • The spider code in detail
  • Frequently asked questions

I. Environment Setup

1. Install Scrapy

pip install scrapy
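
You can confirm that the installation succeeded by printing the installed version:

scrapy version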

2. Install Docker (Windows)

  • Enable CPU virtualization in the BIOS

  • Enable Hyper-V
    Steps: Control Panel > Programs > Turn Windows features on or off

  • Turn off the firewall
  • Get the installer
    Official download page: https://www.docker.com/products/docker-desktop
    Readers can also follow the WeChat public account 哥说网事 and send 138 to quickly get a cloud-drive download link and extraction code
  • The installation itself is no different from any ordinary Windows program, so it is left to the reader.
  • Start Docker
    Launch Docker Desktop from the Start menu. A Docker icon will appear in the system tray; after a few minutes, hovering over it shows "Docker Desktop is running" (you can also check from the command line, as shown below)
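
To confirm that the Docker engine is reachable, you can also run the standard version check from a command prompt:

docker version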

3. Install Splash

  • Run the command below in a terminal and wait a while until the image has been pulled successfully.
docker pull scrapinghub/splash
  • Start Splash
docker run -p 8050:8050 scrapinghub/splash

You can verify that it is working by opening http://localhost:8050/ in a browser.
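
If you prefer to verify from code, the short sketch below asks Splash to render the cnblogs homepage through its render.html HTTP endpoint; it assumes Splash is listening on localhost:8050 as started above and that the requests package is installed.

import requests

# Ask Splash to fetch and render a page, returning the resulting HTML.
resp = requests.get(
    'http://localhost:8050/render.html',
    params={'url': 'https://www.cnblogs.com/', 'wait': 2},  # let JavaScript run for 2 seconds
    timeout=30,
)
print(resp.status_code)   # 200 means Splash rendered the page successfully
print(len(resp.text))     # size of the rendered HTML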

4. Install Scrapy-Splash

pip install scrapy-splash

Scrapy-Splash is a Python library; once it is installed, the Splash service can be used from within Scrapy.
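
As a minimal sketch of what that looks like (the full, configured spider is developed in section II), a spider yields SplashRequest objects instead of plain scrapy.Request objects, and Splash renders each page before your callback sees it. Note that this still requires the Splash-related settings added to settings.py in section II.

import scrapy
from scrapy_splash import SplashRequest

class DemoSpider(scrapy.Spider):
    name = 'demo'

    def start_requests(self):
        # Splash loads the page, waits 2 seconds for JavaScript, then returns the rendered HTML
        yield SplashRequest('https://www.cnblogs.com/', self.parse, args={'wait': 2})

    def parse(self, response):
        self.log(response.url)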

II. Writing Your Own Spider

For an introduction to the Scrapy framework and its basic usage, see the tutorial at http://www.scrapyd.cn/doc/140.html, which is clear and easy to follow.

1. Create a new project

First go to your workspace directory, e.g. D:\dev\projects\, and run the following command:

scrapy startproject myspider
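
The command generates a project skeleton roughly like the following (the standard Scrapy template):

myspider/
    scrapy.cfg            # deployment configuration
    myspider/
        __init__.py
        items.py          # item definitions
        middlewares.py    # spider and downloader middlewares
        pipelines.py      # item pipelines
        settings.py       # project settings, edited in step 4 below
        spiders/
            __init__.py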

2. Enter the myspider directory

cd myspider

3. Generate the cnblogs spider from the default template

scrapy genspider cnblogs cnblogs.com
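
This creates myspider/spiders/cnblogs.py with a skeleton roughly like the following; we will replace its contents in step 5.

import scrapy


class CnblogsSpider(scrapy.Spider):
    name = 'cnblogs'
    allowed_domains = ['cnblogs.com']
    start_urls = ['http://cnblogs.com/']

    def parse(self, response):
        pass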

4. Open settings.py and configure Splash

Refer to the official documentation: https://github.com/scrapy-plugins/scrapy-splash

The configuration in my project looks like this:

# -*- coding: utf-8 -*-

# Scrapy settings for myspider project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'myspider'

SPIDER_MODULES = ['myspider.spiders']
NEWSPIDER_MODULE = 'myspider.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'myspider (+http://www.yourdomain.com)'
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.109 Safari/537.36'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
SPIDER_MIDDLEWARES = {
   # 'myspider.middlewares.MyspiderSpiderMiddleware': 543,
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    #'myspider.middlewares.MyspiderDownloaderMiddleware': 543,
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
# ITEM_PIPELINES = {
#    'myspider.pipelines.MyspiderPipeline': 300,
# }

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 0
HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
SPLASH_URL = 'http://localhost:8050'      # the Splash service running in the local Docker container
DUPEFILTER_CLASS = "scrapy_splash.SplashAwareDupeFilter"       # duplicate filter that understands Splash requests
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'  # HTTP cache storage that understands Splash requests

5. Write the spider logic

Open cnblogs.py and change the code as follows:

# -*- coding: utf-8 -*-
# cnblogs.py
import scrapy
from scrapy_splash import SplashRequest  # Splash-aware replacement for scrapy.Request
from myspider.spiders.db import Mysql
from myspider.spiders.db import DBTableItem

class CnblogsSpider(scrapy.Spider):
    name = 'cnblogs'
    start_urls = ['https://www.cnblogs.com/']  # the cnblogs homepage

    def start_requests(self):  # entry point: ask Splash to render the page below
        script = '''
                    function main(splash, args)
                      splash:go(args.url)                                  -- load the page
                      local scroll_to = splash:jsfunc("window.scrollTo")
                      scroll_to(0, 2800)                                   -- scroll down so lazy-loaded content appears
                      splash:set_viewport_full()
                      splash:wait(5)                                       -- give JavaScript 5 seconds to finish
                      return {html=splash:html()}                          -- return the rendered HTML to Scrapy
                    end
                '''

        url = "https://www.cnblogs.com/"  # the cnblogs homepage
        yield SplashRequest(url, self.parse, endpoint='execute', args={'lua_source': script, 'url': url})


    def parse(self, response):  # parse the rendered page; we use XPath here
        # article URLs
        url_list = response.xpath('//div[@class="post_item_body"]/h3/a/@href').extract()

        # article titles
        title_list = response.xpath('//div[@class="post_item_body"]/h3/a/text()').extract()

        # article summaries
        desc_list = []
        desc_list_tmp = response.xpath('//p[@class="post_item_summary"]/text()').extract()
        for data in desc_list_tmp:
            if not data.isspace():  # skip summaries that are only whitespace
                desc_list.append(data)

        # authors
        author_list = response.xpath('//div[@class="post_item_foot"]/a/text()').extract()
       
        item_list = []

        for url, title, description, author in zip(url_list, title_list, desc_list, author_list):
            item = DBTableItem(url, title, description, author)
            item_list.append(item)

        for data in item_list:
            self.log(data.description)

        mysql = Mysql()
        mysql.add2cnblogs(item_list)
        mysql.end()
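
With the Splash container from section I still running, the spider can now be started from the project root:

scrapy crawl cnblogs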

The database access code is shown below. To avoid storing duplicate articles, the url column carries a unique index, and the INSERT statement uses ON DUPLICATE KEY UPDATE so that existing rows are updated instead of duplicated. A possible table definition is sketched after the code.

import pymysql

class DBTableItem(object):
    # a plain value object representing one row of the cnblogs table
    def __init__(self, url, title, description, author):
        self.title = title  
        self.author = author
        self.url = url
        self.description = description
        self.category = ""
        
# helper class for MySQL access
class Mysql(object):
    def __init__(self):
        self.connect = pymysql.Connect(
            host='192.168.1.109',  # MySQL host IP
            port=3306,  # port
            user='root',  # user name
            passwd='qwertyuiop',  # database password
            db='myspider',  # database name
            charset='utf8mb4',  # character set
        )
        self.cursor = self.connect.cursor()

    def add2cnblogs(self, item_list):
        for data in item_list:
            # SQL for the cnblogs table; note the %s placeholders
            sql = "insert into cnblogs (title, author, url, description) values (%s,%s,%s,%s) ON DUPLICATE KEY UPDATE title=%s, author=%s, url=%s, description=%s"
            # parameters are passed separately: eight %s placeholders need eight matching values
            row_count = self.cursor.execute(sql, [data.title, data.author, data.url, data.description, data.title, data.author, data.url, data.description])
        # commit once after all rows have been processed
        self.connect.commit()

    def end(self):
        self.cursor.close()
        self.connect.close()
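
For reference, the cnblogs table that this code writes to could be defined roughly as follows. This is only a sketch: the column names match the INSERT statement above, but the types and sizes are my assumptions.

CREATE TABLE cnblogs (
    id          INT AUTO_INCREMENT PRIMARY KEY,
    title       VARCHAR(255),
    author      VARCHAR(100),
    url         VARCHAR(191) NOT NULL,   -- 191 chars keeps the unique index within MySQL's key-length limit under utf8mb4
    description TEXT,
    UNIQUE KEY uk_url (url)              -- the unique index that makes ON DUPLICATE KEY UPDATE take effect
) DEFAULT CHARSET = utf8mb4;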

III. FAQ

1. What does Splash do?

Splash is a JavaScript rendering service: a lightweight web browser with an HTTP API, through which JavaScript-heavy pages can be loaded and rendered.

2. Why install Docker?

Because on Windows, Splash runs on top of Docker.