Many readers are probably already familiar with crawling data using the Scrapy framework. This article uses the cnblogs (博客园) homepage as an example and walks through the whole process of scraping data with a crawler.
The article covers:
- Setting up the crawler environment
- The crawler code in detail
- Frequently asked questions
I. Environment Setup
1. Install Scrapy
pip install scrapy
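To confirm the installation succeeded (an optional sanity check, not part of the original steps), you can print the installed version:
scrapy version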
2. Install Docker (Windows)
- Enable CPU virtualization
- Enable Hyper-V
Steps: Control Panel → Programs → Turn Windows features on or off
- Turn off the firewall
- Get the installer
Official download page: https://www.docker.com/products/docker-desktop
You can also follow the WeChat official account 哥说网事 and send 138 to get a cloud-drive download link and extraction code.
- Install
The installation process is no different from any ordinary Windows program, so it is left to the reader.
- Start Docker
Launch Docker Desktop from the Start menu. A Docker icon appears in the system tray; after a few minutes, hovering over it should show "Docker Desktop is running".
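To double-check that the Docker engine is actually up (again just an optional check), the following command should print both client and server version information:
docker version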
3. Install Splash
- Run the command below in a terminal and wait a while until the image has been pulled.
docker pull scrapinghub/splash
- Start Splash
docker run -p 8050:8050 scrapinghub/splash
You can open http://localhost:8050/ in a browser to verify that it is working.
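Besides checking in the browser, a small script can confirm that Splash actually renders pages. This is only an optional sanity check added here, assuming Splash is listening on localhost:8050; it calls Splash's standard render.html endpoint.
# check_splash.py - optional check that the Splash HTTP API renders pages
import requests

# render.html returns the page's HTML after JavaScript has executed
resp = requests.get(
    'http://localhost:8050/render.html',
    params={'url': 'https://www.cnblogs.com/', 'wait': 2},
)
print(resp.status_code)   # expect 200
print(len(resp.text))     # size of the rendered HTML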
4. Install Scrapy-Splash
pip install scrapy-splash
scrapy-splash is a Python library; once it is installed, the Splash service can be used from inside Scrapy.
II. Writing Your Own Crawler
For an introduction to the Scrapy framework and its basic usage, see the tutorial at http://www.scrapyd.cn/doc/140.html, which is clear and easy to follow.
1. Create a new project
First change into your projects directory, e.g. D:\dev\projects\, then run the following command:
scrapy startproject myspider
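This generates a project skeleton roughly like the one below (the exact layout may differ slightly between Scrapy versions):
myspider/
    scrapy.cfg            # deploy configuration file
    myspider/             # the project's Python module
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py       # project settings (edited in step 4)
        spiders/          # directory where spiders live
            __init__.py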
2. Enter the myspider directory
cd myspider
3. Generate the cnblogs spider from the default template
scrapy genspider cnblogs cnblogs.com
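This creates myspider/spiders/cnblogs.py with a skeleton roughly like the following (the exact template varies a little between Scrapy versions); the steps below replace it with the real crawling logic:
# -*- coding: utf-8 -*-
import scrapy


class CnblogsSpider(scrapy.Spider):
    name = 'cnblogs'
    allowed_domains = ['cnblogs.com']
    start_urls = ['http://cnblogs.com/']

    def parse(self, response):
        pass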
4. Open settings.py and configure Splash
See the official documentation: https://github.com/scrapy-plugins/scrapy-splash
The settings for my project are as follows:
# -*- coding: utf-8 -*-
# Scrapy settings for myspider project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://docs.scrapy.org/en/latest/topics/settings.html
# https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html
BOT_NAME = 'myspider'
SPIDER_MODULES = ['myspider.spiders']
NEWSPIDER_MODULE = 'myspider.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'myspider (+http://www.yourdomain.com)'
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.109 Safari/537.36'
# Obey robots.txt rules
ROBOTSTXT_OBEY = True
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32
# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
# Disable cookies (enabled by default)
#COOKIES_ENABLED = False
# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False
# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
# 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
# 'Accept-Language': 'en',
#}
# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
SPIDER_MIDDLEWARES = {
    # 'myspider.middlewares.MyspiderSpiderMiddleware': 543,
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    # 'myspider.middlewares.MyspiderDownloaderMiddleware': 543,
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
#}
# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
# ITEM_PIPELINES = {
# 'myspider.pipelines.MyspiderPipeline': 300,
# }
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False
# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 0
HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
SPLASH_URL = 'http://localhost:8050'  # the Splash service running in the local Docker container
DUPEFILTER_CLASS = "scrapy_splash.SplashAwareDupeFilter"
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
5. Write the spider logic
Open the cnblogs.py file and modify the code as follows:
# -*- coding: utf-8 -*-
# cnblogs.py
import scrapy
from scrapy_splash import SplashRequest  # Splash-aware replacement for scrapy.Request
from myspider.spiders.db import Mysql
from myspider.spiders.db import DBTableItem


class CnblogsSpider(scrapy.Spider):
    name = 'cnblogs'
    # Homepage of cnblogs (博客园), the site this article crawls
    start_urls = ['https://www.cnblogs.com/']

    def start_requests(self):
        # This method issues the Splash request that fetches and renders the page
        script = '''
        function main(splash, args)
            splash:go(args.url)
            local scroll_to = splash:jsfunc("window.scrollTo")
            scroll_to(0, 2800)
            splash:set_viewport_full()
            splash:wait(5)
            return {html=splash:html()}
        end
        '''
        url = "https://www.cnblogs.com/"
        yield SplashRequest(url, self.parse, endpoint='execute',
                            args={'lua_source': script, 'url': url})

    def parse(self, response):
        # Parsing callback; the fields are extracted with XPath
        # Article links
        url_list = response.xpath('//div[@class="post_item_body"]/h3/a/@href').extract()
        # Article titles
        title_list = response.xpath('//div[@class="post_item_body"]/h3/a/text()').extract()
        # Article summaries
        desc_list = []
        desc_list_tmp = response.xpath('//p[@class="post_item_summary"]/text()').extract()
        for data in desc_list_tmp:
            if not data.isspace():  # skip whitespace-only summaries
                desc_list.append(data)
        # Authors
        author_list = response.xpath('//div[@class="post_item_foot"]/a/text()').extract()
        item_list = []
        for url, title, description, author in zip(url_list, title_list, desc_list, author_list):
            item = DBTableItem(url, title, description, author)
            item_list.append(item)
        for data in item_list:
            self.log(data.description)
        mysql = Mysql()
        mysql.add2cnblogs(item_list)
        mysql.end()
The database access code is shown below. To avoid storing duplicate data, the url column is given a unique index, and the INSERT statement uses ON DUPLICATE KEY UPDATE.
# db.py (imported as myspider.spiders.db)
import pymysql


class DBTableItem(object):
    # Plain object representing one row of the cnblogs table
    def __init__(self, url, title, description, author):
        self.title = title
        self.author = author
        self.url = url
        self.description = description
        self.category = ""


# Helper class that wraps the MySQL connection
class Mysql(object):
    def __init__(self):
        self.connect = pymysql.Connect(
            host='192.168.1.109',   # MySQL host IP
            port=3306,              # port
            user='root',            # user name
            passwd='qwertyuiop',    # database password
            db='myspider',          # database name
            charset='utf8mb4',      # character set
        )
        self.cursor = self.connect.cursor()

    def add2cnblogs(self, item_list):
        for data in item_list:
            # Insert into the cnblogs table; note the %s placeholders filled in by execute()
            sql = ("insert into cnblogs (title, author, url, description) "
                   "values (%s,%s,%s,%s) "
                   "ON DUPLICATE KEY UPDATE title=%s, author=%s, url=%s, description=%s")
            # Parameterized query: eight %s placeholders need eight matching parameters
            self.cursor.execute(sql, [data.title, data.author, data.url, data.description,
                                      data.title, data.author, data.url, data.description])
        # Commit once after all rows have been processed
        self.connect.commit()

    def end(self):
        self.cursor.close()
        self.connect.close()
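The article does not show the table definition itself. Below is a minimal sketch of what the cnblogs table could look like, assuming only the unique index on url mentioned above (the column lengths and the helper file name create_table.py are my own choices, not from the original). It can be run once with the same connection parameters used above:
# create_table.py - one-off sketch for creating the cnblogs table (not part of the original article)
import pymysql

connect = pymysql.Connect(host='192.168.1.109', port=3306, user='root',
                          passwd='qwertyuiop', db='myspider', charset='utf8mb4')
cursor = connect.cursor()
cursor.execute("""
    CREATE TABLE IF NOT EXISTS cnblogs (
        id INT AUTO_INCREMENT PRIMARY KEY,
        title VARCHAR(255),
        author VARCHAR(255),
        url VARCHAR(500) NOT NULL,
        description TEXT,
        category VARCHAR(255) DEFAULT '',
        UNIQUE KEY uk_url (url(191))  -- unique index on url; the 191-character prefix stays within older MySQL index-length limits
    ) DEFAULT CHARSET=utf8mb4
""")
connect.commit()
cursor.close()
connect.close()
With the table in place, the spider can be run from the project root (the directory containing scrapy.cfg) with:
scrapy crawl cnblogs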
III. FAQ
1. What does Splash do?
Splash is a JavaScript rendering service: a lightweight web browser with an HTTP API. Pages that rely on JavaScript are loaded through this API.
2. Why install Docker?
Because on Windows, Splash runs inside Docker.