scrapy不建议通过爬虫文件来发送请求下载大文件,而是通过scrapy已经封装好的管道类去执行,效率更高
管道类: from scrapy.pipelines.images import ImagesPipeline ,不仅仅能下载图片,也能下载音频、视频等其他二进制文件
我们定义一个类去继承自这个管道类,然后重写以下3个方法:
- get_media_requests(self, item, info):发送请求
- file_path(self, request, response=None, info=None, *, item=None):返回文件名
- item_completed(self, results, item, info):返回item,供后续管道继续处理
配置文件settings.py:
- 需要指定保存文件的文件夹IMAGES_STORE = 'dirName' ,文件夹如果没有事先创建的话,则会自动创建
爬取中国大学生校花网首页中的图片
url:http://www.xiaohuar.com/daxue/
- items.py
import scrapy
class BigfileItem(scrapy.Item):
name = scrapy.Field() #图片名
src = scrapy.Field() #图片地址
- xiaohua.py
import scrapy
from lxml import etree
from bigFile.items import BigfileItem
class XiaohuaSpider(scrapy.Spider):
name = 'xiaohua'
allowed_domains = ['www.xiaohuar.com']
start_urls = ['http://www.xiaohuar.com/daxue/']
def parse(self, response):
page_text = response.text
tree = etree.HTML(page_text)
divs = tree.xpath('//div[@class="card diy-box shadow mb-5"]')
for div in divs:
name = div.xpath('./a/img/@alt')[0]+'.jpg' #图片名
src = div.xpath('./a/img/@src')[0] #图片下载地址
item = BigfileItem()
item['name'] = name
item['src'] = src
yield item
- middlewares.py
from fake_useragent import UserAgent
class BigfileDownloaderMiddleware:
def process_request(self, request, spider):
request.headers['User-Agent'] = UserAgent(use_cache_server=False).random #给每个请求添上随机UA
return None
def process_response(self, request, response, spider):
return response
def process_exception(self, request, exception, spider):
pass
- pipelines.py
from scrapy.pipelines.images import ImagesPipeline
from scrapy import Request
class BigfilePipeline(ImagesPipeline):
# 根据图片地址发送请求
def get_media_requests(self, item, info):
yield Request(item['src'],meta={'item':item})
#只需要返回图片名
def file_path(self, request, response=None, info=None, *, item=None):
item = request.meta['item']
return item['name']
#返回item,供后续管道继续处理
def item_completed(self, results, item, info):
return item
- settings.py
BOT_NAME = 'bigFile'
SPIDER_MODULES = ['bigFile.spiders']
NEWSPIDER_MODULE = 'bigFile.spiders'
ROBOTSTXT_OBEY = False
IMAGES_STORE = './imgs' #下载好的图片,存储在imgs文件夹下
#启用下载中间件
DOWNLOADER_MIDDLEWARES = {
'bigFile.middlewares.BigfileDownloaderMiddleware': 543,
}
#启用管道
ITEM_PIPELINES = {
'bigFile.pipelines.BigfilePipeline': 300,
}
- 爬取效果演示