Web scraping means programmatically extracting information from web pages. Scrapy is one of the most popular Python scraping frameworks.
1. Install Scrapy
pip install scrapy
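To verify the installation, you can print the installed version from Python (a quick sanity check only, not part of the original steps):
import scrapy
print(scrapy.__version__)  # prints the installed Scrapy version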
2. Create the project skeleton and generate a spider class
2.1 Create the project skeleton: scrapy startproject
# The project name is helloworld
scrapy startproject helloworld
2.2 Generate a spider class template: scrapy genspider
# Change into the spiders directory
cd spiders
# scrapy genspider <spider name (name)> <start URL (start_urls)>
# Note: double-check the generated start_urls; it sometimes ends up with an extra http prefix, just correct it if so
scrapy genspider MyBlogSpider https://blog.csdn.net/vbirdbest
3. Parse the response
import scrapy

class MyblogspiderSpider(scrapy.Spider):
    name = 'MyBlogSpider'
    allowed_domains = ['blog.csdn.net']
    start_urls = ['https://blog.csdn.net/vbirdbest']

    def parse(self, response):
        title_list = response.xpath('//h4/a/text()').extract()
        for i in title_list:
            print(i)
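Note: extract() still works, but recent Scrapy versions document getall() (all matches) and get() (first match) as the preferred equivalents. A minimal variant of the same parse method, assuming the same //h4/a selector:

    def parse(self, response):
        # getall() is the modern equivalent of extract()
        for title in response.xpath('//h4/a/text()').getall():
            print(title)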
4. Run the scrapy crawl command to start crawling
# Change to the project root directory
cd helloworld
# Start the spider
scrapy crawl MyBlogSpider
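Besides the command line, a spider can also be started from a plain Python script with CrawlerProcess. A sketch, assuming the generated spider module is helloworld/spiders/MyBlogSpider.py and the script sits next to scrapy.cfg:

# run_spider.py
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from helloworld.spiders.MyBlogSpider import MyblogspiderSpider

process = CrawlerProcess(get_project_settings())  # loads settings.py
process.crawl(MyblogspiderSpider)
process.start()  # blocks until the crawl finishes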
- Scrapy directory structure
  - items.py: the data structure; once the spider has extracted data, the item is the carrier passed into the pipeline file (pipelines.py)
  - pipelines.py: data pipelines that clean and persist the data carried in the item classes
  - middlewares.py: middleware
  - settings.py: configuration, e.g. the download delay, which pipeline classes are enabled, and which custom middlewares are enabled and in what order
  - spiders: spider classes live here; each spider must subclass scrapy.Spider
  - scrapy.cfg: basic Scrapy configuration
MyBlogSpider.py
import scrapy
from scrapy import Request
from helloworld.items import HelloworldItem

class MyblogspiderSpider(scrapy.Spider):
    name = 'MyBlogSpider'
    allowed_domains = ['blog.csdn.net']
    start_urls = ['https://blog.csdn.net/vbirdbest']

    def start_requests(self):
        print("MyBlogSpider.py start_requests--------------")
        yield Request(self.start_urls[0])

    def parse(self, response):
        print('MyBlogSpider.py parse -------------------')
        title_list = response.xpath('//h4/a/text()').extract()
        # pipelines.py is only invoked for items that are yielded here
        for title in title_list:
            title = title.strip()
            title = title.replace('\n', '')
            if len(title) > 0:
                item = HelloworldItem()
                item['name'] = title
                yield item
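If you also want to follow each title's link to its detail page, response.follow can be chained from parse. A hedged sketch: parse_detail is an illustrative callback, not part of the original project, and it assumes each <h4><a> node carries an href.

    def parse(self, response):
        for a in response.xpath('//h4/a'):
            title = a.xpath('text()').get(default='').strip()
            if title:
                item = HelloworldItem()
                item['name'] = title
                yield item
            # response.follow accepts the <a> selector and resolves its href
            yield response.follow(a, callback=self.parse_detail)

    def parse_detail(self, response):
        print('detail page:', response.url)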
items.py
import scrapy

class HelloworldItem(scrapy.Item):
    # define the fields for your item here like:
    name = scrapy.Field()
    print(f'items.py name={name}---------------------------')
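An item behaves like a dict restricted to its declared fields, which is why the spider assigns item['name']. A small standalone illustration (assumes it is run inside the project so the import resolves):

from helloworld.items import HelloworldItem

item = HelloworldItem(name='hello scrapy')
print(item['name'])    # hello scrapy
# item['title'] = 'x'  # would raise KeyError: 'title' is not a declared field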
middlewares.py
class HelloworldDownloaderMiddleware(object):
    def process_response(self, request, response, spider):
        print('middlewares.py process_response---------------------------')
        return response
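The same middleware class can also hook process_request to modify outgoing requests, e.g. to set a header. A minimal sketch to add to the class above; the User-Agent value is only a placeholder:

    def process_request(self, request, spider):
        # returning None tells Scrapy to keep processing the request normally
        request.headers['User-Agent'] = 'Mozilla/5.0 (demo)'
        return None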
pipelines.py
class HelloworldPipeline(object):
    def process_item(self, item, spider):
        print(f'pipelines.py process_item item={item}----------------------')
        return item
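A more realistic pipeline uses the open_spider/close_spider hooks to persist each item, for example as JSON lines. A sketch; the titles.jl output path is an assumption:

import json

class JsonLinesPipeline(object):
    def open_spider(self, spider):
        self.file = open('titles.jl', 'w', encoding='utf-8')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        return item

To actually run, it would also need an entry in ITEM_PIPELINES, e.g. 'helloworld.pipelines.JsonLinesPipeline': 400.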
settings.py
Enable the DOWNLOADER_MIDDLEWARES and ITEM_PIPELINES settings (uncomment them):
DOWNLOADER_MIDDLEWARES = {
    'helloworld.middlewares.HelloworldDownloaderMiddleware': 543,
}
ITEM_PIPELINES = {
    'helloworld.pipelines.HelloworldPipeline': 300,
}
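The download delay mentioned in the directory overview is configured here as well. A few commonly used settings; the values are illustrative, not recommendations:

DOWNLOAD_DELAY = 1        # wait 1 second between requests to the same site
ROBOTSTXT_OBEY = True     # respect robots.txt
CONCURRENT_REQUESTS = 8   # cap the number of parallel requests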
Call order:
- MyBlogSpider.py start_requests: issues the initial requests
- middlewares.py process_response: processes the downloaded response
- MyBlogSpider.py parse: parses the response
- pipelines.py process_item: called multiple times, once per yielded item