Web scraping (crawling) means extracting information from web pages. Scrapy is one of the most popular Python crawler frameworks.

1. Install Scrapy

pip install scrapy
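
Once the install finishes, you can confirm it worked by printing the version:

# Print the installed Scrapy version
scrapy version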

2. Create the project structure and generate a spider class

2.1 Create the project skeleton with scrapy startproject

# The project name is helloworld
scrapy startproject helloworld

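The command generates a project skeleton; in recent Scrapy versions it looks roughly like this (exact contents may vary slightly by version):

helloworld/
    scrapy.cfg            # basic Scrapy configuration
    helloworld/
        __init__.py
        items.py          # item (data structure) definitions
        middlewares.py    # middlewares
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/          # spider classes live here
            __init__.py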


2.2 Generate a spider class template with scrapy genspider

# Change into the spiders directory
cd spiders

# Usage: scrapy genspider <spider name> <start URL>
# Check that the generated start_urls is correct; it sometimes ends up with an extra http prefix, just fix it if so
scrapy genspider MyBlogSpider https://blog.csdn.net/vbirdbest

3. Parse the response

import scrapy


class MyblogspiderSpider(scrapy.Spider):
    name = 'MyBlogSpider'
    allowed_domains = ['blog.csdn.net']
    start_urls = ['https://blog.csdn.net/vbirdbest']

    def parse(self, response):
        # Extract the text of each <a> inside an <h4> (the blog post titles)
        title_list = response.xpath('//h4/a/text()').extract()
        for title in title_list:
            print(title)
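
Before hard-coding an XPath in the spider, it is handy to try it in Scrapy's interactive shell first:

# Download the page and drop into an interactive shell with `response` already available
scrapy shell "https://blog.csdn.net/vbirdbest"

# Inside the shell, test the same selector used above
>>> response.xpath('//h4/a/text()').extract()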

4. Run the scrapy command to start crawling

# Switch to the project root directory
cd helloworld

# Start the crawl
scrapy crawl MyBlogSpider
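
Alternatively, the crawl can be started from a plain Python script instead of the CLI. A minimal sketch using Scrapy's CrawlerProcess; the import path assumes the spider module generated above is helloworld/spiders/MyBlogSpider.py, as in this project:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# Assumed module path, matching the file generated by scrapy genspider above
from helloworld.spiders.MyBlogSpider import MyblogspiderSpider

# Load the project's settings.py and run the spider
process = CrawlerProcess(get_project_settings())
process.crawl(MyblogspiderSpider)
process.start()  # blocks until the crawl finishes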

5. Scrapy directory structure

  • items.py: data structures; the carrier that delivers data scraped by the spider into the pipeline file (pipelines.py)
  • pipelines.py: data pipelines; clean the data carried in by the item classes and store it (e.g. in a database)
  • middlewares.py: middlewares
  • settings.py: configuration file, e.g. the download delay, which pipeline classes are enabled, and which custom middlewares are enabled and in what order
  • spiders: spider classes go here; each must inherit from scrapy.Spider
  • scrapy.cfg: basic Scrapy configuration

MyBlogSpider.py

import scrapy
from scrapy import Request

from helloworld.items import HelloworldItem


class MyblogspiderSpider(scrapy.Spider):
    name = 'MyBlogSpider'
    allowed_domains = ['blog.csdn.net']
    start_urls = ['https://blog.csdn.net/vbirdbest']

    def start_requests(self):
        print("MyBlogSpider.py   start_requests--------------")
        yield Request(self.start_urls[0])

    def parse(self, response):
        print('MyBlogSpider.py    parse -------------------')
        title_list = response.xpath('//h4/a/text()').extract()

        # pipelines.py (process_item) runs only when the spider yields an item
        for title in title_list:
            title = title.strip()
            title = title.replace('\n', '')
            if len(title) > 0:
                item = HelloworldItem()
                item['name'] = title
                yield item

items.py

import scrapy


class HelloworldItem(scrapy.Item):
    # define the fields for your item here like:

    name = scrapy.Field()
    # Note: this print runs once, at class-definition time (when items.py is imported), not per item
    print(f'items.py    name={name}---------------------------')

middlewares.py

class HelloworldDownloaderMiddleware(object):
    
    def process_response(self, request, response, spider):
        # Called for each response on its way from the downloader back to the spider
        print('middlewares.py   process_response---------------------------')
        return response
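
If you also need to inspect or modify outgoing requests (for example, to set a request header), the same middleware class can implement process_request as well. A minimal sketch; the header value is only illustrative and not part of the original example:

    def process_request(self, request, spider):
        # Called for every request before it is handed to the downloader
        # (illustrative: set a browser-like User-Agent header)
        request.headers['User-Agent'] = 'Mozilla/5.0'
        # Returning None lets the request continue through the remaining middlewares
        return None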

pipelines.py

class HelloworldPipeline(object):
    def process_item(self, item, spider):
        print(f'pipelines.py    process_item    item={item}----------------------')
        return item
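
In practice a pipeline usually stores the data rather than printing it. As a hedged sketch (the output file name is an assumption, not from the original project), a second pipeline class could append each item to a JSON Lines file via the open_spider/close_spider hooks:

import json


class JsonLinesWriterPipeline(object):
    # Hypothetical pipeline, shown only to illustrate the open_spider/close_spider/process_item hooks

    def open_spider(self, spider):
        # Called once when the spider starts: open the output file
        self.file = open('titles.jl', 'w', encoding='utf-8')

    def close_spider(self, spider):
        # Called once when the spider finishes: close the file
        self.file.close()

    def process_item(self, item, spider):
        # Write one JSON object per line, then pass the item on to the next pipeline
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        return item

A class like this would also have to be registered in ITEM_PIPELINES (next step) before Scrapy calls it.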

settings.py
Uncomment the DOWNLOADER_MIDDLEWARES and ITEM_PIPELINES settings to enable the middleware and pipeline above.

DOWNLOADER_MIDDLEWARES = {
    'helloworld.middlewares.HelloworldDownloaderMiddleware': 543,
}

ITEM_PIPELINES = {
    'helloworld.pipelines.HelloworldPipeline': 300,
}
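
Other commonly tuned options, such as the download delay mentioned in the directory overview, live in the same settings.py (the value here is only illustrative):

# Wait 1 second between consecutive requests (illustrative value)
DOWNLOAD_DELAY = 1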

Call order:

  1. MyBlogSpider.py start_requests: issues the initial requests
  2. middlewares.py process_response: handles the downloaded response
  3. MyBlogSpider.py parse: parses the response
  4. pipelines.py process_item: called multiple times, once per yielded item