Key configuration for scrapy-redis

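Original

东方佑 2021-04-22 21:38:26 ©Copyright

©Copyright belongs to the author: an original work by 51CTO blog author 东方佑. Contact the author for permission to repost; otherwise legal liability will be pursued.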

Configuration in settings

DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
SCHEDULER_PERSIST = True

REDIS_HOST = '192.168.72.137'
REDIS_PORT = 6379
REDIS_PASSWORD = ''
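
The DUPEFILTER_CLASS setting moves request deduplication into Redis, so every spider process shares one record of what has already been requested. The following is only a minimal sketch of that idea, not the scrapy-redis implementation: SharedSeenSet is a hypothetical in-memory stand-in for a Redis set, and fingerprint is a simplified hash of method and URL rather than Scrapy's real request fingerprint.

```python
import hashlib


class SharedSeenSet:
    """Hypothetical in-memory stand-in for a Redis set of fingerprints."""

    def __init__(self):
        self._members = set()

    def add(self, value):
        # Mimic a set-add that reports 1 for a new member and 0 for a duplicate.
        if value in self._members:
            return 0
        self._members.add(value)
        return 1


def fingerprint(method, url):
    """Simplified fingerprint (an assumption for this sketch): SHA-1 of method and URL."""
    return hashlib.sha1(f"{method.upper()} {url}".encode("utf-8")).hexdigest()


def request_seen(seen, method, url):
    """Return True if an equivalent request has already been recorded."""
    return seen.add(fingerprint(method, url)) == 0


if __name__ == "__main__":
    seen = SharedSeenSet()
    url = "http://politics.people.com.cn/GB/1024/index1.html"
    print(request_seen(seen, "GET", url))  # False: first time this URL is seen
    print(request_seen(seen, "GET", url))  # True: duplicate, would be dropped
```

With SCHEDULER_PERSIST = True the shared record outlives a single run, which is why a restarted crawl does not request the same pages again.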

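Configuration in the spider
...
...
...
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule
from scrapy_redis.spiders import RedisCrawlSpider


class MypeopleSpider(RedisCrawlSpider):
    name = 'mypeople'
    allowed_domains = ['people.com.cn']
    # start_urls = ['http://politics.people.com.cn/GB/1024/index1.html']
    # Start URLs are read from this Redis key instead of start_urls.
    redis_key = "mypeople:start_url"

    # Follow paginated index pages and pass each response to get_parse.
    rules = (Rule(LinkExtractor(allow=r"index(\d+)\.html"), callback="get_parse", follow=True),)

    def get_parse(self, response):
        pass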
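
Because start_urls is commented out, the spider starts idle and waits for URLs to arrive under its redis_key. One possible way to seed it is sketched below. The helper name seed_start_url is hypothetical, and the client argument is assumed to be any object with an lpush(key, value) method, such as a client from the third-party redis-py package.

```python
def seed_start_url(client, url, redis_key="mypeople:start_url"):
    """Push one start URL onto the Redis list that the spider is polling.

    `client` is assumed to expose lpush(key, value), as redis-py clients do.
    Returns whatever lpush returns (for Redis, the list length after the push).
    """
    return client.lpush(redis_key, url)


# Usage against a real server (assumes the third-party redis-py package):
#   import redis
#   client = redis.Redis(host="192.168.72.137", port=6379)
#   seed_start_url(client, "http://politics.people.com.cn/GB/1024/index1.html")
```

The same thing can be done by hand from redis-cli with lpush mypeople:start_url followed by the URL.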

Explanation of the scrapy-redis configuration

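SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Redis connection
REDIS_HOST = '192.168.72.137'
REDIS_PORT = 6379
REDIS_PASSWORD = ''

# Queue class. Optional: omit it to use the default, or pick one of these three.
# SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.PriorityQueue'
# SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.FifoQueue'
SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.LifoQueue'

# Persistence. When True, the fingerprint set is not cleared after the crawl;
# otherwise it is cleared automatically. Default is False.
SCHEDULER_PERSIST = True

# Re-crawl setting, default False. With persistence set to True, a crawl that is
# interrupted and restarted keeps both the request queue and the fingerprint set.
# When this is True, both are flushed at startup, so the crawl cannot resume.
# SCHEDULER_FLUSH_ON_START = True

# Item pipeline, disabled by default. When enabled, items are stored in Redis.
# ITEM_PIPELINES = {
#     'scrapy_redis.pipelines.RedisPipeline': 300
# }

Database storage configuration: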

MONGO_URI = 'mongodb://admin:admin123@127.0.0.1:27017'
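
As a rough illustration of the three SCHEDULER_QUEUE_CLASS options in the configuration above, the sketch below shows the order in which each kind of queue hands requests back. It uses in-memory structures only and is not the scrapy-redis code; the function names are hypothetical, and breaking priority ties by insertion order is a simplification made for this sketch.

```python
import heapq
from collections import deque


def fifo_order(requests):
    """First in, first out: pop from the opposite end to the push."""
    queue = deque()
    for req in requests:
        queue.appendleft(req)
    return [queue.pop() for _ in range(len(queue))]


def lifo_order(requests):
    """Last in, first out: pop from the same end as the push."""
    queue = deque()
    for req in requests:
        queue.appendleft(req)
    return [queue.popleft() for _ in range(len(queue))]


def priority_order(requests):
    """Highest priority first; each request is a (url, priority) pair."""
    heap = []
    for index, (url, priority) in enumerate(requests):
        # Negate the priority so the min-heap yields the largest value first;
        # the index breaks ties in insertion order (a simplification).
        heapq.heappush(heap, (-priority, index, url))
    return [heapq.heappop(heap)[2] for _ in range(len(heap))]


if __name__ == "__main__":
    print(fifo_order(["a", "b", "c"]))  # ['a', 'b', 'c']
    print(lifo_order(["a", "b", "c"]))  # ['c', 'b', 'a']
    print(priority_order([("a", 0), ("b", 10), ("c", 5)]))  # ['b', 'c', 'a']
```

In crawl terms, FIFO order tends toward breadth-first traversal and LIFO order toward depth-first traversal, while the priority queue follows each request's priority value.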

The source code is the best reference.
Source code:
https://github.com/rmax/scrapy-redis/tree/master/src/scrapy_redis