Python Redis for Beginners: Working with Redis from Python

Reposted

我心依旧 2023-08-15 09:36:42


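I. Using Redis from Python
Steps:

Step 1: install the redis package (the Python client)
pip install redis

Step 2: use Redis from Python
1. Import redis
import redis

2. Connect to Redis with the host address and port number
r = redis.StrictRedis(host='localhost', port=6379, db=0)

Note: the host and port arguments select the server and port to connect to. host defaults to localhost, port defaults to 6379, and db defaults to 0.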

[Figure: screenshot of the connection code]

3. Implement the logic (create, read, update, delete)

[Figure: screenshot of the create, read, update, and delete code]
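A minimal sketch of the create, read, update, and delete steps, assuming a Redis server on localhost:6379. The helper names `connect` and `crud_demo` are illustrative and not part of redis-py; the calls `set`, `get`, and `delete` are standard redis-py client methods, and the key and values are borrowed from the examples later in this article.

```python
def connect(host="localhost", port=6379, db=0):
    """Return a redis-py client (needs `pip install redis` and a running server)."""
    import redis  # imported lazily so crud_demo can be tried with any compatible client
    return redis.StrictRedis(host=host, port=port, db=db)


def crud_demo(r, key="user1"):
    """Run create/read/update/delete on one key and return what each step saw."""
    results = {}
    results["created"] = r.set(key, "juran-1")   # create: True on success
    results["read"] = r.get(key)                 # read: bytes by default
    r.set(key, "juran-2")                        # update: SET overwrites the old value
    results["updated"] = r.get(key)
    results["deleted"] = r.delete(key)           # delete: number of keys removed
    results["after_delete"] = r.get(key)         # None once the key is gone
    return results

# Usage, with a server running:
#     print(crud_demo(connect()))
```

Redis has no separate "update" command for strings; writing to an existing key with `set` replaces its value.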

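Converting bytes to strings: redis-py returns values as bytes by default (for example b'juran-1'), so decode them before using them as text.

[Figure: screenshot of the bytes-to-string conversion]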
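As a sketch of the conversion, the hypothetical helper `to_text` below decodes a single value, a list such as the one `mget` returns, or a dict such as the one `hgetall` returns. The UTF-8 encoding is an assumption about how the values were stored.

```python
def to_text(value, encoding="utf-8"):
    """Recursively turn bytes found in a redis-py reply into str."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    if isinstance(value, (list, tuple, set)):
        # Rebuild the same container type with decoded members.
        return type(value)(to_text(v, encoding) for v in value)
    if isinstance(value, dict):
        return {to_text(k, encoding): to_text(v, encoding) for k, v in value.items()}
    return value  # None, int, bool and str pass through unchanged


print(to_text(b"juran-1"))                 # juran-1
print(to_text([b"juran-2", None]))         # ['juran-2', None]
print(to_text({b"id": b"1"}))              # {'id': '1'}
```

Alternatively, passing `decode_responses=True` when creating the client makes redis-py return str values directly, so no manual decoding is needed.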

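Supplement:

Error:

[Figure: screenshot of the error message]

Starting Redis:

Go to the install path (the desktop in this case), find the redis folder, and open it.

[Figure: screenshot of the redis folder with two cmd windows open]

Run the script again (the two cmd windows in the figure above must stay open):

[Figure: screenshot of the successful run]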
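If the error is a connection failure (an assumption, since a refused connection is the usual symptom when the Redis server is not running), a quick check that something is listening on the port can save a debugging round. The sketch uses only the standard library; `is_port_open` is a hypothetical helper name.

```python
import socket


def is_port_open(host="localhost", port=6379, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, timed out, unreachable, unknown host
        return False

# Usage:
#     if not is_port_open():
#         print("Nothing is listening on localhost:6379; start the Redis server first.")
```

An open port only shows that some process is listening. Calling `r.ping()` on the redis-py client confirms that the process is actually Redis.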

II. Other operations (come back to these when you need them)
1. String operations

import redis

class TestString(object):
    def __init__(self):
        self.r = redis.StrictRedis(host='192.168.75.130', port=6379)

    # Set a value
    def test_set(self):
        res = self.r.set('user1', 'juran-1')
        print(res)

    # Get a value
    def test_get(self):
        res = self.r.get('user1')
        print(res)

    # Set several values at once
    def test_mset(self):
        d = {
            'user2': 'juran-2',
            'user3': 'juran-3'
        }
        res = self.r.mset(d)

    # Get several values at once
    def test_mget(self):
        l = ['user2', 'user3']
        res = self.r.mget(l)
        print(res)

    # Delete a key
    def test_del(self):
        self.r.delete('user2')
2. List operations

class TestList(object):
    def __init__(self):
        self.r = redis.StrictRedis(host='192.168.75.130', port=6379)

    # Insert elements (lpush adds at the head, rpush at the tail)
    def test_push(self):
        res = self.r.lpush('common', '1')
        res = self.r.rpush('common', '2')
        # res = self.r.rpush('jr','123')

    # Pop elements (lpop removes from the head, rpop from the tail)
    def test_pop(self):
        res = self.r.lpop('common')
        res = self.r.rpop('common')

    # Read a range (0 to -1 returns the whole list)
    def test_range(self):
        res = self.r.lrange('common', 0, -1)
        print(res)

3. Set operations

class TestSet(object):
    def __init__(self):
        self.r = redis.StrictRedis(host='192.168.75.130', port=6379)

    # Add members
    def test_sadd(self):
        res = self.r.sadd('set01', '1', '2')
        lis = ['Cat', 'Dog']
        # sadd takes members as separate arguments, so unpack the list
        res = self.r.sadd('set02', *lis)

    # Remove a member
    def test_del(self):
        res = self.r.srem('set01', 1)

    # Remove and return a random member
    def test_pop(self):
        res = self.r.spop('set02')

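4. Hash operations

class TestHash(object):
    def __init__(self):
        self.r = redis.StrictRedis(host='192.168.75.130', port=6379)

    # Set several fields at once
    def test_hset(self):
        dic = {
            'id': 1,
            'name': 'huawei'
        }
        # hmset is deprecated in recent redis-py versions; hset with mapping= replaces it
        res = self.r.hset('mobile', mapping=dic)

    # Get all fields and values
    def test_hgetall(self):
        res = self.r.hgetall('mobile')

    # Check whether a field exists (Redis replies 1 or 0; redis-py returns True or False)
    def test_hexists(self):
        res = self.r.hexists('mobile', 'id')
        print(res)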

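III. scrapy_redis (distributed crawling) overview
Learning goals:
(1) Understand the scrapy-redis workflow (a common interview topic)
(2) Be able to convert an ordinary Scrapy spider into a distributed one

1. What is scrapy_redis
Distributed: several people working together, each on a different task
Cluster: several people working together, all on the same task

Q: How do scrapy and scrapy-redis differ?
scrapy: a Python crawling framework. It crawls very efficiently and is highly customizable, but it does not support distributed crawling on its own.
scrapy-redis: a component built on the Redis database that runs on top of Scrapy. It lets Scrapy crawl in a distributed fashion and supports master-slave synchronization.

2. Advantages of a distributed crawler
(1) It makes full use of the bandwidth of multiple machines
(2) It makes full use of the different machines' IP addresses
(3) Multiple machines crawl faster than one

Questions that come with it:
(1) How do you make sure the same data is not collected twice?
(2) How do you make sure all the data ends up in one place?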
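One way to picture the answer to the first question: each request is reduced to a fingerprint, and the fingerprints live in one shared set that every machine consults. The sketch below is a simplified illustration, not scrapy-redis's actual code. The fingerprint here is a SHA-1 of the method and URL only (an assumption made for brevity), and a Python set stands in for the Redis set. `SeenFilter` is a hypothetical name.

```python
import hashlib


class SeenFilter:
    """Toy duplicate filter; `store` stands in for a Redis set shared by all workers."""

    def __init__(self, store=None):
        self.store = set() if store is None else store

    @staticmethod
    def fingerprint(method, url):
        # Simplified: only the upper-cased method and the raw URL are hashed.
        return hashlib.sha1(f"{method.upper()} {url}".encode("utf-8")).hexdigest()

    def seen(self, method, url):
        """Return True if this request was already recorded, otherwise record it."""
        fp = self.fingerprint(method, url)
        if fp in self.store:
            return True
        self.store.add(fp)
        return False


shared = set()  # one store shared by both "machines"
worker_a, worker_b = SeenFilter(shared), SeenFilter(shared)
print(worker_a.seen("GET", "http://example.com/page/1"))  # False: first time
print(worker_b.seen("get", "http://example.com/page/1"))  # True: worker A recorded it
```

With Redis as the store, the membership test and the insert collapse into a single SADD call, which returns 1 for a new member and 0 for an existing one, so the check stays atomic across machines.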

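IV. scrapy_redis workflow

1. Recap: the Scrapy workflow

[Figure: Scrapy workflow diagram]

2. The scrapy_redis workflow

[Figure: scrapy_redis workflow diagram]

In other words, the scheduler hands each URL to Redis, Redis hands it back to a scheduler, and the scheduler passes it on to the engine.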
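This round trip can be simulated in a few lines. The code is a sketch under stated assumptions, not scrapy-redis itself: a `collections.deque` stands in for the request queue kept in Redis, `fake_download` is a hypothetical function that pretends each page links to the next one, and two worker names take turns pulling from the same queue.

```python
from collections import deque


def fake_download(url):
    """Hypothetical stand-in for a download: page N links to page N+1, up to page 4."""
    n = int(url.rsplit("/", 1)[1])
    return [f"http://example.com/page/{n + 1}"] if n < 4 else []


def crawl(start_url, workers=("worker-a", "worker-b")):
    queue = deque([start_url])  # shared request queue (Redis in the real setup)
    seen = {start_url}          # shared record of URLs already queued
    log = []
    turn = 0
    while queue:
        worker = workers[turn % len(workers)]  # workers take turns popping
        url = queue.popleft()                  # scheduler pulls from the shared queue
        log.append((worker, url))              # engine and downloader handle the request
        for link in fake_download(url):
            if link not in seen:               # new requests go back to the shared queue
                seen.add(link)
                queue.append(link)
        turn += 1
    return log


for worker, url in crawl("http://example.com/page/1"):
    print(worker, url)
```

Because the queue lives outside any single worker, either worker can pick up a request that the other one discovered.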

V. How to build a distributed crawler (steps)

(I) A few notes:

1. Install: pip install scrapy-redis
2. The settings file in scrapy_redis (it differs considerably from a plain Scrapy settings file):

# Scrapy settings for example project
#
# For simplicity, this file contains only the most important settings by
# default. All the other settings are documented here:
#
#     http://doc.scrapy.org/topics/settings.html
#
SPIDER_MODULES = ['example.spiders']
NEWSPIDER_MODULE = 'example.spiders'

USER_AGENT = 'scrapy-redis (+https://github.com/rolando/scrapy-redis)'

DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"    # class used to filter duplicate request objects
SCHEDULER = "scrapy_redis.scheduler.Scheduler"    # use the scrapy_redis scheduler and its queue
SCHEDULER_PERSIST = True        # keep the queue contents in Redis; if False, they are cleared when the spider closes
#SCHEDULER_QUEUE_CLASS = "scrapy_redis.queue.SpiderPriorityQueue"
#SCHEDULER_QUEUE_CLASS = "scrapy_redis.queue.SpiderQueue"
#SCHEDULER_QUEUE_CLASS = "scrapy_redis.queue.SpiderStack"

ITEM_PIPELINES = {
    'example.pipelines.ExamplePipeline': 300,
    'scrapy_redis.pipelines.RedisPipeline': 400,    # scrapy_redis pipeline that saves items to Redis
}

LOG_LEVEL = 'DEBUG'

# Introduce an artificial delay to make use of parallelism to speed up the
# crawl.
DOWNLOAD_DELAY = 1

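3. After a run, Redis contains three new keys

dmoz:requests holds the request objects still waiting to be crawled

dmoz:items holds the scraped items

dmoz:dupefilter holds the fingerprints of requests that have already been seen

[Figure: screenshot of the three keys in Redis]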
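To look at these keys from Python, a sketch like the following may help. It assumes the spider is named dmoz and takes any redis-py client. Because the Redis data type behind the request queue depends on the configured queue class, the hypothetical helper `summarize_spider_keys` asks Redis for each key's type first and then picks the matching size command.

```python
def summarize_spider_keys(client, spider_name="dmoz"):
    """Return {key: (redis_type, size)} for the three keys a spider leaves behind."""
    summary = {}
    for suffix in ("requests", "items", "dupefilter"):
        key = f"{spider_name}:{suffix}"
        kind = client.type(key)          # TYPE command; bytes unless decode_responses=True
        if isinstance(kind, bytes):
            kind = kind.decode()
        if kind == "zset":
            size = client.zcard(key)
        elif kind == "list":
            size = client.llen(key)
        elif kind == "set":
            size = client.scard(key)
        else:
            size = 0                     # "none" means the key does not exist
        summary[key] = (kind, size)
    return summary

# Usage, with a server running:
#     import redis
#     print(summarize_spider_keys(redis.StrictRedis(host='localhost', port=6379)))
```

A key that reports the type "none" is simply absent, which is normal for the request queue once every pending request has been consumed.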

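(II) Steps for converting scrapy to scrapy_redis:
1. The scrapy steps:
Step 1: create the Scrapy project
Step 2: create the spider file
Step 3: write the crawling logic

2. The changes for scrapy_redis:

What changes is the spider file from step 2:
(1) the imported module
(2) the parent class
(3) replace start_urls with redis_key = 'spider file name' (the Redis key from which the spider reads its start URLs)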

And the following in the settings file:

# Duplicate filtering
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
# Scheduler queue
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
# Persist the queue data
SCHEDULER_PERSIST = True

ITEM_PIPELINES = {
    'example.pipelines.ExamplePipeline': 300,
    'scrapy_redis.pipelines.RedisPipeline': 400,
}
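Putting the three spider-file changes together, a converted spider might look like the sketch below. The class name `DmozSpider`, the key `dmoz:start_urls`, and the body of `parse` are assumptions for illustration; `RedisSpider` and `redis_key` come from scrapy-redis. The try/except fallback exists only so the sketch can be imported without the package and has no place in a real project.

```python
try:
    from scrapy_redis.spiders import RedisSpider   # (1) the imported module changes
except ImportError:
    # Fallback so this sketch can be imported without scrapy-redis installed.
    RedisSpider = object


class DmozSpider(RedisSpider):       # (2) the parent class is RedisSpider
    name = "dmoz"
    # (3) no start_urls: the spider waits for URLs pushed to this Redis key
    redis_key = "dmoz:start_urls"

    def parse(self, response):
        # Hypothetical extraction logic: yield the page URL and its <title> text.
        yield {
            "url": response.url,
            "title": response.css("title::text").get(),
        }
```

Each machine then runs `scrapy crawl dmoz`, and the crawl starts once a URL is pushed to the key, for example with `redis-cli lpush dmoz:start_urls http://example.com`.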
