MongoDB MLPS (等保) Hardening

Reposted

mob64ca14173efa 2024-12-02 14:20:29

Tags: mongodb 等保加固, python, mongodb, crawler, ide. Category: MongoDB, Database

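Web Scraping: Saving Data Scraped with Scrapy to a MongoDB Database

Requirement: target every page of the Chinese and Western medicine category on 1药网 (111.com.cn) and scrape each product's unit price, name, and review count.

The previous post covered the basics of Scrapy and how to configure each of its files. Compared with the example in that post, the only difference here is how the data is handled in pipelines.py.

Create the spider file

scrapy genspider yiyaowang yiyaowang.com

In yiyaowang.py, write the callback function first and scrape a single page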

# -*- coding: utf-8 -*-
import scrapy


class YiyaowangSpider(scrapy.Spider):
    name = 'yiyaowang'
    # allowed_domains = ['yaowang.com']
    start_urls = ['https://www.111.com.cn/categories/953710']

    def parse(self, response):
        # Extract the list of products
        good_list = response.xpath('//ul[@id="itemSearchList"]/li')
        for good in good_list:
            # Unit price
            good_price = good.xpath('.//p[@class="price"]//span/text()').get().strip()

            # Title
            # good_title = good.xpath('.//p[@class="titleBox"]//a/text()').get()
            # Problem: this returned blank text instead of None.
            # Analysis: blank text rather than None means the XPath itself is not at fault;
            #   the first element of the returned list is probably an empty string.
            # Fix: fetch everything with getall(), then pick the element we need.
            good_title = good.xpath('.//p[@class="titleBox"]//a/text()').getall()[1].strip()

            # Review count
            good_comment = good.xpath('.//a[@id="pdlink3"]//em/text()').get()
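
The blank-string problem described in the comments can also be handled without relying on a fixed index such as `[1]`. The following is a minimal sketch, assuming its input is the list of strings that `getall()` returned; `first_non_blank` is a hypothetical helper name, not part of Scrapy, and the sample values are made up.

```python
from typing import Iterable, Optional


def first_non_blank(texts: Iterable[str]) -> Optional[str]:
    """Return the first entry that still has content after stripping, or None."""
    for text in texts:
        cleaned = text.strip()
        if cleaned:
            return cleaned
    return None


# Text nodes as getall() might return them (made-up values for illustration)
nodes = ["\n    ", "Example product title ", "\n"]
print(first_non_blank(nodes))
```

In the spider this could be used as `first_non_blank(good.xpath(...).getall())`, which would keep working if the position of the blank node changed.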

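Find the pattern in the URL of each page, then loop over all pages

Page 1: https://www.111.com.cn/categories/953710-j1.html
Page 2: https://www.111.com.cn/categories/953710-j2.html
...
Last page: https://www.111.com.cn/categories/953710-j50.html

Summary: there are 50 pages in total, the only part of the URL that changes is the number after j, and that number matches the page number.

Add the following to the existing code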
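
The pattern above can be sketched as a small function that produces every page URL. The URL template and the page count of 50 come from the observations above; `build_page_urls` and its parameters are hypothetical names used only for illustration.

```python
from typing import List

BASE_URL = "https://www.111.com.cn/categories/953710-j{}.html"


def build_page_urls(first_page: int = 1, last_page: int = 50) -> List[str]:
    """Return the listing URL of every page from first_page to last_page inclusive."""
    return [BASE_URL.format(page) for page in range(first_page, last_page + 1)]


urls = build_page_urls()
print(len(urls))
print(urls[0])
print(urls[-1])
```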

# -*- coding: utf-8 -*-
import scrapy


class YiyaowangSpider(scrapy.Spider):
    name = 'yiyaowang'
    # allowed_domains = ['yaowang.com']
    # -------------------------------------------------------------------------
    start_urls = []
    base_url = "https://www.111.com.cn/categories/953710-j{}.html"
    # Build the URL of every page
    for i in range(1, 51):
        start_urls.append(base_url.format(i))
    # -------------------------------------------------------------------------

    def parse(self, response):
        # Extract the list of products
        good_list = response.xpath('//ul[@id="itemSearchList"]/li')
        for good in good_list:
            # Unit price
            good_price = good.xpath('.//p[@class="price"]//span/text()').get().strip()

            # Title
            good_title = good.xpath('.//p[@class="titleBox"]//a/text()').getall()[1].strip()

            # Review count
            good_comment = good.xpath('.//a[@id="pdlink3"]//em/text()').get()

At this point the data has been scraped; the next step is to process it

Write the corresponding class in items.py

import scrapy


class YiYaoWang(scrapy.Item):
    # Title
    title = scrapy.Field()
    # Unit price
    price = scrapy.Field()
    # Review count
    comment = scrapy.Field()
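
The fields above hold whatever text the spider extracted. If numeric values are wanted later, a small cleaning step could run before the item is filled. The sketch below assumes the price text looks something like "¥ 12.50" and the review count like "1024"; the article does not show the real formats, so both are assumptions, and `parse_price` and `parse_count` are hypothetical helpers.

```python
import re
from typing import Optional

_NUMBER = re.compile(r"\d+(?:\.\d+)?")


def parse_price(raw: Optional[str]) -> Optional[float]:
    """Pull the first decimal number out of a price string, or return None."""
    if raw is None:
        return None
    match = _NUMBER.search(raw.replace(",", ""))
    return float(match.group()) if match else None


def parse_count(raw: Optional[str]) -> int:
    """Pull the first integer out of a count string; missing values count as 0."""
    if raw is None:
        return 0
    match = re.search(r"\d+", raw)
    return int(match.group()) if match else 0


print(parse_price("¥ 12.50"))
print(parse_count(None))
```

Returning `None` or `0` for missing values also avoids the `AttributeError` that calling `.strip()` on a `None` result from `get()` would raise.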

Put the data into an item so that the pipeline can use it

import scrapy

# Import the item class defined in items.py
# ------------------------------------------------------------
from ..items import YiYaoWang
# ------------------------------------------------------------


class YiyaowangSpider(scrapy.Spider):
    name = 'yiyaowang'
    # allowed_domains = ['yiyaowang.com']
    start_urls = []
    base_url = "https://www.111.com.cn/categories/953710-j{}.html"
    # Build the URL of every page
    for i in range(1, 51):
        start_urls.append(base_url.format(i))

    def parse(self, response):
        """Extract product data from a listing page."""
        good_list = response.xpath('//ul[@id="itemSearchList"]/li')

        for good in good_list:
            # Instantiate a fresh item for each product so yielded items do not share state
            # ----------------------------------------------------------
            item = YiYaoWang()
            # ----------------------------------------------------------

            # Unit price
            price = good.xpath('.//p[@class="price"]//span/text()').get().strip()

            # Title
            # good_title = good.xpath('.//p[@class="titleBox"]//a/text()').get()
            # Problem: this returned blank text instead of None.
            # Analysis: blank text rather than None means the XPath itself is not at fault;
            #   the first element of the returned list is probably an empty string.
            # Fix: fetch everything with getall(), then pick the element we need.
            title = good.xpath('.//p[@class="titleBox"]//a/text()').getall()[1].strip()

            # Review count
            comment = good.xpath('.//a[@id="pdlink3"]//em/text()').get()

            # ---------------------------------------------------------------
            # Fill the item
            item["title"] = title
            item["price"] = price
            item["comment"] = comment
            # ---------------------------------------------------------------

            yield item
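
One detail worth checking in a loop like this is whether each yielded item is a separate object. The sketch below uses plain dicts rather than Scrapy items to show the difference between reusing one object and creating one per row; the function names are hypothetical, and whether sharing causes visible problems in practice depends on how the items are consumed.

```python
def yield_shared(rows):
    item = {}  # one dict reused for every row
    for row in rows:
        item["title"] = row
        yield item


def yield_fresh(rows):
    for row in rows:
        item = {"title": row}  # a new dict for each row
        yield item


rows = ["a", "b", "c"]
shared = list(yield_shared(rows))
fresh = list(yield_fresh(rows))

# Every entry in `shared` is the same dict, so all of them show the last row
print([entry["title"] for entry in shared])
print([entry["title"] for entry in fresh])
```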

Write the class that saves the data in the pipelines.py pipeline file

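import pymongo


class YiYaoWangPipeline:
    def open_spider(self, spider):
        # Create the connection
        self.client = pymongo.MongoClient(host="127.0.0.1", port=27017)
        # Select the database
        self.db = self.client["first_text"]
        # Select the collection
        self.col = self.db["yiyaowang"]

    def process_item(self, item, spider):
        # Insert one document; the keys are 标题 (title), 单价 (unit price), 评论数 (review count).
        # insert_one() is used because Collection.insert() was removed in PyMongo 4.
        self.col.insert_one({"标题": item["title"], "单价": item["price"], "评论数": item["comment"]})
        return item

    def close_spider(self, spider):
        self.client.close()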
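
A pipeline like this needs a running MongoDB server to be tried out. As a minimal sketch of one way around that, the variant below lets a collection object be passed in, so the insert logic can be exercised with a stand-in. `MongoSavePipeline`, `FakeCollection`, and the constructor parameters are hypothetical; the connection defaults mirror the host, port, database, and collection used above, and this variant stores the item's own English field names rather than the Chinese keys.

```python
class MongoSavePipeline:
    """Sketch of a pipeline whose collection can be injected (hypothetical class)."""

    def __init__(self, collection=None, uri="mongodb://127.0.0.1:27017",
                 db_name="first_text", col_name="yiyaowang"):
        self.col = collection
        self.uri = uri
        self.db_name = db_name
        self.col_name = col_name
        self.client = None

    def open_spider(self, spider):
        # Connect only when no collection was injected
        if self.col is None:
            import pymongo  # imported lazily so the sketch runs without a server
            self.client = pymongo.MongoClient(self.uri)
            self.col = self.client[self.db_name][self.col_name]

    def process_item(self, item, spider):
        # Insert a copy: insert_one() adds an _id key to the dict it is given
        self.col.insert_one(dict(item))
        return item

    def close_spider(self, spider):
        if self.client is not None:
            self.client.close()


class FakeCollection:
    """Stand-in that records documents instead of writing to MongoDB."""

    def __init__(self):
        self.docs = []

    def insert_one(self, doc):
        self.docs.append(doc)


fake = FakeCollection()
pipeline = MongoSavePipeline(collection=fake)
pipeline.open_spider(spider=None)
pipeline.process_item({"title": "sample", "price": "1.00", "comment": "3"}, spider=None)
pipeline.close_spider(spider=None)
print(fake.docs)
```

Scrapy creates pipelines without arguments, so the defaults would apply when the class is listed in `ITEM_PIPELINES`.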

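Add the finished pipeline to the settings.py configuration file

The pipeline entries of earlier spiders must be commented out. Otherwise those pipelines also run for the current spider, and an error is raised when the fields they try to save differ from the ones in this item.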

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    # 'reptile.pipelines.ReptilePipeline': 300,
    # 'reptile.pipelines.HuPuPipeline': 300,
    'reptile.pipelines.YiYaoWangPipeline': 300,
}
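
An alternative to commenting out older pipelines is to have each pipeline check which spider an item came from and pass everything else through. The sketch below illustrates that idea with stand-in objects; `SpiderScopedPipeline` and `FakeSpider` are hypothetical names, and the `saved` list stands in for a database write.

```python
class SpiderScopedPipeline:
    """Only handle items that come from one named spider (hypothetical helper)."""

    target_spider = "yiyaowang"

    def __init__(self):
        self.saved = []  # stands in for a database write in this sketch

    def process_item(self, item, spider):
        if spider.name != self.target_spider:
            return item  # pass items from other spiders through untouched
        self.saved.append(dict(item))
        return item


class FakeSpider:
    """Minimal stand-in exposing the name attribute that Scrapy spiders have."""

    def __init__(self, name):
        self.name = name


pipeline = SpiderScopedPipeline()
pipeline.process_item({"title": "x"}, FakeSpider("hupu"))
pipeline.process_item({"title": "y"}, FakeSpider("yiyaowang"))
print(pipeline.saved)
```

With a check like this, several pipelines can stay enabled in `ITEM_PIPELINES` at once, at the cost of every pipeline seeing every item.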

Run the spider

scrapy crawl yiyaowang

Open a command window and check whether the data was saved to the database

MongoDB Enterprise > show tables
yiyaowang

