Getting Started with Scrapy Crawlers
Original work by 51CTO blog author liuyunshengsir; contact the author for reprint authorization.
1. Install the Scrapy environment
Run conda install scrapy at a cmd prompt.
2. Create a project
scrapy startproject spider_name
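For orientation, scrapy startproject spider_name scaffolds a project layout roughly like the following (standard Scrapy template; the exact file list may vary slightly by version):

```text
spider_name/
    scrapy.cfg            # deploy configuration
    spider_name/          # the project's Python package
        __init__.py
        items.py          # item definitions
        middlewares.py    # spider / downloader middlewares
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/          # your spider modules go here
            __init__.py
```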
3. Generate a spider (a project can contain multiple spiders, but each spider's name must be unique; change into E:\spider_name\spider_name\spiders before generating)
scrapy genspider garlic http://www.51garlic.com/hq/list-139.html
4. List the spiders in the current project
scrapy list
5. Run a spider, exporting the scraped items to abc.csv
scrapy crawl garlic -o abc.csv
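The -o abc.csv flag makes Scrapy serialize each yielded item as a CSV row. The resulting file can be read back with Python's standard csv module. A minimal sketch, using hypothetical sample rows in place of the real abc.csv (the column names title, txt, link match the dict yielded by the spider below):

```python
import csv
import io

# Hypothetical sample mimicking two rows of the abc.csv produced by
# `scrapy crawl garlic -o abc.csv` (columns: title, txt, link).
sample = io.StringIO(
    "title,txt,link\n"
    "Garlic price update,Prices rose this week.,http://www.51garlic.com/hq/1.html\n"
    "Market notes,Supply remains steady.,http://www.51garlic.com/hq/2.html\n"
)

# DictReader maps each row to a dict keyed by the header line.
rows = list(csv.DictReader(sample))
print(len(rows))          # 2
print(rows[0]["title"])   # Garlic price update
```

To process the real export, replace the StringIO sample with open("abc.csv", newline="", encoding="utf-8").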
6. The spider code, garlic.py
# -*- coding: utf-8 -*-
import scrapy


class GarlicSpider(scrapy.Spider):
    name = "garlic"
    start_urls = [
        "http://www.51garlic.com/hq/list-139.html",
        "http://www.51garlic.com/hq/list-139-2.html",
    ]

    def parse(self, response):
        # Follow every article link found in the listing block.
        for href in response.css('.td-lm-list a::attr(href)').getall():
            # Relative hrefs are resolved against the page URL.
            full_url = response.urljoin(href)
            yield scrapy.Request(full_url, callback=self.parse_question)

    def parse_question(self, response):
        # .get() returns the first match (or None); no .encode('utf-8') here,
        # since yielding bytes would break the CSV feed export on Python 3.
        yield {
            'title': response.css('.td-timu').get(),
            'txt': response.css('.td-nei-content').get(),
            'link': response.url,
        }
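The response.urljoin(href) call in parse behaves like the standard library's urllib.parse.urljoin with the page URL as the base, which is why both relative and absolute-path hrefs resolve correctly. A quick sketch (show-1234.html is a hypothetical article path, not taken from the site):

```python
from urllib.parse import urljoin

# Base URL of the listing page the spider starts from.
base = "http://www.51garlic.com/hq/list-139.html"

# A relative href resolves against the base URL's directory.
print(urljoin(base, "show-1234.html"))      # http://www.51garlic.com/hq/show-1234.html

# An absolute-path href resolves against the site root.
print(urljoin(base, "/hq/show-1234.html"))  # http://www.51garlic.com/hq/show-1234.html
```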