Getting Started with Scrapy Crawlers
Original work by 51CTO blog author liuyunshengsir; contact the author for reprint authorization.
1. Install the Scrapy environment
Run conda install scrapy at a cmd prompt.
2. Create a project
scrapy startproject spider_name
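For orientation, scrapy startproject spider_name scaffolds a project layout roughly like the following (standard Scrapy template; the exact file list may vary slightly by version):

```text
spider_name/
    scrapy.cfg            # deploy configuration
    spider_name/          # the project's Python package
        __init__.py
        items.py          # item definitions
        middlewares.py    # spider / downloader middlewares
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/          # your spider modules go here
            __init__.py
```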
3. Generate a spider (a project can contain multiple spiders, but each spider's name must be unique; change into E:\spider_name\spider_name\spiders before generating)
scrapy genspider garlic http://www.51garlic.com/hq/list-139.html
4. List the spiders in the current project
scrapy list
5. Run a spider, exporting the scraped items to abc.csv
scrapy crawl garlic -o abc.csv
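The -o abc.csv flag makes Scrapy serialize each yielded item as a CSV row. The resulting file can be read back with Python's standard csv module. A minimal sketch, using hypothetical sample rows in place of the real abc.csv (the column names title, txt, link match the dict yielded by the spider below):

```python
import csv
import io

# Hypothetical sample mimicking two rows of the abc.csv produced by
# `scrapy crawl garlic -o abc.csv` (columns: title, txt, link).
sample = io.StringIO(
    "title,txt,link\n"
    "Garlic price update,Prices rose this week.,http://www.51garlic.com/hq/1.html\n"
    "Market notes,Supply remains steady.,http://www.51garlic.com/hq/2.html\n"
)

# DictReader maps each row to a dict keyed by the header line.
rows = list(csv.DictReader(sample))
print(len(rows))          # 2
print(rows[0]["title"])   # Garlic price update
```

To process the real export, replace the StringIO sample with open("abc.csv", newline="", encoding="utf-8").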
6. The spider code, garlic.py
# -*- coding: utf-8 -*-
import scrapy


class GarlicSpider(scrapy.Spider):
    name = "garlic"
    start_urls = [
        "http://www.51garlic.com/hq/list-139.html",
        "http://www.51garlic.com/hq/list-139-2.html",
    ]

    def parse(self, response):
        # Follow every article link found in the listing block.
        for href in response.css('.td-lm-list a::attr(href)').getall():
            # Relative hrefs are resolved against the page URL.
            full_url = response.urljoin(href)
            yield scrapy.Request(full_url, callback=self.parse_question)

    def parse_question(self, response):
        # .get() returns the first match (or None); no .encode('utf-8') here,
        # since yielding bytes would break the CSV feed export on Python 3.
        yield {
            'title': response.css('.td-timu').get(),
            'txt': response.css('.td-nei-content').get(),
            'link': response.url,
        }
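The response.urljoin(href) call in parse behaves like the standard library's urllib.parse.urljoin with the page URL as the base, which is why both relative and absolute-path hrefs resolve correctly. A quick sketch (show-1234.html is a hypothetical article path, not taken from the site):

```python
from urllib.parse import urljoin

# Base URL of the listing page the spider starts from.
base = "http://www.51garlic.com/hq/list-139.html"

# A relative href resolves against the base URL's directory.
print(urljoin(base, "show-1234.html"))      # http://www.51garlic.com/hq/show-1234.html

# An absolute-path href resolves against the site root.
print(urljoin(base, "/hq/show-1234.html"))  # http://www.51garlic.com/hq/show-1234.html
```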