Crawling a website's content with the Scrapy framework

Open a terminal on the desktop and enter:

scrapy startproject bitNews
cd bitNews/bitNews

Edit the items file: run vim items.py, press i to enter insert mode, and change the code to:

# -*- coding: utf-8 -*-
import scrapy


class BitnewsItem(scrapy.Item):
    # The headline text of one news entry
    name = scrapy.Field()

Press Esc, then Shift+ZZ to save and exit. In the terminal, enter:

scrapy genspider bitnews "www.bit.edu.cn"
cd spiders
vim bitnews.py

Change the code to the following:

# -*- coding: utf-8 -*-
import scrapy
from bitNews.items import BitnewsItem


class BitnewsSpider(scrapy.Spider):
    name = 'bitnews'
    allowed_domains = ['www.bit.edu.cn']
    start_urls = ['http://www.bit.edu.cn/xww/jdgz/index.htm']

    def parse(self, response):
        items = []
        # The news list sits inside <div class="new_con">
        div = response.xpath("//div[@class='new_con']")
        for each in div.xpath("ul/li"):
            item = BitnewsItem()
            # extract() returns a list of strings, one per matching text node
            item['name'] = each.xpath('a/text()').extract()
            items.append(item)
        return items
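
To see what the two XPath steps in parse select, here is a minimal sketch that runs offline. It assumes a hypothetical HTML fragment (the real page's markup may differ) and uses the standard library's xml.etree.ElementTree as a stand-in for Scrapy's selectors, so it only approximates what response.xpath does.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment for illustration only; the real page may be structured differently.
SAMPLE = """
<html><body>
  <div class="new_con">
    <ul>
      <li><a href="a.htm">First headline</a></li>
      <li><a href="b.htm">Second headline</a></li>
    </ul>
  </div>
  <div class="other">
    <ul><li><a href="c.htm">Not a news entry</a></li></ul>
  </div>
</body></html>
"""


def extract_names(markup):
    """Collect link texts from each <li> under <div class="new_con">."""
    root = ET.fromstring(markup.strip())
    items = []
    # Same two steps as the spider: find the container div, then walk ul/li.
    for div in root.findall(".//div[@class='new_con']"):
        for li in div.findall("ul/li"):
            # Mirror extract(): a list of strings, one per direct <a> child.
            items.append({"name": [a.text for a in li.findall("a") if a.text]})
    return items


names = extract_names(SAMPLE)
print(names)
```

Each entry keeps its text in a list because extract() in the spider also returns a list rather than a single string.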

After saving and exiting, go back up one directory in the terminal: cd ..

Edit settings.py: vim settings.py

Find ROBOTSTXT_OBEY and change its value to False, then add the FEED_EXPORT_ENCODING setting as follows:

ROBOTSTXT_OBEY=False
FEED_EXPORT_ENCODING = "UTF-8"
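
The encoding setting matters because, when no export encoding is set, Scrapy's JSON output writes non-ASCII characters as \uXXXX escapes, which makes Chinese headlines hard to read in the file. The sketch below illustrates the same difference with the standard library's json module. It is an analogy, not Scrapy's own exporter code, and the sample record is hypothetical.

```python
import json

# Hypothetical record shaped like one scraped item.
record = {"name": ["北京理工大学新闻"]}

# ASCII-safe output: every non-ASCII character becomes a \uXXXX escape.
escaped = json.dumps(record)

# Readable output: characters are kept as-is, to be stored as UTF-8 text.
readable = json.dumps(record, ensure_ascii=False)

print(escaped)
print(readable)
```

Both strings decode back to the same data; only the readability of the file differs.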

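After saving and exiting, run the spider from the terminal. The -o option writes the scraped items to news.json: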

scrapy crawl bitnews -o news.json
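
Once the crawl finishes, the results can be read back with a few lines of Python. This is a minimal sketch, assuming news.json contains a JSON array of objects whose "name" field is a list of strings, as the spider above produces. The helper name load_titles is hypothetical.

```python
import json
from pathlib import Path


def load_titles(path):
    """Return the non-empty headline strings stored in a JSON feed file."""
    records = json.loads(Path(path).read_text(encoding="utf-8"))
    titles = []
    for record in records:
        # Each item stores a list of strings under "name".
        for text in record.get("name", []):
            text = text.strip()
            if text:
                titles.append(text)
    return titles


feed = Path("news.json")
if feed.exists():  # Only print when a crawl has already produced the file.
    for title in load_titles(feed):
        print(title)
```

Stripping whitespace and skipping empty lists guards against list entries that contain no link text.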

Author: 哥们要飞