1. Create the Scrapy project zhongjj, cd into the project, and generate the spider file zhongjjpc
scrapy startproject zhongjj
cd zhongjj
scrapy genspider zhongjjpc www.xxx.com
2. Edit the settings file
ROBOTSTXT_OBEY = False
LOG_LEVEL = 'ERROR'
3. Add three target URLs to start_urls; the last one is an invalid URL
start_urls = ["https://www.baidu.com/","https://www.sina.com.cn/","https://wwwwww.sohu.com/"]
4. Edit the middleware file
1) Delete the spider middleware class ZhongjjSpiderMiddleware
2) Override the hooks that intercept requests, responses, and exceptions
    def process_request(self, request, spider):
        print(request.url + " -> process_request")
        return None

    def process_response(self, request, response, spider):
        print(request.url + " -> process_response")
        return response

    def process_exception(self, request, exception, spider):
        print(request.url + " -> process_exception")
        return None
3) Enable the middleware in the settings file
DOWNLOADER_MIDDLEWARES = {
"zhongjj.middlewares.ZhongjjDownloaderMiddleware": 543,
}
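Putting the edits above together, the downloader middleware in zhongjj/middlewares.py might look like the sketch below (a plain class is enough; Scrapy does not require a base class for downloader middlewares):

```python
class ZhongjjDownloaderMiddleware:
    """Downloader middleware that prints a tag from each hook so the call
    order is visible in the console."""

    def process_request(self, request, spider):
        # Called for every request before it is sent to the downloader.
        print(request.url + " -> process_request")
        return None  # None means: continue handling the request normally

    def process_response(self, request, response, spider):
        # Called with every response returned by the downloader.
        print(request.url + " -> process_response")
        return response  # must return a Response (or a new Request)

    def process_exception(self, request, exception, spider):
        # Called when the download raises an exception, e.g. a DNS failure
        # on the invalid sohu URL above.
        print(request.url + " -> process_exception: " + repr(exception))
        return None  # None lets Scrapy's default exception handling continue
```

The spider is then run from the project root with scrapy crawl zhongjjpc.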
5. Run result: all three hooks are invoked (process_exception is triggered by the invalid URL)
6. Developing middleware
1) Proxy middleware
request.meta['proxy'] = 'https://ip:port'
2) User-Agent middleware
request.headers['User-Agent'] = 'Mozilla/5.0 (Windows ......'
3) Cookies middleware
request.headers['Cookie'] = 'xxx'
An alternative is to set the request's cookies attribute, which Scrapy expects to be a dict rather than a string:
request.cookies = {'name': 'xxx'}
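The three one-liners above can each live in the process_request hook of a small middleware class. A minimal sketch follows; the proxy addresses, User-Agent strings, and cookie value are placeholders, not working proxies or real sessions:

```python
import random

# Placeholder pools for illustration only.
PROXY_POOL = ['https://1.2.3.4:8080', 'https://5.6.7.8:3128']
UA_POOL = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36',
]

class ProxyMiddleware:
    """Attach a random proxy to every outgoing request via request.meta."""
    def process_request(self, request, spider):
        request.meta['proxy'] = random.choice(PROXY_POOL)
        return None

class RandomUserAgentMiddleware:
    """Rotate the User-Agent header on each request."""
    def process_request(self, request, spider):
        request.headers['User-Agent'] = random.choice(UA_POOL)
        return None

class CookiesMiddleware:
    """Inject cookies as a dict, the form Scrapy expects."""
    def process_request(self, request, spider):
        request.cookies = {'sessionid': 'xxx'}  # placeholder value
        return None
```

Each class would then be registered in DOWNLOADER_MIDDLEWARES with its own priority number, the same way ZhongjjDownloaderMiddleware was enabled above.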