Part One: Overall Approach
First build a normal Scrapy project, then integrate scrapy-redis into it (a settings sketch follows the list below), and finally deploy the result as a distributed crawler.
The distributed deployment consists of:
- Install Redis (and MySQL, if needed) on the central node
- Install Python, Scrapy, scrapy-redis, and the Python redis module (plus pymysql, if MySQL is used) on every worker node
- Deploy the modified distributed crawler project to each worker node
- Run the distributed crawler project on each worker node
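A minimal sketch of the settings.py entries that wire scrapy-redis into an existing Scrapy project; the host address and priority values here are placeholders, not values taken from this project:
# settings.py -- switch the scheduler and dupefilter to the scrapy-redis versions
SCHEDULER = "scrapy_redis.scheduler.Scheduler"               # requests are queued in Redis
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"   # fingerprints kept in a Redis set
SCHEDULER_PERSIST = True          # keep the queue/dupefilter when a spider closes

# where the shared Redis lives (the central node; address is a placeholder)
REDIS_HOST = '192.168.0.100'
REDIS_PORT = 6379

# optional: also push scraped items into a Redis list
ITEM_PIPELINES = {
    'scrapy_redis.pipelines.RedisPipeline': 300,
}
With these settings every worker node pulls requests from, and reports fingerprints to, the same Redis instance, which is what makes the crawl distributed.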
Part Two: Detailed Implementation and Code
1. start_requests
Variant 1: yield each request (a generator); Request here is scrapy.http.Request:
def start_requests(self):
    for url in self.start_urls:
        yield Request(url=url, callback=self.parse2)
Variant 2: return a list of requests:
def start_requests(self):
    req_list = []
    for url in self.start_urls:
        req_list.append(Request(url=url, callback=self.parse2))
    return req_list
Both forms work, because Scrapy internally converts the return value into an iterator.
2. Selectors
Turn the response string into selector objects:
- Method 1:
    response.xpath('//div[@id="content-list"]/div[@class="item"]')
- Method 2:
    hxs = HtmlXPathSelector(response=response)
    items = hxs.xpath("//div[@id='content-list']/div[@class='item']")
Lookup rules:
//a
//div/a
//a[re:test(@id, "i\d+")]
items = hxs.xpath("//div[@id='content-list']/div[@class='item']")
for item in items:
    item.xpath('.//div')
Extracting:
Selector objects: xpath('/html/body/ul/li/a/@href')
List of strings:  xpath('/html/body/ul/li/a/@href').extract()
Single value:     xpath('//body/ul/li/a/@href').extract_first()
PS: standalone usage (running a selector against raw HTML, outside a spider):
from scrapy.selector import Selector, HtmlXPathSelector
from scrapy.http import HtmlResponse
html = """<!DOCTYPE html>
<html>
<head lang="en">
<meta charset="UTF-8">
<title></title>
</head>
<body>
<ul>
<li class="item-"><a id='i1' href="link.html">first item</a></li>
<li class="item-0"><a id='i2' href="llink.html">first item</a></li>
<li class="item-1"><a href="llink2.html">second item<span>vv</span></a></li>
</ul>
<div><a href="llink2.html">second item</a></div>
</body>
</html>
"""
response = HtmlResponse(url='http://example.com', body=html, encoding='utf-8')
obj = response.xpath('//a[@id="i1"]/text()').extract_first()
print(obj)
Tip: in Chrome DevTools you can right-click an element and choose Copy → Copy XPath to get its XPath expression.
3. pipelines
- Pipeline basics
class FilePipeline(object):
    def process_item(self, item, spider):
        print('write to file', item['href'])
        return item

    def open_spider(self, spider):
        """
        Called when the spider starts running.
        :param spider:
        :return:
        """
        print('open file')

    def close_spider(self, spider):
        """
        Called when the spider closes.
        :param spider:
        :return:
        """
        print('close file')
- Multiple pipelines (the lower the number, the higher the priority)
- With multiple pipelines, the item returned by process_item is passed on to the next pipeline's process_item (see the registration sketch below)
PS: to drop an item so that later pipelines never see it:
from scrapy.exceptions import DropItem

class FilePipeline(object):
    def process_item(self, item, spider):
        print('write to file', item['href'])
        # return item
        raise DropItem()
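A sketch of registering two pipelines in settings.py; FilePipeline is the class from this example, while DbPipeline is a hypothetical second pipeline added only to show priority and item passing:
ITEM_PIPELINES = {
    'xianglong.pipelines.FilePipeline': 300,   # lower value: runs first
    'xianglong.pipelines.DbPipeline': 400,     # hypothetical; receives whatever FilePipeline.process_item returned
}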
- Read the relevant values from the settings file, then use them in the pipeline:
class FilePipeline(object):
    def __init__(self, path):
        self.path = path
        self.f = None

    @classmethod
    def from_crawler(cls, crawler):
        """
        Called at startup to create the pipeline object.
        :param crawler:
        :return:
        """
        path = crawler.settings.get('XL_FILE_PATH')
        return cls(path)

    def process_item(self, item, spider):
        self.f.write(item['href'] + '\n')
        return item

    def open_spider(self, spider):
        """
        Called when the spider starts running.
        :param spider:
        :return:
        """
        self.f = open(self.path, 'w')

    def close_spider(self, spider):
        """
        Called when the spider closes.
        :param spider:
        :return:
        """
        self.f.close()
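The matching settings.py entries would look roughly like this (the file name is a placeholder):
# settings.py
XL_FILE_PATH = 'items.txt'   # read in from_crawler via crawler.settings.get('XL_FILE_PATH')
ITEM_PIPELINES = {
    'xianglong.pipelines.FilePipeline': 300,
}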
4. POST / request headers / cookies
Goal: automatically log in to chouti.com and upvote a post.
POST + request headers:
from scrapy.http import Request

req = Request(
    url='http://dig.chouti.com/login',
    method='POST',
    body='phone=8613121758648&password=woshiniba&oneMonth=1',
    headers={'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8'},
    cookies={},
    callback=self.parse_check_login,
)
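As an alternative (not used in the original example), Scrapy's FormRequest can build the urlencoded body and Content-Type header for you; a minimal sketch with the same placeholder credentials:
from scrapy.http import FormRequest

req = FormRequest(
    url='http://dig.chouti.com/login',
    formdata={'phone': '8613121758648', 'password': 'woshiniba', 'oneMonth': '1'},  # urlencoded and POSTed for you
    callback=self.parse_check_login,
)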
Cookies:
Manual (extract the cookies yourself and attach them to the next request):
from scrapy.http.cookies import CookieJar   # Scrapy's CookieJar wrapper

cookie_dict = {}
cookie_jar = CookieJar()
cookie_jar.extract_cookies(response, response.request)
for k, v in cookie_jar._cookies.items():
    for i, j in v.items():
        for m, n in j.items():
            cookie_dict[m] = n.value

req = Request(
    url='http://dig.chouti.com/login',
    method='POST',
    headers={'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8'},
    body='phone=8615131255089&password=pppppppp&oneMonth=1',
    cookies=cookie_dict,  # carry the cookies manually
    callback=self.check_login
)
yield req
Automatic (meta={'cookiejar': True} lets the cookies middleware carry the session cookies across requests):
class ChoutiSpider(scrapy.Spider):
    name = 'chouti'
    allowed_domains = ['chouti.com']
    start_urls = ['http://dig.chouti.com/', ]

    def start_requests(self):
        for url in self.start_urls:
            yield Request(url=url, callback=self.parse_index, meta={'cookiejar': True})

    def parse_index(self, response):
        req = Request(
            url='http://dig.chouti.com/login',
            method='POST',
            headers={'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8'},
            body='phone=8613121758648&password=woshiniba&oneMonth=1',
            callback=self.parse_check_login,
            meta={'cookiejar': True}
        )
        yield req

    def parse_check_login(self, response):
        # print(response.text)
        yield Request(
            url='https://dig.chouti.com/link/vote?linksId=19440976',
            method='POST',
            callback=self.parse_show_result,
            meta={'cookiejar': True}
        )

    def parse_show_result(self, response):
        print(response.text)
The settings file specifies whether cookie handling is enabled:
# Disable cookies (enabled by default)
# COOKIES_ENABLED = False
5. Deduplication rules
Settings:
DUPEFILTER_CLASS = 'xianglong.dupe.MyDupeFilter'
Write the class:
from scrapy.dupefilters import BaseDupeFilter

class MyDupeFilter(BaseDupeFilter):
    def __init__(self):
        self.record = set()

    @classmethod
    def from_settings(cls, settings):
        return cls()

    def request_seen(self, request):
        if request.url in self.record:
            print('already visited', request.url)
            return True
        self.record.add(request.url)
        return False

    def open(self):  # can return deferred
        pass

    def close(self, reason):  # can return a deferred
        pass
Problem: building a unique identifier for a request.
The two URLs below differ only in query-parameter order and should count as the same request:
http://www.oldboyedu.com?id=1&age=2
http://www.oldboyedu.com?age=2&id=1
from scrapy.utils.request import request_fingerprint
from scrapy.http import Request

u1 = Request(url='http://www.oldboyedu.com?id=1&age=2')
u2 = Request(url='http://www.oldboyedu.com?age=2&id=1')
result1 = request_fingerprint(u1)
result2 = request_fingerprint(u2)
print(result1, result2)  # the two fingerprints are identical
Problem: should the record of visited requests live in a database? [Use a Redis set instead]
The visit records can be kept in Redis.
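A minimal sketch of a Redis-backed dupefilter along those lines (this is essentially what scrapy-redis does; the host, key name, and class name are assumptions, not part of the original project):
from scrapy.dupefilters import BaseDupeFilter
from scrapy.utils.request import request_fingerprint
import redis


class RedisDupeFilter(BaseDupeFilter):
    """Store request fingerprints in a shared Redis set so every node sees them."""

    def __init__(self):
        self.conn = redis.Redis(host='127.0.0.1', port=6379)   # assumed central Redis node
        self.key = 'dupefilter:visited'                        # assumed key name

    @classmethod
    def from_settings(cls, settings):
        return cls()

    def request_seen(self, request):
        fp = request_fingerprint(request)
        # sadd returns 0 if the fingerprint was already in the set
        added = self.conn.sadd(self.key, fp)
        return added == 0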
Addendum: where exactly does dont_filter take effect?
from scrapy.core.scheduler import Scheduler

# excerpt from Scheduler
def enqueue_request(self, request):
    # request.dont_filter=False:
    #   self.df.request_seen(request):
    #   - True: already visited
    #   - False: not visited yet
    # request.dont_filter=True: everything is pushed into the scheduler
    if not request.dont_filter and self.df.request_seen(request):
        self.df.log(request, self.spider)
        return False
    # if we get this far, the request is added to the scheduler queue
    dqok = self._dqpush(request)
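A usage sketch (spider method and URL are placeholders): passing dont_filter=True makes the scheduler skip the dupefilter check entirely, so the request is always enqueued.
from scrapy.http import Request

def start_requests(self):
    # dont_filter=True: the dupefilter is bypassed and this request is always scheduled
    yield Request(url='http://dig.chouti.com/', callback=self.parse, dont_filter=True)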
6. Middleware
Problem: how do we attach a request header to every request the spider sends?
Option 1: add the header to every single Request object (a per-request sketch appears after Option 3)
Option 2: a downloader middleware
Settings:
DOWNLOADER_MIDDLEWARES = {
    'xianglong.middlewares.UserAgentDownloaderMiddleware': 543,
}
Write the class:
class UserAgentDownloaderMiddleware(object):
    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your middleware.
        s = cls()
        return s

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware.
        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        request.headers['User-Agent'] = "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36"

        # return None  # continue with the remaining middlewares' process_request

        # from scrapy.http import Request
        # return Request(url='www.baidu.com')  # put a new request back into the scheduler; the current request is not processed further

        # from scrapy.http import HtmlResponse  # short-circuit: run every process_response, starting from the last middleware
        # return HtmlResponse(url='www.baidu.com', body=b'asdfuowjelrjaspdoifualskdjf;lajsdf')

    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.
        # Must either:
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.
        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass
Option 3: the built-in downloader middleware (UserAgentMiddleware reads this setting)
Settings file:
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'
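For completeness, a small sketch of Option 1, setting the header on individual Request objects (the User-Agent value is abbreviated and the callback is a placeholder):
from scrapy.http import Request

def start_requests(self):
    for url in self.start_urls:
        # only this request carries the custom header
        yield Request(
            url=url,
            headers={'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) ...'},
            callback=self.parse,
        )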
Summary:
1. What are stored procedures, triggers, functions, etc. used for?
2. Additional optimizations:
- Read/write splitting, using the database's master/slave replication: the master handles deletes, updates, and inserts; the slaves handle reads.
  Raw SQL:
      select * from db.tb
  ORM:
      model.User.objects.all().using("default")
  PS: routing is handled by a db router (see the sketch after this list)
- Database sharding: when one database holds too many tables (e.g. 10,000 tables), split the tables across several databases
- Table sharding
- Caching: with Redis or Memcached
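A minimal sketch of such a Django database router; the alias names 'default'/'replica' and the module path are assumptions for illustration, not part of the original notes:
# routers.py (hypothetical module)
class ReadWriteRouter:
    """Route ORM reads to a replica and writes to the master."""

    def db_for_read(self, model, **hints):
        return 'replica'    # assumed read-only slave alias in DATABASES

    def db_for_write(self, model, **hints):
        return 'default'    # assumed master alias in DATABASES

# settings.py
# DATABASE_ROUTERS = ['myproject.routers.ReadWriteRouter']   # path is a placeholder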
3. Creating indexes
To index a TEXT column you must specify a prefix length (only the first N characters are indexed).
4. start_requests
- may return an iterable object
- or a generator
5. pipelines
- Settings
ITEM_PIPELINES = {
    'xianglong.pipelines.FilePipeline': 300,
}
- Write the class
class FilePipeline(object):
    def __init__(self, path):
        pass

    @classmethod
    def from_crawler(cls, crawler):
        pass

    def process_item(self, item, spider):
        return item

    def open_spider(self, spider):
        pass

    def close_spider(self, spider):
        pass
6. Deduplication
- Settings
DUPEFILTER_CLASS = 'xianglong.dupe.MyDupeFilter'
- Write the class
class MyDupeFilter(BaseDupeFilter):
    def __init__(self):
        pass

    @classmethod
    def from_settings(cls, settings):
        pass

    def request_seen(self, request):
        pass

    def open(self):  # can return deferred
        pass

    def close(self, reason):  # can return a deferred
        pass
7. Downloader middleware
- Settings
DOWNLOADER_MIDDLEWARES = {
    'xianglong.middlewares.UserAgentDownloaderMiddleware': 543,
}
- Class
class UserAgentDownloaderMiddleware(object):
    @classmethod
    def from_crawler(cls, crawler):
        pass

    def process_request(self, request, spider):
        pass

    def process_response(self, request, response, spider):
        pass

    def process_exception(self, request, exception, spider):
        pass
8. POST / request headers / cookies