Calling Python scripts for robots (robot python book)

Reposted

mob6454cc6c8549 2023-12-06 17:12:02

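The previous note covered the link crawler. The book also describes a few extra features that make the crawler more useful when crawling other websites.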

1. Parsing robots.txt

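We need to parse the robots.txt file so that we avoid downloading URLs that are disallowed for crawling. Python's built-in robotparser module (urllib.robotparser in Python 3) makes this easy.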

>>> from urllib import robotparser  # Python 2: import robotparser
>>> rp = robotparser.RobotFileParser()
>>> rp.set_url('http://example.webscraping.com/robots.txt')
>>> rp.read()
>>> url = 'http://example.webscraping.com'
>>> user_agent = 'BadCrawler'
>>> rp.can_fetch(user_agent, url)
False
>>> user_agent = 'GoodCrawler'
>>> rp.can_fetch(user_agent, url)
True

The robotparser module first loads the robots.txt file; the can_fetch() function then reports whether a given user agent is allowed to access a web page.

To integrate this into the crawler, we add the check inside the crawl loop.

while crawl_queue:
    url = crawl_queue.pop()
    # check whether robots.txt allows this URL for our user agent
    if rp.can_fetch(user_agent, url):
        ...
    else:
        print('Blocked by robots.txt:', url)
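The check above can be tried without any network access. The following is a minimal sketch for Python 3, assuming a made-up set of robots.txt rules that is parsed from memory with RobotFileParser.parse(); the rules, the crawl() helper and the example URLs are illustrative assumptions, not taken from the book.

```python
from urllib import robotparser

# Hypothetical robots.txt rules, parsed from memory instead of being fetched
ROBOTS_TXT = """\
User-agent: BadCrawler
Disallow: /

User-agent: *
Disallow: /trap
"""

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())


def crawl(seed_urls, user_agent):
    """Split the queued URLs into (allowed, blocked) lists.

    A real crawler would download the allowed URLs; this sketch only
    records the decision made by can_fetch().
    """
    crawl_queue = list(seed_urls)
    allowed, blocked = [], []
    while crawl_queue:
        url = crawl_queue.pop()
        if rp.can_fetch(user_agent, url):
            allowed.append(url)
        else:
            blocked.append(url)
    return allowed, blocked


urls = ['http://example.webscraping.com/',
        'http://example.webscraping.com/trap']
print(crawl(urls, 'GoodCrawler'))
print(crawl(urls, 'BadCrawler'))
```

With these assumed rules, GoodCrawler is blocked only from the /trap path, while BadCrawler is blocked from every URL.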

2. Supporting proxies

import urllib.error
import urllib.request
from urllib.parse import urlparse

def download(url, user_agent='wswp', proxy=None, num_retries=2):
    print('Downloading', url)
    headers = {'User-agent': user_agent}
    request = urllib.request.Request(url, headers=headers)
    opener = urllib.request.build_opener()
    if proxy:
        # route requests for this URL's scheme through the proxy
        proxy_params = {urlparse(url).scheme: proxy}
        opener.add_handler(urllib.request.ProxyHandler(proxy_params))
    try:
        html = opener.open(request).read()
    except urllib.error.URLError as e:
        print('Download error:', e.reason)
        html = None
        if num_retries > 0:
            if hasattr(e, 'code') and 500 <= e.code < 600:
                # retry only on 5xx server errors
                html = download(url, user_agent, proxy, num_retries - 1)
    return html
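To see what the proxy setup in download() does, the sketch below (Python 3) builds an opener with a ProxyHandler and inspects the scheme-to-proxy mapping it holds, without sending any request. The build_proxy_opener() helper and the proxy address 127.0.0.1:8080 are assumptions made for illustration.

```python
import urllib.request
from urllib.parse import urlparse


def build_proxy_opener(url, proxy):
    """Return an opener that sends requests for url's scheme through proxy."""
    proxy_params = {urlparse(url).scheme: proxy}
    return urllib.request.build_opener(
        urllib.request.ProxyHandler(proxy_params))


# Hypothetical local proxy address, used only for illustration
opener = build_proxy_opener('http://example.webscraping.com',
                            'http://127.0.0.1:8080')

# Collect the proxy handlers registered on the opener; no request is sent
proxy_handlers = [h for h in opener.handlers
                  if isinstance(h, urllib.request.ProxyHandler)]
print(proxy_handlers[0].proxies)
```

Because the mapping is keyed by the URL scheme, an http URL yields a proxy entry for http only; requests using other schemes on the same opener would not go through this proxy.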

3. Throttling downloads

We throttle the crawler by adding a delay between two downloads.

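import datetime
import time
from urllib.parse import urlparse

class Throttle:
    """Add a delay between downloads to the same domain."""
    def __init__(self, delay):
        # minimum delay, in seconds, between downloads to one domain
        self.delay = delay
        # timestamp of the last access to each domain
        self.domains = {}

    def wait(self, url):
        domain = urlparse(url).netloc
        last_accessed = self.domains.get(domain)
        if self.delay > 0 and last_accessed is not None:
            # total_seconds() covers the whole interval; .seconds would drop whole days
            elapsed = (datetime.datetime.now() - last_accessed).total_seconds()
            sleep_secs = self.delay - elapsed
            if sleep_secs > 0:
                time.sleep(sleep_secs)
        # record the time of this access
        self.domains[domain] = datetime.datetime.now()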

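The Throttle class records when each domain was last accessed. If less than the specified delay has passed since that access, it sleeps for the remaining time.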

throttle = Throttle(delay)
...
throttle.wait(url)
result = download(url, user_agent, proxy=proxy, num_retries=num_retries)
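The per-domain behaviour can be checked without actually waiting. The sketch below (Python 3) is a simplified variant of the same idea, assuming an injectable clock object; FakeClock and DemoThrottle are hypothetical names introduced here so that sleeping is only simulated.

```python
from urllib.parse import urlparse


class FakeClock:
    """Hypothetical stand-in for the time module: sleeping only advances a counter."""
    def __init__(self):
        self.now = 0.0
        self.sleeps = []

    def time(self):
        return self.now

    def sleep(self, secs):
        self.sleeps.append(secs)
        self.now += secs


class DemoThrottle:
    """Same per-domain delay logic as Throttle, with the clock passed in."""
    def __init__(self, delay, clock):
        self.delay = delay
        self.clock = clock
        self.domains = {}  # domain -> time of last access

    def wait(self, url):
        domain = urlparse(url).netloc
        last_accessed = self.domains.get(domain)
        if self.delay > 0 and last_accessed is not None:
            sleep_secs = self.delay - (self.clock.time() - last_accessed)
            if sleep_secs > 0:
                self.clock.sleep(sleep_secs)
        self.domains[domain] = self.clock.time()


clock = FakeClock()
throttle = DemoThrottle(delay=5, clock=clock)
throttle.wait('http://example.webscraping.com/a')  # first visit: no sleep
throttle.wait('http://example.org/')               # different domain: no sleep
clock.now += 2                                     # pretend 2 seconds of work
throttle.wait('http://example.webscraping.com/b')  # same domain: sleeps 3 seconds
print(clock.sleeps)
```

Only the second request to the same domain is delayed, and only by the part of the delay that has not already elapsed.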

4. Avoiding crawler traps

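Currently, our crawler follows every link it has not visited before. However, some websites generate page content dynamically, which can produce an unlimited number of pages. For example, if a website has an online calendar with links to the next month and the next year, the next month's page will in turn link to the month after it, and the chain of pages never ends. This situation is called a crawler trap.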

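A simple way to avoid getting stuck in a crawler trap is to record how many links were followed to reach the current page, that is, its depth. Once the maximum depth is reached, the crawler stops adding links from that page to the queue. To implement this, we change the seen variable: it originally recorded only the links of visited pages, and it now becomes a dictionary that also records each page's depth.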

def link_crawler(..., max_depth=2):
    # maps each seen URL to the depth at which it was found
    seen = {}
    ...
    depth = seen[url]
    if depth != max_depth:
        for link in links:
            if link not in seen:
                seen[link] = depth + 1
                crawl_queue.append(link)

To disable this feature, simply set max_depth to a negative number; the current depth will then never equal it.
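The effect of the depth limit can be seen on a simulated trap. In the sketch below (Python 3), get_links() is a hypothetical link extractor in which every calendar page links to the next one, so the chain of pages never ends; the /calendar/N URL scheme is likewise made up for illustration.

```python
def get_links(url):
    """Hypothetical link extractor: each calendar page links to the next page."""
    page = int(url.rsplit('/', 1)[1])
    return ['/calendar/%d' % (page + 1)]


def link_crawler(seed_url, max_depth=2):
    """Crawl from seed_url, never queueing links found on pages at max_depth."""
    crawl_queue = [seed_url]
    seen = {seed_url: 0}  # URL -> depth at which it was found
    while crawl_queue:
        url = crawl_queue.pop()
        depth = seen[url]
        if depth != max_depth:
            for link in get_links(url):
                if link not in seen:
                    seen[link] = depth + 1
                    crawl_queue.append(link)
    return seen


seen = link_crawler('/calendar/0', max_depth=2)
print(seen)
```

The crawl stops after the page at depth 2, even though the simulated site keeps offering new links. With a negative max_depth this particular example would never terminate, which is exactly the trap the limit guards against.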
