Recall that in an earlier post, Python movie query system (part 1): building the backend data in Python, the movies' download addresses could not be scraped: they were not visible in the page source but were loaded dynamically via JavaScript, and I said as much in that article.
Now we have found a way around that anti-scraping barrier. Let me walk through it in detail.
What the robobrowser library does is simulate a real browser: it keeps one session (cookies and headers) across requests and lets you navigate and query pages programmatically, which is exactly what we need to reach the download data that a plain one-off request can't. Pretty powerful, right?
1. Downloading and installing the robobrowser library
Install it straight from Python's pip:
pip3 install robobrowser
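One caveat: on newer Werkzeug releases, import robobrowser can fail with ImportError: cannot import name 'cached_property', because the library still imports it from the top-level werkzeug module. If you hit that, a commonly used workaround is to patch the attribute back before the import; a minimal sketch:

# Workaround sketch for newer Werkzeug versions, where cached_property
# now lives in werkzeug.utils: patch it back, then import robobrowser.
import werkzeug
import werkzeug.utils

if not hasattr(werkzeug, 'cached_property'):
    werkzeug.cached_property = werkzeug.utils.cached_property

import robobrowser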
2. Usage
Once it's installed, help() is the quickest way to see what the library offers.
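For instance, from an interactive session (open, select and follow_link are the methods we'll lean on below):

import robobrowser

# Print the class docs: the constructor arguments (parser, session, ...)
# and the navigation / query methods the browser object exposes.
help(robobrowser.RoboBrowser)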
- From the site's home page, click any movie link to open its detail page, e.g. http://www.bd-film.co/gq/25601.htm;
- Once there, press F12 to open the developer tools and look at the page source. Refresh the page and watch the Network tab.
Copy down the Request Headers (the fields under General, such as Request URL and Request Method, are DevTools metadata rather than real HTTP headers, so leave those out):
# -*- coding: utf-8 -*-
from requests import Session

ua = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36'
session = Session()
# Headers taken straight from the browser's F12 Network tab. Without them the
# site's anti-scraping kicks in: after a few dozen requests it starts returning
# pages with no data. The Cookie values below are session-bound, so copy fresh
# ones from your own browser.
session.headers = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "Accept-Encoding": "gzip, deflate",
    "Accept-Language": "zh-CN,zh;q=0.9",
    "Cache-Control": "max-age=0",
    "Cookie": "Hm_lvt_0fae9a0ed120850d7a658c2cb0783b55=1527565708,1527653577,1527679892,1527729123; Hm_lvt_cdce8cda34e84469b1c8015204129522=1527565709,1527653577,1527679892,1527729124; _site_id_cookie=1; clientlanguage=zh_CN; JSESSIONID=5AA866B8CDCDC49CA4B13D041E02D5E1; yunsuo_session_verify=c1b9cd7af99e39bbeaf2a6e4127803f1; Hm_lpvt_0fae9a0ed120850d7a658c2cb0783b55=1527731668; Hm_lpvt_cdce8cda34e84469b1c8015204129522=1527731668",
    "Host": "www.bd-film.co",
    "Proxy-Connection": "keep-alive",
    "Upgrade-Insecure-Requests": "1",
    "User-Agent": ua
}
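Before wiring these headers into the scraper, a quick sanity check that the session actually gets a normal page back doesn't hurt; a minimal sketch, using the detail page above as film_url:

# Fetch the detail page with the prepared session and make sure the
# anti-scraping layer hasn't served us an empty shell.
film_url = 'http://www.bd-film.co/gq/25601.htm'
resp = session.get(film_url)
print(resp.status_code)          # expect 200
print('downlist' in resp.text)   # expect True when the download block is present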
- Inspect the source around each download URL and, using the browser's copy-selector feature, copy each link's CSS selector path. Copying a few of them:
#downlist > div > div > div:nth-child(1) > div
#downlist > div > div > div:nth-child(2) > div
#downlist > div > div > div:nth-child(3) > div
The pattern is obvious: every download link's selector path goes through downlist, which is what the handling in the code below relies on.
rb = robobrowser.RoboBrowser(parser="html.parser", session=session)
rb.open(url=film_url)
r = rb.select('#downlist')  # CSS selector for the container of all download links
if not r:
    # print(rb.response.content.decode())
    raise RuntimeError("failed to fetch the page content")
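rb.select hands back a list of BeautifulSoup Tag objects (an empty list when nothing matches, hence the check above), so the usual bs4 tooling applies if you want to poke at what was matched:

# Peek at the matched container: it is an ordinary bs4 Tag.
downlist = r[0]
print(downlist.name)                 # tag name of the container element
print(len(downlist.select('div')))   # rough count of nested entries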
- Based on the URL behind each copied address (we've found the pattern), grab the concrete download links that sit behind it.
Now let's see how these map to the actual Xunlei (迅雷), Xiaomi and Baidu Cloud download links. The code:
r = r[0]
for v in range(128):  # set the loop bound to however many links you want to scrape
    id_name = '#real_address_%d' % v
    dl = r.select(id_name)
    if not dl:
        break
    dl = dl[0].select('.form-control')[0].text
    # dl now holds the concrete download address
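As an aside, if you'd rather not guess an upper bound for the loop, a CSS attribute selector can collect the links in one pass; a sketch, assuming all the ids really follow the real_address_N pattern:

# Alternative: match every element whose id starts with "real_address_"
# instead of counting ids upward until one is missing.
for tag in r.select('[id^=real_address_]'):
    print(tag.select('.form-control')[0].text)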
OK, the complete code:
# -*- coding: utf-8 -*-
import time

import robobrowser
from requests import Session


def get_bd_film_download_urls(film_url):
    urls = []
    try:
        ua = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36'
        session = Session()
        # Headers taken straight from the browser's F12 Network tab; without
        # them the site's anti-scraping kicks in and starts returning pages
        # with no data after a few dozen requests. The Cookie values are
        # session-bound, so replace them with fresh ones from your browser.
        session.headers = {
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
            "Accept-Encoding": "gzip, deflate",
            "Accept-Language": "zh-CN,zh;q=0.9",
            "Cache-Control": "max-age=0",
            "Cookie": "Hm_lvt_0fae9a0ed120850d7a658c2cb0783b55=1527565708,1527653577,1527679892,1527729123; Hm_lvt_cdce8cda34e84469b1c8015204129522=1527565709,1527653577,1527679892,1527729124; _site_id_cookie=1; clientlanguage=zh_CN; JSESSIONID=5AA866B8CDCDC49CA4B13D041E02D5E1; yunsuo_session_verify=c1b9cd7af99e39bbeaf2a6e4127803f1; Hm_lpvt_0fae9a0ed120850d7a658c2cb0783b55=1527731668; Hm_lpvt_cdce8cda34e84469b1c8015204129522=1527731668",
            "Host": "www.bd-film.co",
            "Proxy-Connection": "keep-alive",
            "Upgrade-Insecure-Requests": "1",
            "User-Agent": ua
        }
        rb = robobrowser.RoboBrowser(parser="html.parser", session=session)
        rb.open(url=film_url)
        if rb.response.status_code != 200:
            return urls
        r = rb.select('#downlist')  # CSS selector for the download-links container
        if not r:
            # print(rb.response.content.decode())
            raise RuntimeError("failed to fetch the page content")
        r = r[0]
        for v in range(128):
            id_name = '#real_address_%d' % v
            dl = r.select(id_name)
            if not dl:
                break
            dl = dl[0].select('.form-control')[0].text
            urls.append(dl)
    except Exception as err:
        print('error:', film_url, err)
    return urls


if __name__ == '__main__':
    for i in range(25000, 25700):
        ul = 'http://www.bd-film.co/zx/%d.htm' % i
        down_urls = get_bd_film_download_urls(ul)
        if down_urls:
            s = '-->'
            print(ul, s, ','.join(down_urls))
        time.sleep(1)  # throttle: one request per second
        # break
The result:
Paste the address after the --> into Xunlei and the download starts. Go give it a try!