For every computer vision enthusiast, the three top conferences are essential reading. Papers from these conferences are usually posted online as PDFs for download, so can we use a crawler to fetch them all?
First, is downloading all the papers even worthwhile? Absolutely. For one thing, you can look up papers anytime, even without a network connection; for another, you can write a PDF-parsing script on top of the downloaded papers and analyze them in bulk, like having an extra pair of eyes reading papers for you.
If you just want to reproduce the result, skip straight to the code at the end.
First, here is what I want the crawler script to look like (picturing the ideal outcome):
- It can crawl multiple conferences: give it 'CVPR2018' and it crawls CVPR 2018; give it 'ECCV2018' and it crawls ECCV 2018;
- A single run creates a folder named after the conference, e.g. 'CVPR2018/', and stores all results under that directory;
- While running, it saves all paper titles to 'xxxx_title.txt' and all PDF links to 'xxxx_url.txt' in that directory;
- A progress bar reports download progress, and all downloaded papers land in the designated folder;
- Papers whose links are dead are simply skipped;
- All of this with a single python get_papers.py.
I ran get_papers.py three times to download all the papers for ['CVPR2018', 'ECCV2018', 'CVPR2017']; the directory structure looks like this:
Each of the three folders contains all of that conference's PDFs:
The crawled result is shown above; it matches what we hoped for at the start. Next, let's see how to implement it.
Before implementing, a word about the environment. The crawler does not depend heavily on its environment; the following default setup is for reference:
- Ubuntu 16.04
- Python 3.6.5
- urllib3==1.22
- tqdm==4.28.1
Hardware is worth a note as well:
All CVPR 2018 papers together take about 2.2 GB and ECCV 2018 about 1.4 GB, so reserve at least a few GB of disk space for the papers. Also, since pages are cached in memory, a bit more RAM helps (memory >= 4 GB).
urllib is the basic module for crawling; this program uses it to fetch pages. The CVPR 2018 papers can all be found on the openaccess site: http://openaccess.thecvf.com/CVPR2018.py
So all we need to do is hack this page.
Step one: fetch the page content and decode it for display:
headers = {'User-Agent':
           'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0'}
try:
    req = request.Request(url=self.url, headers=headers)
    page = request.urlopen(req, data=None, timeout=10)
    str_content = page.read().decode('unicode_escape')
    print(str_content[0:1000])
except UnicodeDecodeError as e:
    print(e)
    print('-----UnicodeDecodeError url:', self.url)
except error.URLError as e:
    print(e)
    print('-----URLError url:', self.url)
except socket.timeout as e:
    print(e)
    print('-----socket timeout:', self.url)
print('parsing over!')
Decoding and printing the first 1000 characters of the page gives:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta content="text/html; charset=UTF-8" http-equiv="content-type">
<title>CVPR 2018 Open Access Repository</title>
<link rel="stylesheet" href="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.7/css/bootstrap.min.css">
<script type="text/javascript" src="https://ajax.googleapis.com/ajax/libs/jquery/3.1.1/jquery.min.js"></script>
<script type="text/javascript" src="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.7/js/bootstrap.min.js"></script>
<script type="text/javascript" src="./static/jquery.js"></script>
<link rel="stylesheet" type="text/css" href="static/conf.css">
</head>
<body>
<div id="header">
<div id="header_left">
<a href="http://cvpr2018.thecvf.com"><img src="img/cvpr2018_logo.jpg" width="175" border="0" alt="CVPR 2018"></a>
<a href="http://www.cv-foundation.org/"><img src="img/cropped-cvf-s.jpg" width="175" height="112" border="0" alt="CVF"></a>
</div>
<div id="header_right">
<div id="header_title">
<a href="http://cvpr2018.thecvf.com">CVPR 2018</a> open access
</div>
What we have now is an HTML document; to get the paper download links we need to parse it. Treat the whole thing as one huge string: finding the download links then reduces to spotting the patterns they follow.
Look at this section of the HTML:
[<a href="content_cvpr_2018/papers/Bai_Finding_Tiny_Faces_CVPR_2018_paper.pdf">pdf</a>]
<div class="link2">[<a class="fakelink" onclick="$(this).siblings('.bibref').slideToggle()">bibtex</a>]
<div class="bibref">
@InProceedings{Bai_2018_CVPR,<br>
author = {Bai, Yancheng and Zhang, Yongqiang and Ding, Mingli and Ghanem, Bernard},<br>
title = {Finding Tiny Faces in the Wild With Generative Adversarial Network},<br>
booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},<br>
month = {June},<br>
year = {2018}<br>
}
The PDF link appears on the first line; its telltale feature is the '>pdf</a>' that follows it. The paper title appears on the fourth line from the bottom; its feature is that the next line is the booktitle line.
Parsing the URLs and titles out of these two features is not hard at all. A script can then write the titles and URLs into txt files, like this:
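As a minimal sketch of the idea (using a shortened stand-in for the full decoded page text; the split markers are exactly the two features above, with \n being the literal newline in the decoded page):

```python
# A tiny excerpt standing in for the full decoded page text.
sample = (
    '[<a href="content_cvpr_2018/papers/Bai_Finding_Tiny_Faces_CVPR_2018_paper.pdf">pdf</a>]\n'
    '@InProceedings{Bai_2018_CVPR,<br>\n'
    'author = {Bai, Yancheng},<br>\n'
    'title = {Finding Tiny Faces in the Wild With Generative Adversarial Network},<br>\n'
    'booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},<br>\n'
)

# Every chunk before a '">pdf<' marker ends with the href of the pdf.
urls = [block.split('href="')[-1]
        for block in sample.split('">pdf<')
        if block.split('href="')[-1].endswith('.pdf')]

# Every chunk before ',<br>\nbooktitle' ends with the title field.
titles = [block.split('title = {')[-1].rstrip('}')
          for block in sample.split(',<br>\nbooktitle')
          if block.split('title = {')[-1].endswith('}')]

print(urls)    # relative pdf path
print(titles)  # paper title
```

The same two splits, applied to the whole page string, yield every paper's link and title in one pass.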
With the download links and titles in hand, it is easy to download each paper and save it under its title. urllib.request provides the urlretrieve method, which downloads a remote document to a local file.
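For a single paper, that step can be sketched like this (the helper name and the hard-coded link are illustrative; characters that are awkward in file names are replaced before saving, and a dead link is skipped rather than crashing):

```python
import socket
from urllib import request

socket.setdefaulttimeout(10)  # bound how long a stalled download can hang

def safe_filename(title):
    """Replace characters that are awkward in file names with underscores."""
    return ''.join('_' if c in ' :/' else c for c in title)

url_head = 'http://openaccess.thecvf.com/'
rel_url = 'content_cvpr_2018/papers/Bai_Finding_Tiny_Faces_CVPR_2018_paper.pdf'
title = 'Finding Tiny Faces in the Wild With Generative Adversarial Network'

# urlretrieve(url, filename) downloads the document at url to filename.
try:
    request.urlretrieve(url_head + rel_url, safe_filename(title) + '.pdf')
except OSError:
    pass  # dead link or network error: skip
```

The full script below does exactly this in a loop over all collected urls and titles.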
The full code:
#!/usr/bin/env python
# -*- coding:utf-8 -*-
"""
@version: 1.0
@author: levio
@contact: levio.pku@gmail.com
@file: get_papers.py
"""
import os
import socket
from urllib import request
from urllib import error

from tqdm import tqdm


class PaperDownloader:
    """
    usage:
        pd = PaperDownloader('CVPR2018')
        pd.download_paper()
    """
    def __init__(self, keyword, init=True):
        print('parsing web page...')
        self.url_head = 'http://openaccess.thecvf.com/'
        self.keyword = keyword
        self.url = self.url_head + self.keyword + '.py'
        self.init = init
        if init:
            self.str_content = self.get_page_str()
        self.directory = self.make_directory()
        self.title_file_path = self.directory + self.keyword + '_title.txt'
        self.url_file_path = self.directory + self.keyword + '_url.txt'

    def make_directory(self):
        if not os.path.exists(self.keyword):
            os.makedirs(self.keyword)
        return self.keyword + '/'

    def get_page_str(self):
        headers = {'User-Agent':
                   'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0'}
        str_content = ''
        try:
            req = request.Request(url=self.url, headers=headers)
            page = request.urlopen(req, data=None, timeout=10)
            str_content = page.read().decode('unicode_escape')
        except UnicodeDecodeError as e:
            print(e)
            print('-----UnicodeDecodeError url:', self.url)
        except error.URLError as e:
            print(e)
            print('-----URLError url:', self.url)
        except socket.timeout as e:
            print(e)
            print('-----socket timeout:', self.url)
        print('parsing over!')
        return str_content

    def get_pdf_url(self):
        # Each chunk before a '">pdf<' marker ends with the pdf's href.
        blocks = self.str_content.split('">pdf<')
        cnt = 0
        url_rears = []
        for i in range(len(blocks)):
            rear = blocks[i].split('href="')[-1]
            if rear.endswith('.pdf'):
                url_rears.append(rear + '\n')
                cnt += 1
        print('got ' + str(cnt) + ' PDFs!')
        print('got urls!')
        return url_rears

    def get_pdf_title(self):
        # Each chunk before ',<br>\nbooktitle' ends with the title field.
        blocks = self.str_content.split(',<br>\nbooktitle')
        cnt = 0
        titles = []
        for i in range(len(blocks)):
            title = blocks[i].split('title = {')[-1]
            cnt += 1
            if title.endswith('}'):
                titles.append(title.strip('}') + '\n')
        print('got titles!')
        return titles

    def write_titles_txt(self):
        with open(self.title_file_path, 'w+') as f:
            f.writelines(self.get_pdf_title())
        print('wrote titles into text!')

    def write_urls_txt(self):
        with open(self.url_file_path, 'w+') as f:
            f.writelines(self.get_pdf_url())
        print('wrote urls into text!')

    def adjust_title(self, s):
        # Replace characters that are illegal or awkward in file names.
        ls = list(s)
        symbols = [' ', ':', '/']
        for i in range(len(ls)):
            if ls[i] in symbols:
                ls[i] = '_'
        return ''.join(ls)

    def download_paper(self):
        if self.init:
            self.write_titles_txt()
            self.write_urls_txt()
        print('==================')
        print('downloading papers:')
        with open(self.title_file_path) as title_file, \
                open(self.url_file_path) as url_file:
            titles = title_file.readlines()
            urls = url_file.readlines()
        assert len(titles) == len(urls)
        for i in tqdm(range(len(urls))):
            try:
                request.urlretrieve(self.url_head + urls[i].strip('\n'),
                                    self.directory + self.adjust_title(titles[i].strip('\n')) + '.pdf')
            except Exception:
                continue  # dead link: skip this paper


def _main():
    kw = ['CVPR2018', 'ECCV2018', 'ICCV2017', 'CVPR2017']
    pd = PaperDownloader(kw[0])
    #pd = PaperDownloader(kw[0], init=False)
    #pd.download_paper()
    #pd.get_pdf_url()
    #pd.get_pdf_title()
    #pd.write_titles_txt()
    #pd.write_urls_txt()
    print(pd.str_content[0:10000])


if __name__ == '__main__':
    _main()