For every computer vision enthusiast, the three major top conferences (CVPR, ICCV, ECCV) are essential to follow. Their papers are usually posted online as pdf files for download, so can we just write a crawler and grab them all?

First of all, is it even worth downloading every paper? It certainly is: on one hand you can look papers up anytime, even without a network connection; on the other hand you can write a second script that parses the pdf contents and analyzes everything you crawled, like having an extra pair of eyes reading papers for you.

If you only want to reproduce the result, skip straight to the code at the end of the article.


First, here is what I want the crawler script to look like (picturing the ideal outcome):

  1. It can download papers from multiple conferences: give it 'CVPR2018' and it crawls CVPR2018, give it 'ECCV2018' and it crawls ECCV2018;
  2. As soon as I run the script, it creates a folder with the matching name, e.g. 'CVPR2018/', and all results go under that directory;
  3. While running, it saves all paper titles to 'xxxx_title.txt' and all pdf links to 'xxxx_url.txt' in that directory;
  4. A progress bar shows the download progress, and all downloaded papers end up in the target folder;
  5. Papers whose links are dead are simply skipped;
  6. Everything above is triggered by nothing more than python get_papers.py.

I ran get_papers.py three times to download all papers of ['CVPR2018', 'ECCV2018', 'CVPR2017']; the resulting directory structure looks like this:


(screenshot: the three generated directories CVPR2018/, ECCV2018/ and CVPR2017/)


Each of the three folders contains the pdf files of all of that conference's papers:


(screenshot: the downloaded pdf files inside one of the directories)


That is what the crawled papers look like, exactly what we hoped for at the beginning. Now let's see how to implement it.


Before implementing anything, a word about the environment. The crawler code does not depend on it much, but here is the setup it was written against, for reference:

  1. Ubuntu 16.04
  2. Python 3.6.5
  3. Urllib3==1.22
  4. tqdm==4.28.1

The hardware deserves a note as well:

All CVPR2018 papers together are about 2.2 GB and ECCV2018 about 1.4 GB, so reserve at least a few GB of disk space for the papers. Also, because the page content is cached in memory, a bit more RAM helps (memory >= 4 GB).

urllib is the basic module for crawling, and this program uses it to fetch the pages. All CVPR2018 papers can be found on the openaccess site: http://openaccess.thecvf.com/CVPR2018.py

So all we have to do is hack this page.

Step one: fetch the page content, decode it and print it:


import socket
from urllib import request, error

url = 'http://openaccess.thecvf.com/CVPR2018.py'
headers = {'User-Agent':
           'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0'}
try:
    req = request.Request(url=url, headers=headers)
    page = request.urlopen(req, data=None, timeout=10)
    str_content = page.read().decode('unicode_escape')
    print(str_content[0:1000])
except UnicodeDecodeError as e:
    print(e)
    print('-----UnicodeDecodeError url:', url)
except error.URLError as e:
    print(e)
    print('-----URLError url:', url)
except socket.timeout as e:
    print(e)
    print('-----socket timeout url:', url)
print('parsing over!')


Decoding and printing the first 1000 characters of the page gives:


<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta content="text/html; charset=UTF-8" http-equiv="content-type">
<title>CVPR 2018 Open Access Repository</title>
<link rel="stylesheet" href="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.7/css/bootstrap.min.css">
<script type="text/javascript" src="https://ajax.googleapis.com/ajax/libs/jquery/3.1.1/jquery.min.js"></script>
<script type="text/javascript" src="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.7/js/bootstrap.min.js"></script>
<script type="text/javascript" src="./static/jquery.js"></script>
<link rel="stylesheet" type="text/css" href="static/conf.css">
</head>
<body>
<div id="header">
<div id="header_left">
<a href="http://cvpr2018.thecvf.com"><img src="img/cvpr2018_logo.jpg" width="175" border="0" alt="CVPR 2018"></a>
<a href="http://www.cv-foundation.org/"><img src="img/cropped-cvf-s.jpg" width="175" height="112" border="0" alt="CVF"></a>
</div>
<div id="header_right">
<div id="header_title">
<a href="http://cvpr2018.thecvf.com">CVPR 2018</a> open access
</div>


What we get is an html document, and extracting the paper download links means analyzing this html text. Just treat it as one huge string; finding the download links then comes down to spotting the pattern they all follow.

Look at this part of the html:


[<a href="content_cvpr_2018/papers/Bai_Finding_Tiny_Faces_CVPR_2018_paper.pdf">pdf</a>]
<div class="link2">[<a class="fakelink" onclick="$(this).siblings('.bibref').slideToggle()">bibtex</a>]
<div class="bibref">
@InProceedings{Bai_2018_CVPR,<br>
author = {Bai, Yancheng and Zhang, Yongqiang and Ding, Mingli and Ghanem, Bernard},<br>
title = {Finding Tiny Faces in the Wild With Generative Adversarial Network},<br>
booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},<br>
month = {June},<br>
year = {2018}<br>
}


The pdf link appears on the first line of this snippet; its distinguishing feature is the '>pdf</a>' that follows it. The paper title sits inside the bibtex entry, and its feature is that the line right after it starts with booktitle.
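
Condensed into code, the extraction boils down to two string splits over the page source. Here is a minimal sketch of what get_pdf_url and get_pdf_title in the full script below do; it assumes str_content already holds the decoded html from step one:


# 1) pdf links: each link is the last href right before a '">pdf<' marker.
pdf_urls = []
for block in str_content.split('">pdf<'):
    rear = block.split('href="')[-1]
    if rear.endswith('.pdf'):
        pdf_urls.append(rear)

# 2) titles: each 'title = {...}' line is immediately followed by the
#    booktitle line, so split on that boundary and keep the braced text.
titles = []
for block in str_content.split(',<br>\nbooktitle'):
    title = block.split('title = {')[-1]
    if title.endswith('}'):
        titles.append(title.rstrip('}'))

print(len(pdf_urls), len(titles))  # one title per pdf link
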

With these two features, parsing out the urls and titles is not hard at all. The script then writes the titles and urls into txt files, as shown below:


(screenshots: the generated xxxx_title.txt and xxxx_url.txt files)


Now that we have the download links and the titles, it is easy to download each paper and save it under its title. urllib.request provides the urlretrieve function, which downloads a remote document to a local file.
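
For example, joining the relative link from the html snippet above onto the site root, a single paper can be fetched in one call (the local file name here is arbitrary and only for illustration):


from urllib import request

# Download one paper; the relative href comes from the html shown earlier.
pdf_url = ('http://openaccess.thecvf.com/'
           'content_cvpr_2018/papers/Bai_Finding_Tiny_Faces_CVPR_2018_paper.pdf')
request.urlretrieve(pdf_url, 'Finding_Tiny_Faces_CVPR_2018.pdf')
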


Code


#!/usr/bin/env python
# -*- coding:utf-8 -*-

"""
@version: 1.0
@author: levio
@contact: levio.pku@gmail.com
@file: get_papers.py
"""


import os
import socket
from urllib import request
from urllib import error
from tqdm import tqdm



class PaperDownloader:
    """
    usage:
         pd = PaperDownloader('CVPR2018')
         pd.download_paper()
    """
    def __init__(self, keyword, init=True):
        print('parsing web page...')
        self.url_head = 'http://openaccess.thecvf.com/'
        self.keyword = keyword
        self.url = self.url_head + self.keyword + '.py'
        self.init = init
        if init:
            self.str_content = self.get_page_str()
        self.directory = self.make_directory()
        self.title_file_path = self.directory + self.keyword + '_title.txt'
        self.url_file_path = self.directory + self.keyword + '_url.txt'

    def make_directory(self):
        # Create the output directory (e.g. 'CVPR2018/') if it does not exist yet.
        if not os.path.exists(self.keyword):
            os.makedirs(self.keyword)
        return self.keyword + '/'

    def get_page_str(self):
        headers = {'User-Agent':
                   'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0'}
        str_content = ''  # stays empty if the request fails
        try:
            req = request.Request(url=self.url, headers=headers)
            page = request.urlopen(req, data=None, timeout=10)
            str_content = page.read().decode('unicode_escape')
        except UnicodeDecodeError as e:
            print(e)
            print('-----UnicodeDecodeError url:', self.url)
        except error.URLError as e:
            print(e)
            print('-----URLError url:', self.url)
        except socket.timeout as e:
            print(e)
            print('-----socket timeout url:', self.url)
        print('parsing over!')
        return str_content


    def get_pdf_url(self):
        # Each pdf link is the last href right before a '">pdf<' marker.
        blocks = self.str_content.split('">pdf<')
        cnt = 0
        url_rears = []
        for i in range(len(blocks)):
            rear = blocks[i].split('href="')[-1]
            if rear.endswith('.pdf'):
                url_rears.append(rear + '\n')
                cnt += 1
        print('got ' + str(cnt) + ' PDFs!')
        print('got urls!')
        return url_rears


    def get_pdf_title(self):
        # Each title sits inside 'title = {...}' on the line before booktitle.
        blocks = self.str_content.split(',<br>\nbooktitle')
        cnt = 0
        titles = []
        for i in range(len(blocks)):
            title = blocks[i].split('title = {')[-1]
            cnt += 1
            if title.endswith('}'):
                titles.append(title.strip('}') + '\n')
        print('got titles!')
        return titles


    def write_titles_txt(self):
        with open(self.title_file_path, 'w+') as f:
            f.writelines(self.get_pdf_title())
        print('wrote titles into text!')
    
    def write_urls_txt(self):
        with open(self.url_file_path, 'w+') as f:
            f.writelines(self.get_pdf_url())
        print('wrote url into text!')

    def adjust_title(self, s):
        ls = list(s)
        symbols = [' ', ':', '/']
        for i in range(len(ls)):
            if ls[i] in symbols:
                ls[i]='_'
        return ''.join(ls)

    def download_paper(self):
        if self.init:
            self.write_titles_txt()
            self.write_urls_txt()
        print('==================')
        print('downloading papers:')
        with open(self.title_file_path) as title_file, open(self.url_file_path) as url_file:
            titles = title_file.readlines()
            urls = url_file.readlines()
        assert len(titles) == len(urls)
        for i in tqdm(range(len(urls))):
            try:
                request.urlretrieve(self.url_head + urls[i].strip('\n'),
                                    self.directory + self.adjust_title(titles[i].strip('\n')) + '.pdf')
            except Exception:
                # papers with a broken link are simply skipped
                continue


def _main():
    kw = ['CVPR2018', 'ECCV2018', 'ICCV2017', 'CVPR2017']
    pd = PaperDownloader(kw[0])
    pd.download_paper()
    # If the title/url txt files already exist, skip re-parsing the page:
    # pd = PaperDownloader(kw[0], init=False)
    # pd.download_paper()


if __name__ == '__main__':
    _main()