Scraping database data with Python: how to scrape literature from a database with Python

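page={} carries the page number, so the page to scrape is later filled into the {}. searchWord={} carries the search keyword, which is filled into its {} in the same way. order sets how the search results are sorted: journal papers by publication date (common_sort_time), conference papers by publication time (pro_pub_date), and degree theses by degree-conferral date (orig_pub_date). The rest of the URL can be copied as is. Note that the main program below pairs these order values with the search types differently, so confirm the value against the URL the site actually generates for each type.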
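
As a minimal sketch of how the two placeholders get filled, the snippet below uses str.format together with urllib.parse.quote, which percent-encodes the keyword. The shortened template and the helper name build_search_url are illustrative assumptions; the real URL carries more query parameters.

```python
from urllib.parse import quote

# Shortened, illustrative template: only page and searchWord are placeholders.
BASE_URL = ('http://www.wanfangdata.com.cn/search/searchList.do'
            '?searchType=perio&pageSize=20&page={}&searchWord={}')

def build_search_url(page, key_word):
    """Fill the page number and the percent-encoded keyword into the template."""
    return BASE_URL.format(page, quote(key_word))

print(build_search_url(3, 'deep learning'))
```

Encoding the keyword explicitly is optional when the URL is passed to requests, which encodes non-ASCII characters itself, but it makes the final URL predictable when logging or debugging.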

First we need the URLs of all papers on each results page. Here is the skeleton of the main program:

import re
import time

import requests
from requests import RequestException

def get_page(url):
    pass

def get_url(html,type):
    pass

def get_info(url,type):
    pass

if __name__ == '__main__':
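    key_word = input('Enter the search keyword: ')  # can be entered interactively or hard-coded
    type = input('Choose the paper type (p: journal, c: conference, d: degree): ')
    # Which page to start from and how many pages to scrape
    start_page = int(input('Enter the starting page: '))
    page_num = int(input('Enter the number of pages to scrape (20 results per page by default): '))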

    if type == 'c':
        base_url = 'http://www.wanfangdata.com.cn/search/searchList.do?beetlansyId=aysnsearch&searchType=conference&pageSize=20&page={}&searchWord={}&showType=detail&order=common_sort_time&isHit=&isHitUnit=&firstAuthor=false&navSearchType=conference&rangeParame='
    elif type == 'd':
        base_url = 'http://www.wanfangdata.com.cn/search/searchList.do?beetlansyId=aysnsearch&searchType=degree&pageSize=20&page={}&searchWord={}&showType=detail&order=pro_pub_date&isHit=&isHitUnit=&firstAuthor=false&navSearchType=degree&rangeParame='
    else:
        base_url = 'http://www.wanfangdata.com.cn/search/searchList.do?beetlansyId=aysnsearch&searchType=perio&pageSize=20&page={}&searchWord={}&showType=detail&order=orig_pub_date&isHit=&isHitUnit=&firstAuthor=false&navSearchType=perio&rangeParame='

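    for page in range(start_page, start_page + page_num):
        new_url = base_url.format(page, key_word)
        # Fetch the current results page: send the request and get the response
        html = get_page(new_url)
        if html is None:
            # The request failed, so skip this page
            continue
        # Parse the response and extract the URL of every paper on this page
        url_list = get_url(html, type)
        for url in url_list:
            # Fetch the details of each paper
            get_info(url, type)
            time.sleep(2)  # pause 2 seconds between requests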

Send the request and get the response to fetch the current page, by writing get_page(url):

def get_page(url):
    try:
        # Add a User-Agent to the headers so the request looks like it comes from a browser
        headers = {
            'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'
        }
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            response.encoding = response.apparent_encoding
            return response.text
        return None
    except RequestException as e:
        print(e)
        return None
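
Because get_page returns None on any failure, a caller may want to try again before giving up on a page. The sketch below is one possible way to do that, not part of the original program: fetch_with_retry is a hypothetical helper that accepts any fetch function with the same contract as get_page (page text on success, None on failure).

```python
import time

def fetch_with_retry(fetch, url, retries=3, delay=1.0):
    """Call fetch(url) up to `retries` times and return the first non-None result."""
    for attempt in range(1, retries + 1):
        html = fetch(url)
        if html is not None:
            return html
        if attempt < retries:
            time.sleep(delay)  # wait before the next attempt
    return None  # every attempt failed
```

In the main loop this would be called as fetch_with_retry(get_page, new_url).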

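Parse the response to get the URLs of all papers on the current page, by writing get_url(html,type). Paper URLs differ by page type because they include the type. Open any paper and its URL looks like this: http://www.wanfangdata.com.cn/details/detail.do?_type=degree&id=D01662433.

Here _type is the paper type and id identifies the paper. To build a paper's URL we need its id, so we parse the results page to extract each paper's id, add the type, and append both to the base URL: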

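[Figure: page source of the search results page, showing the tag that contains each paper's id]

Inspecting the page source shows that each paper's id appears in the tag shown in the figure above.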

def get_url(html,type):
    url_list = []
    pattern = re.compile(r"this\.id,'(.*?)'", re.S)  # escape the dot so it matches a literal "."
    ids = pattern.findall(html)

    for id in ids:
        if type == 'c':
            url_list.append('http://www.wanfangdata.com.cn/details/detail.do?_type=conference&id='+id)
        elif type == 'd':
            url_list.append('http://www.wanfangdata.com.cn/details/detail.do?_type=degree&id=' + id)
        else:
            url_list.append('http://www.wanfangdata.com.cn/details/detail.do?_type=perio&id=' + id)

    return url_list
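
To make the extraction step concrete, here is a small sketch that runs the same pattern on an invented fragment. The markup, the showBox name, and the second id are assumptions for illustration; only the this.id,'...' shape is taken from the regular expression in get_url.

```python
import re

# Hypothetical fragment: the real markup is more complex, but each id is
# assumed to follow "this.id,'" as the pattern in get_url expects.
sample_html = """
<i onclick="showBox(this.id,'D01662433','degree')"></i>
<i onclick="showBox(this.id,'D01662434','degree')"></i>
"""

pattern = re.compile(r"this\.id,'(.*?)'", re.S)
ids = pattern.findall(sample_html)  # non-greedy: stops at the first closing quote

detail_base = 'http://www.wanfangdata.com.cn/details/detail.do?_type=degree&id='
urls = [detail_base + paper_id for paper_id in ids]
print(urls)
```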

Each paper type gets its own .py file, and get_info hands the URL to the matching one:

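# The per-type scraper modules (one .py file per paper type)
import conference
import degree
import perio

def get_info(url,type):
    if type == 'c':
        conference.main(url)
    elif type == 'd':
        degree.main(url)
    else:
        perio.main(url)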
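
The if/elif chain in get_info could also be written as a lookup table, which keeps the mapping in one place if more paper types are added. This is only an alternative sketch: the three scrape_* functions are stand-ins for conference.main, degree.main and perio.main, and get_info_dispatch is a hypothetical name.

```python
def scrape_conference(url):
    return 'conference:' + url  # stand-in for conference.main(url)

def scrape_degree(url):
    return 'degree:' + url  # stand-in for degree.main(url)

def scrape_perio(url):
    return 'perio:' + url  # stand-in for perio.main(url)

# Map each type code to its handler; anything else falls back to journals,
# mirroring the final else branch of get_info.
HANDLERS = {'c': scrape_conference, 'd': scrape_degree}

def get_info_dispatch(url, paper_type):
    handler = HANDLERS.get(paper_type, scrape_perio)
    return handler(url)

print(get_info_dispatch('some-url', 'd'))
```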

conference.py scrapes the details of conference papers. Its skeleton is:

import os
import re

import requests
import xlrd
import xlutils.copy
import xlwt
from bs4 import BeautifulSoup
from requests import RequestException
import pymongo

def get_html(url):
    pass

def parse_html(html,url):
    pass

def save_p(paper):
    pass


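def main(url):
    # Send the request and get the response
    html = get_html(url)
    if html is None:
        # The request failed, so there is nothing to parse or store
        return
    # Parse the response
    paper = parse_html(html, url)
    # Store the data
    save_p(paper)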

Send the request and get the response, by writing get_html(url):

def get_html(url):
    try:
        # Add a User-Agent to the headers so the request looks like it comes from a browser
        headers = {
            'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'
        }
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            response.encoding = 'utf-8'
            return response.text
        return None
    except RequestException as e:
        print(e)
        return None

Parse the response with BeautifulSoup and extract the paper's details, by writing parse_html(html,url):

The fields to extract are listed below; the original post illustrated each one with a figure.

Title
Abstract
Keywords
Author
Author affiliation
Parent publication
Conference name
Conference date
Conference venue
Organizer
Online publication date

def parse_html(html,url):
    # Parse with BeautifulSoup
    soup = BeautifulSoup(html,'lxml')
    # Title
    title = soup.select('[style="font-weight:bold;"]')[0].text
    # Abstract
    abstract = soup.select('.abstract')[0].textarea
    if abstract:
        abstract = abstract.text.strip()
    else:
        abstract=''

    # Keywords: select() returns a list; ^= means "starts with", so this matches
    # nodes with the given title and href whose onclick starts with wfAnalysis
    keyword = soup.select('[title="知识脉络分析"][href="#"][onclick^="wfAnalysis"]')
    keywords = ''
    for word in keyword:
        keywords = keywords + word.text + ';'

    # Author (first match; empty string if none is found)
    author = soup.select('[onclick^="authorHome"]')
    author = author[0].text if author else ''

    # Author affiliation
    unit = soup.select('[class^="unit_nameType"]')
    unit = unit[0].text if unit else ''

    # Parent publication
    pattern = re.compile('母体文献.*?<div class="info_right author">(.*?)</div>',re.S)
    literature = re.findall(pattern, html)
    literature = literature[0] if literature else ''
    print(literature)

    # Conference name (empty string if none is found)
    conference = soup.select('[href="#"][onclick^="searchResult"]')
    conference = conference[0].text if conference else ''
    print(conference)

    # Conference date
    pattern = re.compile('会议时间.*?<div class="info_right">(.*?)</div>', re.S)
    date = pattern.findall(html)
    date = date[0].strip() if date else ''

    # Conference venue
    pattern = re.compile('会议地点.*?<div class="info_right author">(.*?)</div>', re.S)
    address = re.findall(pattern, html)
    address = address[0].strip() if address else ''
    print(address)

    # Organizer
    organizer = soup.select('[href="javascript:void(0)"][onclick^="searchResult"]')
    organizer = organizer[0].text if organizer else ''
    print(organizer)

    # Online publication date
    pattern = re.compile('在线出版日期.*?<div class="info_right author">(.*?)</div>', re.S)
    online_date = pattern.findall(html)
    online_date = online_date[0].strip() if online_date else ''

    # Collect the fields into a dictionary
    paper = {
        'title':title,
        'abstract':abstract,
        'keywords':keywords,
        'author':author,
        'unit':unit,
        'literature':literature,
        'conference':conference,
        'date':date,
        'address':address,
        'organizer':organizer,
        'online_date':online_date,
        'url':url
    }
    print(paper)
    return paper
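
Several fields in parse_html follow the same pattern: find a label, then capture the text of the next div with a given class. As a sketch of how that repeated step could be factored out, the hypothetical helper extract_field below wraps it and returns an empty string when the label is missing. The sample fragment is invented and only modelled on the regular expressions used above.

```python
import re

def extract_field(html, label, css_class='info_right'):
    """Return the text of the first <div class="css_class"> after `label`, or ''."""
    pattern = re.compile(
        re.escape(label) + r'.*?<div class="' + re.escape(css_class) + r'">(.*?)</div>',
        re.S)
    match = pattern.search(html)
    return match.group(1).strip() if match else ''

# Hypothetical fragment modelled on the patterns used in parse_html.
sample_html = '''
<div class="info_left">会议时间：</div>
<div class="info_right"> 2018-10-20 </div>
<div class="info_left">会议地点：</div>
<div class="info_right author"> 北京 </div>
'''

print(extract_field(sample_html, '会议时间'))
print(extract_field(sample_html, '会议地点', 'info_right author'))
print(repr(extract_field(sample_html, '主办单位')))  # label absent
```

Returning an empty string for a missing field keeps every value in the paper dictionary a string, which makes the stored documents uniform.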

Note that the data must be organized as a dictionary before it is inserted into MongoDB.

Storing the results (in MongoDB):

def save_p(paper):
    client = pymongo.MongoClient('mongodb://localhost:27017')
    # Select the database
    db = client.wanfang
    # Select the collection
    conference_col = db.conference
    result = conference_col.insert_one(paper)
    print(result)
    print(result.inserted_id)
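
Since insert_one adds a new document on every call, scraping the same page twice stores the same paper twice. One possible way to avoid that, sketched here as an assumption rather than as the original author's approach, is to upsert on the url field with update_one. The helper name save_unique is hypothetical, and FakeCollection is an in-memory stand-in so the sketch runs without a MongoDB server.

```python
def save_unique(collection, paper):
    """Insert the paper, or overwrite the stored copy that has the same url."""
    return collection.update_one({'url': paper['url']}, {'$set': paper}, upsert=True)

class FakeCollection:
    """Hypothetical in-memory stand-in that mimics only update_one with $set."""
    def __init__(self):
        self.docs = {}

    def update_one(self, filter, update, upsert=False):
        key = filter['url']
        if key in self.docs or upsert:
            self.docs.setdefault(key, {}).update(update['$set'])

col = FakeCollection()
save_unique(col, {'url': 'u1', 'title': 'first'})
save_unique(col, {'url': 'u1', 'title': 'second'})
print(len(col.docs), col.docs['u1']['title'])  # 1 second
```

With a real pymongo collection the call would be save_unique(db.conference, paper) in place of conference_col.insert_one(paper).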

Scraping results:

[Figure: the scraped results]