Scraping SCI literature with Python: scraping CNKI papers with Python

Repost

mob64ca14193248 2023-11-17 19:43:42


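Many students are already on the road to graduation, and all of them face the same hurdle: writing a thesis. Plenty have earned dark circles under their eyes over it and still have nothing to show, tormented by a thesis they both love and hate. One step in writing a thesis that cannot be skipped is looking up references. As the old saying goes, "read a book a hundred times and its meaning reveals itself"; browse CNKI often enough and you may well work out how to write your own thesis. So today I have put together a "Sunflower Manual plus" for fetching literature, so that thesis writing is one less thing to worry about.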

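Before we start:

This article is for discussion and learning only. Please do not use its code to attack other people's servers.
If anything here infringes your rights, contact the author to have it removed.
The errors in the article have been corrected. Thanks to the fellow scrapers who pointed them out.
If you copy and run the code from this article, set a delay between requests and leave yourself a way out.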
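
The delay asked for above can be as simple as a randomised pause between page loads. The following is a minimal sketch of one way to do that. The function name `polite_sleep`, its default bounds, and the injectable `sleep` argument are illustrative assumptions, not part of the original script, which uses a fixed `time.sleep(2)`.

```python
import random
import time


def polite_sleep(min_seconds=2.0, max_seconds=5.0, sleep=time.sleep):
    """Pause for a random duration in [min_seconds, max_seconds] and return it.

    The name and the default bounds are hypothetical; tune them to the site.
    The sleep argument exists only so the pause can be replaced in tests.
    """
    if min_seconds < 0 or max_seconds < min_seconds:
        raise ValueError("need 0 <= min_seconds <= max_seconds")
    delay = random.uniform(min_seconds, max_seconds)
    sleep(delay)
    return delay
```

Calling `polite_sleep()` after each click on the "load more" button spaces the requests out unevenly, which is gentler on the server than firing them back to back.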

First, a look at the scraped output.

[Figure: scraped results, first screenshot]

[Figure: scraped results, second screenshot]

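CNKI's anti-scraping measures are strong. When I scraped the PC site with selenium I could not get the page source at all, which was infuriating. After switching to the mobile site the source became available. The steps for the mobile site are as follows.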

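After opening CNKI, open the browser's developer tools (docking them on the right is recommended), click the button marked with the red box in the figure, and then refresh the page to switch to the mobile site.

[Figure: developer tools with the mobile-view button marked by a red box]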

The mobile interface looks like the figure below (note: remember to refresh the page):

[Figure: the CNKI mobile interface]

This is the URL (the code below uses http://wap.cnki.net/touch/web/guide):

[Figure: the address bar showing the mobile site's URL]

Before calling selenium, set a few options:

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
import time
import json
import csv

# Set up the Chrome driver options
options = webdriver.ChromeOptions()
# Tell Chrome not to load images, to speed things up
options.add_experimental_option("prefs", {"profile.managed_default_content_settings.images": 2})
# Create a Chrome driver
browser = webdriver.Chrome(options=options)
url = 'http://wap.cnki.net/touch/web/guide'
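
The manual switch to the mobile view described earlier can, in principle, also be requested from the script through ChromeDriver's `mobileEmulation` option. The sketch below is a possible setup and not something the original author did. The helper names, the screen metrics, and the user-agent string are all illustrative assumptions, and the article does not say whether the wap site needs emulation at all.

```python
# Illustrative user-agent string; replace it with one for the device you want.
SAMPLE_MOBILE_UA = (
    "Mozilla/5.0 (Linux; Android 4.2.1; en-us; Nexus 5 Build/JOP40D) "
    "AppleWebKit/535.19 (KHTML, like Gecko) "
    "Chrome/18.0.1025.166 Mobile Safari/535.19"
)


def mobile_emulation_settings(width=360, height=640, pixel_ratio=3.0,
                              user_agent=SAMPLE_MOBILE_UA):
    """Build the dict passed to Chrome under the "mobileEmulation" option.

    The metrics are hypothetical defaults, not values taken from the article.
    """
    return {
        "deviceMetrics": {
            "width": width,
            "height": height,
            "pixelRatio": pixel_ratio,
        },
        "userAgent": user_agent,
    }


def make_mobile_browser():
    """Start Chrome with mobile emulation (needs selenium and a Chrome driver)."""
    # Imported here so the settings helper above has no third-party dependency.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_experimental_option("mobileEmulation", mobile_emulation_settings())
    return webdriver.Chrome(options=options)
```

`make_mobile_browser()` would stand in for the plain `webdriver.Chrome(options=options)` call; the image-blocking preference from the setup code can be added to the same options object.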

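Since we are using selenium, we need the id of the input box so the keyword can be typed in automatically. After entering the keyword we locate the search button and click it.

[Figure: the search input box and search button in the page's HTML]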

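The code is as follows (the keyword entered is python):

# Request the URL
browser.get(url)
# Explicitly wait until the input box has loaded
WebDriverWait(browser, 1000).until(
    EC.presence_of_all_elements_located(
        (By.ID, 'keyword')
    )
)
# Find the input box by its id and type the keyword "python"
# (the complete script at the end clicks 'keyword' and types into 'keyword_ordinary')
browser.find_element(By.ID, 'keyword').send_keys('python')
# After entering the keyword, click the search button
browser.find_element(By.ID, 'btnSearch').click()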

After the search, the browser jumps to this page:

[Figure: the search results page]

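With selenium's explicit waits we can wait until this information has finished loading before scraping it, which avoids errors caused by elements that have not appeared yet. As the figure below shows, all of the results sit inside one div tag, so we can wait for that element to load.

[Figure: the div tag that contains the search results]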

# Explicitly wait until the results have loaded
WebDriverWait(browser, 1000).until(
    EC.presence_of_all_elements_located(
        (By.CLASS_NAME, 'g-search-body')
    )
)
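
To make the idea behind an explicit wait concrete, here is a small polling loop in plain Python. It only illustrates the concept and is not selenium's implementation: `wait_until` and its parameters are hypothetical names, and selenium's own `WebDriverWait` raises its own `TimeoutException` rather than the built-in `TimeoutError` used here.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5,
               clock=time.monotonic, sleep=time.sleep):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once timeout seconds have
    passed. clock and sleep are injectable so the loop can be tested
    without real waiting.
    """
    deadline = clock() + timeout
    while True:
        result = condition()
        if result:
            return result
        if clock() >= deadline:
            raise TimeoutError(
                "condition not met within {} seconds".format(timeout)
            )
        sleep(poll_interval)
```

In the scraper the condition is "the element with class `g-search-body` is present", and the value handed back is the located element(s), so the script carries on as soon as the page is ready instead of sleeping for a fixed time.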

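Scrolling down reveals a "load more" button. An explicit wait checks whether the button has loaded; clicking it before it has loaded raises an error.

[Figure: the "load more" button at the bottom of the results]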

The code below waits for the button to load and then gets it:

# Explicitly wait until the "load more" button has loaded
WebDriverWait(browser, 1000).until(
    EC.presence_of_all_elements_located(
        (By.CLASS_NAME, 'c-company__body-item-more')
    )
)
# Get the "load more" button
Btn = browser.find_element(By.CLASS_NAME, 'c-company__body-item-more')
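
The control flow around the "load more" button (read a block, stop at the requested page count or when the button is gone, otherwise click and continue) can be sketched without a browser. Everything in the block below is an illustrative assumption: `scrape_pages`, `read_block`, `click_load_more`, and `pause` are hypothetical names standing in for the selenium calls.

```python
def scrape_pages(read_block, click_load_more, max_pages=0, pause=lambda: None):
    """Collect records block by block.

    read_block(count) returns the records of the count-th block.
    click_load_more() returns True if another block was requested and
    False if there is no "load more" button left.
    max_pages=0 means "keep going until the button disappears".
    """
    records = []
    count = 1
    while True:
        records.extend(read_block(count))
        # Stop before clicking, so no extra block is requested needlessly.
        if max_pages and count >= max_pages:
            break
        if not click_load_more():
            break
        count += 1
        pause()
    return records
```

With selenium, `read_block` would wrap the extraction of one block of results and `click_load_more` would wrap the wait-and-click on the button; passing a sleep function as `pause` keeps the delay between requests.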

That covers most of the basics. Next we start extracting the information.

[Figure: the HTML of one search result]

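Extracting information is a basic scraping skill, so I won't go into detail here; see the comments in the code in the figure below.

[Figure: the extraction code with comments]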

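What the comment above means is this: each load also adds an element showing how many results are displayed, so, as the figure shows, the tags holding the information we need are numbers 1, 3, 5, 7, 9 and so on. That is why the index is 2*count-1.

[Figure: the child div tags of the results container, with the result blocks at odd positions]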
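
The odd-position rule is easy to check in isolation. The helpers below are a small sketch: the function names are made up for illustration, and the container id `searchlist_div` is the one used in the complete script. The last step of the script's XPath is left out here because it is truncated in the source.

```python
def block_position(count):
    """1-based position of the count-th result block inside the container."""
    if count < 1:
        raise ValueError("count starts at 1")
    return 2 * count - 1


def block_xpath(count, container_id="searchlist_div"):
    """XPath of the count-th result block (function names are illustrative)."""
    return '//div[@id="{}"]/div[{}]'.format(container_id, block_position(count))


# The first five loads map to the odd positions 1, 3, 5, 7, 9.
positions = [block_position(c) for c in range(1, 6)]
```

So the third load, for example, is found at `//div[@id="searchlist_div"]/div[5]`.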

There is not much left to explain. The code is commented throughout, and the complete script follows.

from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
import time
import json
import csv

# Set up the Chrome driver options
options = webdriver.ChromeOptions()
# Tell Chrome not to load images, to speed things up
options.add_experimental_option("prefs", {"profile.managed_default_content_settings.images": 2})
# Create a Chrome driver
browser = webdriver.Chrome(options=options)
url = 'http://wap.cnki.net/touch/web/guide'

# Global list that stores one dict per paper
data_list = []


def start_spider(page):
    # Request the URL
    browser.get(url)
    # Explicitly wait until the input box has loaded
    WebDriverWait(browser, 1000).until(
        EC.presence_of_all_elements_located(
            (By.ID, 'keyword')
        )
    )
    # Click the input box, then type the keyword "python"
    browser.find_element(By.ID, 'keyword').click()
    browser.find_element(By.ID, 'keyword_ordinary').send_keys('python')
    # After entering the keyword, click search
    browser.find_element(By.CLASS_NAME, 'btn-search').click()
    # print(browser.page_source)
    # Explicitly wait until the results have loaded
    WebDriverWait(browser, 1000).until(
        EC.presence_of_all_elements_located(
            (By.CLASS_NAME, 'g-search-body')
        )
    )
    # Counter for the number of pages loaded so far
    count = 1
    while True:
        # Explicitly wait for the "load more" button. If it does not appear
        # within 10 seconds (an assumed limit), treat it as missing, which
        # means we have reached the bottom of the results.
        try:
            WebDriverWait(browser, 10).until(
                EC.presence_of_all_elements_located(
                    (By.CLASS_NAME, 'c-company__body-item-more')
                )
            )
            # Get the "load more" button
            Btn = browser.find_element(By.CLASS_NAME, 'c-company__body-item-more')
        except TimeoutException:
            Btn = None
        # The information sits in div tags. format(2*count-1) is used because
        # each load also shows how many results are displayed; put simply,
        # the divs holding the information are the odd-numbered ones.
        # NOTE: the attribute test inside the final div[@] was lost from the
        # source text; restore it from the page before running.
        xpath = '//div[@id="searchlist_div"]/div[{}]/div[@]'.format(2*count-1)
        # Explicitly wait until this block of results has loaded
        WebDriverWait(browser, 1000).until(
            EC.presence_of_all_elements_located(
                (By.XPATH, xpath)
            )
        )
        divs = browser.find_elements(By.XPATH, xpath)
        # Loop over the results
        for div in divs:
            # Title of the paper
            name = div.find_element(By.CLASS_NAME, 'c-company__body-title').text
            # Authors of the paper
            author = div.find_element(By.CLASS_NAME, 'c-company__body-author').text
            # Abstract of the paper
            content = div.find_element(By.CLASS_NAME, 'c-company__body-content').text
            # Source, date, and literature type of the paper
            text = div.find_element(By.CLASS_NAME, 'c-company__body-name').text.split()
            if (len(text) == 3 and text[-1] == '优先') or len(text) == 2:
                # Source
                source = text[0]
                # Date
                datetime = text[1]
                # Literature type
                literature_type = None
            else:
                source = text[0]
                datetime = text[2]
                literature_type = text[1]
            # Download count and citation count
            temp = div.find_element(By.CLASS_NAME, 'c-company__body-info').text.split()
            # Download count
            download = temp[0].split('：')[-1]
            # Citation count
            cite = temp[1].split('：')[-1]
            # Store the fields in a dict
            data_dict = {}
            data_dict['name'] = name
            data_dict['author'] = author
            data_dict['content'] = content
            data_dict['source'] = source
            data_dict['datetime'] = datetime
            data_dict['literature_type'] = literature_type
            data_dict['download'] = download
            data_dict['cite'] = cite
            data_list.append(data_dict)
            print(data_dict)
        # Stop once the requested number of pages has been scraped
        if count == page:
            break
        # If the "load more" button was not found (we are at the bottom), stop
        if not Btn:
            break
        else:
            Btn.click()
        count += 1
        # Wait two seconds; we are not attacking the server
        time.sleep(2)


def main():
    start_spider(int(input('Enter the number of pages to scrape (enter 0 to scrape everything): ')))
    # Write the data to a JSON file ('w' keeps the file a single valid JSON document)
    with open('data_json.json', 'w', encoding='utf-8') as f:
        json.dump(data_list, f, ensure_ascii=False, indent=4)
    print('JSON file written')
    # Write the data to a CSV file
    with open('data_csv.csv', 'w', encoding='utf-8', newline='') as f:
        # Header row
        title = data_list[0].keys()
        # Create the writer object
        writer = csv.DictWriter(f, title)
        # Write the header row
        writer.writeheader()
        # Write all rows at once
        writer.writerows(data_list)
    print('CSV file written')


if __name__ == '__main__':
    main()
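
The field-splitting rules in the script can be pulled out into small functions and tried without a browser. This is a sketch of the same logic. The function names and the sample strings in the usage lines are made up for illustration, and the real text on the page may be laid out differently.

```python
def parse_meta(line):
    """Split the source/type/date line into (source, date, literature_type).

    Mirrors the script's rule: two fields, or three fields ending in '优先',
    mean there is no literature type; otherwise the type is the middle field.
    """
    parts = line.split()
    if (len(parts) == 3 and parts[-1] == '优先') or len(parts) == 2:
        return parts[0], parts[1], None
    return parts[0], parts[2], parts[1]


def parse_counts(line):
    """Return (downloads, citations) as strings.

    Each whitespace-separated field is expected to look like 'label：number',
    using the full-width colon that the script splits on.
    """
    fields = line.split()
    return fields[0].split('：')[-1], fields[1].split('：')[-1]


# Hypothetical sample strings, not copied from CNKI.
journal = parse_meta('SomeJournal 2020-01-10')
thesis = parse_meta('SomeUniversity 硕士 2019-05-01')
counts = parse_counts('下载：120 被引：8')
```

Both functions assume at least two fields on the line, as the script does; a result with a shorter line would raise an `IndexError` and needs its own handling.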

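Summary:

The academic road is long and hard. I hope this little program makes it a bit easier for everyone, and that you all read more papers, publish more papers, and do better research!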

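This article is a repost, and we respect the original author's copyright in it. If there are errors in the content or any infringement, the original author is welcome to contact us to correct the content or delete the article.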