Scraping information into Word to build a resume with Python: scraping Doc88 with Python

Reposted

mob64ca13fba42b 2024-01-19 23:27:21


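Downloading Doc88 documents with Python

Environment setup

First, we will use the selenium library, which can be installed directly with pip. Using selenium also requires installing a browser driver and configuring environment variables. I won't go into that here, since many blog posts already cover it.

# Install directly with pip
pip install selenium

Second, we need the img2pdf library, which merges multiple images into a single PDF. It can also be installed directly with pip.

# Install directly with pip
pip install img2pdf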

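Case analysis

Today's goal is to download documents from Doc88 (道客巴巴). Inspecting the site shows that Doc88 renders each document page with HTML5 canvas elements, so we cannot obtain the page images by parsing the page source.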


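My next idea was to use a packet-capture tool to grab the images returned by the server. Unfortunately, what the server returns are encrypted files.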


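At this point I was ready to give up, but then it occurred to me that JavaScript can convert a canvas to an image and download it locally. Python can call that JavaScript to automate the process, and the downloaded images can then be merged into a PDF. That gets the document saved locally, so let's get to it.

Implementing the download

First, import the libraries we will need later: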

from selenium import webdriver
import os
import time
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.options import Options
from lxml import etree

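We need to specify where the images are downloaded so that we can process them later. Conveniently, selenium can configure this directly:

options = Options()
# Build the path of the download folder under the current directory
path = os.path.join(os.getcwd(), 'data')
# Create the folder if it does not exist yet
is_exists = os.path.exists(path)
if not is_exists:
    os.mkdir(path)
# Tell the browser to download files into that folder
prefs = {"download.default_directory": path}
options.add_experimental_option("prefs", prefs)
# Start Chrome with these options
browser = webdriver.Chrome(options=options)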
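The folder setup above can also be wrapped in a small helper. This is only a sketch: `build_download_prefs` and its `folder_name` parameter are hypothetical names introduced here, and the helper only prepares the dictionary that is later passed to `add_experimental_option`.

```python
import os


def build_download_prefs(base_dir, folder_name="data"):
    """Create the download folder if needed and return Chrome prefs for it.

    Hypothetical helper: the name and parameters are not from the original post.
    """
    # Chrome expects an absolute path for the download directory.
    download_dir = os.path.join(os.path.abspath(base_dir), folder_name)
    # exist_ok=True makes the call safe to repeat on later runs.
    os.makedirs(download_dir, exist_ok=True)
    return {"download.default_directory": download_dir}
```

It would be used as `options.add_experimental_option("prefs", build_download_prefs(os.getcwd()))`. Building the path with `os.path.join` avoids relying on the backslash in `'\data'`, which only works on Windows.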

When we open the page, we need to find out how many pages the document has. The page count can be found directly in the page source, so we use selenium to get the current page source and parse it with XPath to extract the page count.

# Target URL
url="https://www.doc88.com/p-549550988532.html"
browser.get(url)
# Get the page source
text=browser.page_source
html=etree.HTML(text)
# Parse the source with XPath to get the total page count
page_num=html.xpath("//li[@class='text']/text()")[0]
page_num=int(page_num.replace('/ ',''))
print(f'{page_num} pages in total')
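The `replace('/ ', '')` call above works only while the label has exactly that shape. A slightly more tolerant way to read the number is sketched below. It assumes the label text looks something like `/ 12`, and `parse_page_count` is a hypothetical helper name.

```python
import re


def parse_page_count(label):
    """Extract the first integer from a page-count label such as '/ 12'.

    Hypothetical helper; the label format is an assumption based on the
    string handling in the snippet above.
    """
    match = re.search(r"\d+", label)
    if match is None:
        raise ValueError(f"no page count found in {label!r}")
    return int(match.group())
```

With this helper, the two `page_num` lines could become `page_num = parse_page_count(html.xpath("//li[@class='text']/text()")[0])`, which fails with a clear error if the site changes the label.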

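The site does not load all the page images at once. It loads a limited number of pages, and the reader has to click the "continue reading the full text for free" button to load the rest.

So we use selenium's click action to click the button as soon as it appears.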

# Wait for the page to load
time.sleep(10)
# Wait until the button is visible, then click it
element=WebDriverWait(browser, 20).until(EC.visibility_of_element_located((By.XPATH, "//div[@id='continueButton']")))
element.click()
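To make clear what the explicit wait is doing, here is a simplified stand-in written in plain Python. It is not Selenium's implementation: `wait_until` is a hypothetical function that only mirrors the idea of polling a condition until it returns something truthy or a timeout expires.

```python
import time


def wait_until(condition, timeout=20.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Illustrative sketch only; names and defaults are assumptions.
    Raises TimeoutError if the condition never becomes truthy in time.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)
```

Because the wait returns as soon as the condition holds, it usually finishes much sooner than a fixed `time.sleep(10)`, while still giving a slow page up to the full timeout.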

Next comes downloading the images, for which we use selenium's ability to run JavaScript:

# Script that returns the total scroll height of the page
js = "return document.body.scrollHeight"
# Start with the scroll position at 0
height = 0
# Total height of the page
new_height = browser.execute_script(js)
k=0

while k<page_num:
    for i in range(height, new_height, 3000):
        # Stop once every page has been requested
        if k>=page_num:
            break
        k+=1
        browser.execute_script('window.scrollTo(0, {})'.format(i))
        # After each scroll step, pause for one second so the page can load
        time.sleep(1)
        a = f"downloadPages({k}, {k})"
        # Chrome may ask whether to allow multiple downloads; click Allow manually when it does
        browser.execute_script("""function downloadPages(from, to) {
            for (let i = from; i <= to; i++) {
                const pageCanvas = document.getElementById('page_' + i);
                if (pageCanvas === null) break;
                // Zero-pad single-digit page numbers: page_01, page_02, ...
                const pageNo = i >= 10 ? '' + i : '0' + i;
                pageCanvas.toBlob(
                    blob => {
                        const anchor = document.createElement('a');
                        anchor.download = 'page_' + pageNo + '.png';
                        anchor.href = URL.createObjectURL(blob);
                        anchor.click();
                        URL.revokeObjectURL(anchor.href);
                    }
                );
            }
        };
        """ + a)

Complete code

from selenium import webdriver
import os
import time
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.options import Options
from lxml import etree
# Build the path of the download folder under the current directory
path = os.path.join(os.getcwd(), 'data')
options = Options()
# Create the folder if it does not exist yet
is_exists = os.path.exists(path)
if not is_exists:
    os.mkdir(path)

# Tell the browser to download files into that folder
prefs = {"download.default_directory": path}
options.add_experimental_option("prefs", prefs)
browser = webdriver.Chrome(options=options)
# Target URL
url='https://www.doc88.com/p-549550988532.html'
# browser.get('https://www.doc88.com/p-5969904068700.html')  # a thesis
browser.get(url)
# Page source
text=browser.page_source
html=etree.HTML(text)
page_num=html.xpath("//li[@class='text']/text()")[0]
# Extract the total page count
page_num=int(page_num.replace('/ ',''))
print(f'{page_num} pages in total')
# Wait for the page to load
time.sleep(10)
# Wait until the button is visible, then click it
element=WebDriverWait(browser, 20).until(EC.visibility_of_element_located((By.XPATH, "//div[@id='continueButton']")))
element.click()


# Script that returns the total scroll height of the page
js = "return document.body.scrollHeight"
# Start with the scroll position at 0
height = 0
# Total height of the page
new_height = browser.execute_script(js)
k=0

while k<page_num:
    for i in range(height, new_height, 3000):
        # Stop once every page has been requested
        if k>=page_num:
            break
        k+=1
        browser.execute_script('window.scrollTo(0, {})'.format(i))
        # After each scroll step, pause for one second so the page can load
        time.sleep(1)
        a = f"downloadPages({k}, {k})"
        # Chrome may ask whether to allow multiple downloads; click Allow manually when it does
        browser.execute_script("""function downloadPages(from, to) {
            for (let i = from; i <= to; i++) {
                const pageCanvas = document.getElementById('page_' + i);
                if (pageCanvas === null) break;
                // Zero-pad single-digit page numbers: page_01, page_02, ...
                const pageNo = i >= 10 ? '' + i : '0' + i;
                pageCanvas.toBlob(
                    blob => {
                        const anchor = document.createElement('a');
                        anchor.download = 'page_' + pageNo + '.png';
                        anchor.href = URL.createObjectURL(blob);
                        anchor.click();
                        URL.revokeObjectURL(anchor.href);
                    }
                    //, 'image/jpeg' // (*)
                    //, 0.9          // (*)
                );
            }
        };
        """ + a)

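Note
When the code runs, the browser asks whether to allow downloading multiple files. We need to click Allow, otherwise not all the images will be downloaded. So far I have not found a way to click it automatically, so suggestions are welcome.
This problem can be solved by adding the following preference: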

prefs = {"download.default_directory": path,"profile.default_content_setting_values.automatic_downloads":1}
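Another way to sidestep the download prompt, sketched here as an untested alternative rather than the method used in this post, is to have the browser return each canvas as a base64 data URL and write the file from Python. The sketch assumes the canvas can be exported (the `toBlob` approach above suggests it can); `data_url_to_bytes` and `save_canvas_png` are hypothetical helper names.

```python
import base64


def data_url_to_bytes(data_url):
    """Decode a base64 data URL such as 'data:image/png;base64,....' to bytes."""
    header, sep, payload = data_url.partition(",")
    if not sep or not header.startswith("data:") or not header.endswith(";base64"):
        raise ValueError("expected a base64-encoded data URL")
    return base64.b64decode(payload)


def save_canvas_png(data_url, path):
    """Write the decoded image bytes to the given file path."""
    with open(path, "wb") as f:
        f.write(data_url_to_bytes(data_url))
```

The data URL itself could come from something like `browser.execute_script("return document.getElementById('page_1').toDataURL('image/png')")`, after which `save_canvas_png(data_url, 'data/page_01.png')` writes the file without any browser download dialog.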


Merging into a PDF

The code for merging into a PDF is fairly simple, and the comments explain each step.
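Before merging, it may be worth checking that every page actually arrived in the download folder. The sketch below assumes the `page_NN.png` naming produced by the download script and that Chrome marks unfinished downloads with a `.crdownload` suffix. `missing_pages` is a hypothetical helper name.

```python
import os


def missing_pages(download_dir, page_count):
    """Return the page numbers whose PNG is absent or still downloading."""
    present = set(os.listdir(download_dir))
    missing = []
    for page in range(1, page_count + 1):
        # Matches the script's naming: page_01.png ... page_10.png ...
        name = f"page_{page:02d}.png"
        if name not in present or name + ".crdownload" in present:
            missing.append(page)
    return missing
```

An empty list from `missing_pages(path, page_num)` suggests the folder is complete. Otherwise the listed pages can be downloaded again before building the PDF.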

import img2pdf
import os
# List the PNG files in the download folder
filepath = os.path.join(os.getcwd(), 'data')
files = [i for i in os.listdir(filepath) if i.endswith('.png')]
print(files)
# Sort by page number so the pages are not merged out of order
filedict = {int(i.split('.')[0].split('_')[1]):i for i in files}
print(filedict)
files = [filedict[i] for i in sorted(filedict)]
# Prepend the folder path to each file name
files = [os.path.join(filepath, i) for i in files]
print(files)
# Merge all the images into a single PDF
with open('./testpdf1.pdf', mode='wb') as f:
    f.write(img2pdf.convert(files))
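The sorting step matters because a plain string sort would put `page_100.png` before `page_11.png`. The same ordering can be expressed with a regular expression that also skips unrelated files. This is a sketch under the assumption that the files follow the `page_<number>.png` pattern, and `sort_page_files` is a hypothetical helper name.

```python
import re

# Matches names like page_01.png or page_123.png and captures the number.
PAGE_FILE = re.compile(r"^page_(\d+)\.png$")


def sort_page_files(filenames):
    """Return only the page images, ordered by their numeric page number."""
    numbered = []
    for name in filenames:
        match = PAGE_FILE.match(name)
        if match:
            numbered.append((int(match.group(1)), name))
    return [name for _, name in sorted(numbered)]
```

For example, `sort_page_files(os.listdir(filepath))` would yield the list to join with the folder path and pass to `img2pdf.convert`.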