Python Web Scraping: Crawling Dynamically Loaded Websites with pyppeteer

Original post by 彭世瑜, 2022-02-17 18:49:45. Blog category: python


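© Copyright belongs to the author. This is an original work by 彭世瑜, published on the 51CTO blog. Contact the author for permission before reposting; unauthorized reposting may be pursued legally.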

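pyppeteer is similar to selenium: it lets you drive the Chrome browser from Python. It is an unofficial Python port of Puppeteer, and its API is built on asyncio, so every browser operation is awaited.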

Documentation: https://miyakogi.github.io/pyppeteer/index.html

GitHub: https://github.com/miyakogi/pyppeteer

Installation

Requirements:

Python 3.6+

pip install pyppeteer
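
Installing the package does not install a browser. On first launch pyppeteer downloads its own Chromium unless `executablePath` points at an existing one, and the code example in this post hard-codes a macOS path for that reason. The following is a minimal sketch of how a locally installed Chrome or Chromium might be located instead. The helper name `find_browser` and the candidate paths are assumptions made for illustration; they are not part of pyppeteer.

```python
import os
import shutil
import sys
from typing import Iterable, Optional

# Hypothetical candidate locations per platform; adjust for your own machine.
DEFAULT_CANDIDATES = {
    "darwin": ["/Applications/Google Chrome.app/Contents/MacOS/Google Chrome"],
    "linux": ["google-chrome", "google-chrome-stable", "chromium", "chromium-browser"],
    "win32": [r"C:\Program Files\Google\Chrome\Application\chrome.exe"],
}


def find_browser(candidates: Optional[Iterable[str]] = None) -> Optional[str]:
    """Return the first candidate that resolves to an executable, else None."""
    if candidates is None:
        candidates = DEFAULT_CANDIDATES.get(sys.platform, [])
    for candidate in candidates:
        if os.path.isabs(candidate):
            # Absolute paths are checked directly on disk.
            if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
                return candidate
        else:
            # Bare command names are looked up on PATH.
            resolved = shutil.which(candidate)
            if resolved:
                return resolved
    return None
```

If `find_browser()` returns a path, it can be passed to `launch(executablePath=...)`. If it returns `None`, omit `executablePath` and let pyppeteer fall back to its own Chromium download.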

Code example

# -*- coding: utf-8 -*-

import asyncio
from pyppeteer import launch
from pyquery import PyQuery as pq

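# Point pyppeteer at a locally installed browser if you can. Without executablePath
# it downloads its own Chromium on first launch, which is slow.
# (macOS path shown; adjust it for your system.)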
executable_path = "/Applications/Google Chrome.app/Contents/MacOS/Google Chrome"


# Example 1: render a page
async def crawl_page():
    # Launch the browser
    browser = await launch(executablePath=executable_path)

    # Open a new tab
    page = await browser.newPage()

    # Navigate to the URL
    await page.goto('http://quotes.toscrape.com/js/')

    # Grab the rendered HTML and parse it
    doc = pq(await page.content())
    print('Quotes:', doc('.quote').length)

    # Close the browser
    await browser.close()


# Example 2: take a screenshot, save a PDF, run JavaScript
async def save_pdf():
    browser = await launch(executablePath=executable_path)
    page = await browser.newPage()
    await page.goto('http://quotes.toscrape.com/js/')

    # Save a screenshot of the page
    await page.screenshot(path='example.png')

    # Export the page as a PDF (works only in headless mode, which is the launch default)
    await page.pdf(path='example.pdf')

    # Run JavaScript in the page and return the result to Python
    dimensions = await page.evaluate('''() => {
            return {
                width: document.documentElement.clientWidth,
                height: document.documentElement.clientHeight,
                deviceScaleFactor: window.devicePixelRatio,
            }
        }''')

    print(dimensions)

    await browser.close()


if __name__ == '__main__':
    # asyncio.run() needs Python 3.7+; on 3.6 use
    # asyncio.get_event_loop().run_until_complete(crawl_page()) instead.
    asyncio.run(crawl_page())
    # asyncio.run(save_pdf())
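
Both examples above close the browser only when every step succeeds, and they read the page right after `goto` returns. The sketch below is one possible hardened variant, not the author's method: it waits for the rendered elements with `page.waitForSelector` and closes the browser in a `finally` block. The function name `fetch_quotes`, the injectable `launcher` parameter, the 10-second timeout, and the `.quote .text` selector are assumptions made for illustration.

```python
from typing import Any, Awaitable, Callable, List, Optional


async def fetch_quotes(
    url: str,
    launcher: Optional[Callable[..., Awaitable[Any]]] = None,
    **launch_options: Any,
) -> List[str]:
    """Open url, wait for the JS-rendered quotes, and return their text."""
    if launcher is None:
        # Imported lazily so this module loads even where pyppeteer is absent.
        from pyppeteer import launch
        launcher = launch

    browser = await launcher(**launch_options)
    try:
        page = await browser.newPage()
        await page.goto(url)
        # Wait until at least one quote is in the DOM (timeout in milliseconds).
        await page.waitForSelector('.quote', {'timeout': 10000})
        # Runs inside the page; the selector assumes the site's current markup.
        return await page.evaluate(
            """() => Array.from(document.querySelectorAll('.quote .text'))
                         .map(el => el.textContent)"""
        )
    finally:
        # Always release the browser process, even if a step above failed.
        await browser.close()
```

On Python 3.7+ it could be run as `asyncio.run(fetch_quotes('http://quotes.toscrape.com/js/', executablePath=executable_path))`. The `launcher` parameter exists only so that the cleanup logic can be exercised with a stand-in object instead of a real browser.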