python spider novel

lz527657138 2024-09-14 15:03:59

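© Copyright belongs to the author: this is an original work by 51CTO blog author lz527657138. Contact the author for permission to repost; unauthorized reposting may lead to legal action.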

Fetching a novel with Python and msedgedriver

Disclaimer: for learning and practicing technique only.

from lxml import etree
from selenium import webdriver
from selenium.webdriver.edge.service import Service
from selenium.webdriver.edge.options import Options

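import os
import time

# Initialize the Edge options so that the browser window stays hidden
options = Options()
# Run in headless mode (no GUI)
options.add_argument("--headless")

driver_path = "msedgedriver.exe"
service = Service(executable_path=driver_path)
driver = webdriver.Edge(service=service, options=options)

# open(..., "a") does not create missing directories, so make sure the output folder exists
os.makedirs("./mjts", exist_ok=True)

url_prefix = "https://www.l*d*k*s*.com"
url = url_prefix + "/html/91/91737/1100841.html"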
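while True:
    driver.get(url)
    time.sleep(1)  # Wait for the page to load; the value is arbitrary, in seconds, ideally >= 1
    page_source = driver.page_source
    tree = etree.HTML(page_source)
    # xpath() returns a list of strings; join them into one string separated by newlines
    content = "\n".join(tree.xpath("//div[@id='content']/p/text()"))
    # Title of the current chapter
    title = tree.xpath("//div[@class='bookname']/h1/text()")[0]
    # Relative path of the next chapter
    next_ur = tree.xpath("//div[@class='bookname']/div[@class='bottem1']/a[3]/@href")[0]

    # Some pages are not chapters (their title has no "章") and are not part of the novel,
    # so only chapter pages are written to the file
    if "章" in title:
        print(f"Downloading 《{title}》...")
        with open("./mjts/mjts.txt", "a", encoding="utf-8") as file:
            file.write(title + "\n\n" + content + "\n\n")

    # If the "next" link points to the table of contents, the current page was the last chapter
    if next_ur == "/html/91/91737/":
        break
    # Full URL of the next chapter
    url = url_prefix + next_ur
    time.sleep(2)  # Reduce the load on the server; the value is arbitrary, in seconds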

driver.quit()  # Close the headless browser and stop the msedgedriver process
print("Download complete")
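
The script builds each next URL by concatenating `url_prefix` with the scraped `href`, and stops once that `href` is the table-of-contents path. A minimal sketch of those two steps as pure functions, using the standard library's `urljoin` so that relative and absolute links are both handled. The names `resolve_next` and `is_index` and the `example.com` host are hypothetical, chosen only for illustration:

```python
from urllib.parse import urljoin, urlparse

# Table-of-contents path, taken from the script in this post
INDEX_PATH = "/html/91/91737/"


def resolve_next(current_url: str, href: str) -> str:
    """Turn a scraped href (relative or absolute) into a full URL."""
    return urljoin(current_url, href)


def is_index(href: str, index_path: str = INDEX_PATH) -> bool:
    """Return True when the 'next' link leads back to the table of contents."""
    return urlparse(href).path == index_path


# Placeholder host; the real site name is masked in this post
current = "https://example.com/html/91/91737/1100841.html"
print(resolve_next(current, "/html/91/91737/1100842.html"))
print(is_index("/html/91/91737/"))
```

Comparing only the path component means the stop check still works if the site ever returns the link as a full URL instead of a site-relative path.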

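##### Points to note

 1. Open the page of the novel you want to download, press F12, and check the browser version in the `User-Agent` request header, then download the matching msedgedriver.exe. Download link: [msedgedriver download mirror](https://registry.npmmirror.com/binary.html?path=edgedriver/).
 2. Put the downloaded msedgedriver.exe somewhere the script can find it at run time. Because `driver_path` in the script is a relative path, the current working directory is the simplest choice.
 3. Finally, there are many ways to fetch a novel; this is only one of them.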
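
For the first note, the part of the `User-Agent` to look at is the `Edg/<version>` token that Edge appends. A small sketch of pulling that version out with a regular expression; the helper name `edge_version` and the sample header value are made up for illustration, so substitute the string you actually see in the F12 tools:

```python
import re
from typing import Optional


def edge_version(user_agent: str) -> Optional[str]:
    """Return the version after the 'Edg/' token, or None if the token is absent."""
    match = re.search(r"\bEdg/(\d+(?:\.\d+)*)", user_agent)
    return match.group(1) if match else None


# Illustrative value only; copy the real header from your own browser
sample_ua = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/128.0.0.0 Safari/537.36 Edg/128.0.0.0"
)
print(edge_version(sample_ua))
```

The returned version is the one to look for when choosing a folder on the download page.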