Java WKHtmltoPdf with a headless browser; Python headless browser

Reprint

mob64ca14196783 2023-10-16 17:17:15

Tags: python, web scraping, selenium, Chrome, html. Category: Java, backend development

Contents

1. selenium
2. Scraping Lagou: basic operations
3. Window switching
4. Headless browser operations
5. XPath notes
6. Summary

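1. selenium

Selenium is a scripting tool that simulates browser operations, which makes it possible to extract fairly complex content from web pages.
Download and install the environment:
1) pip install selenium
2) Install the browser driver: put the downloaded driver in the folder that contains the Python interpreter.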
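
Placing the driver next to the interpreter presumably works because that folder is usually on PATH. A minimal sketch for checking whether the driver executable can be found on PATH; the helper name `find_driver` is hypothetical, and `chromedriver` is assumed to be the driver in use:

```python
import shutil


def find_driver(name="chromedriver"):
    """Return the full path of the driver executable on PATH, or None."""
    return shutil.which(name)


driver_path = find_driver()
print(driver_path or "chromedriver not found on PATH")
```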

2. Scraping Lagou: basic operations

Lagou (拉勾网)

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
import time

t1 = time.time()
option = webdriver.ChromeOptions()
# Suppress some noisy log output
option.add_experimental_option("excludeSwitches", ['enable-automation', 'enable-logging'])
# 1. Launch the browser
web = webdriver.Chrome(options=option)
# 2. Open a URL
web.get('https://www.lagou.com/')
# 3. Pick Beijing in the city box and click it
# (find_element(By.XPATH, ...) replaces find_element_by_xpath, which Selenium 4.3 removed)
city = web.find_element(By.XPATH, '//*[@id="changeCityBox"]/ul/li[1]/a')
city.click()
# 4. Type "python" into the search box and press Enter
web.find_element(By.XPATH, '//*[@id="search_input"]').send_keys('python', Keys.ENTER)
time.sleep(3)
# 5. Read the results
data_list = web.find_elements(By.XPATH, '//*[@id="jobList"]/div[1]/div')  # every element matching the path
for data in data_list:
    job_name = data.find_element(By.TAG_NAME, 'a').text  # text of the first <a> tag
    job_price = data.find_element(By.CLASS_NAME, 'money__3Lkgq').text  # element with this class
    job_comp = data.find_element(By.CLASS_NAME, 'company-name__2-SjF').text
    print(job_name, job_price, job_comp)
t2 = time.time()
print(t2 - t1)  # elapsed time in seconds
web.close()
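
The fixed time.sleep(3) in the script above always waits the full three seconds and still fails if the results take longer to load. Selenium provides WebDriverWait for this purpose; the block below is only a plain-Python stand-in that shows the polling idea without needing a browser. The helper name `wait_until` and its parameters are hypothetical.

```python
import time


def wait_until(condition, timeout=10.0, interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds
    have passed without a truthy result.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(interval)
```

With a live driver, the call might look like `wait_until(lambda: web.find_elements(By.XPATH, '//*[@id="jobList"]/div[1]/div'))`, because find_elements returns an empty list until something matches.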

3. Window switching

To switch between pages, read the handles of the open windows from window_handles, then switch with switch_to.window.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
import time

option = webdriver.ChromeOptions()
# Suppress some noisy log output
option.add_experimental_option("excludeSwitches", ['enable-automation', 'enable-logging'])
web = webdriver.Chrome(options=option)
web.get('https://www.lagou.com/')
time.sleep(1)
# Dismiss the overlay (the element with id cboxClose)
web.find_element(By.XPATH, '//*[@id="cboxClose"]').click()
time.sleep(1)
web.find_element(By.XPATH, '//*[@id="search_input"]').send_keys('python', Keys.ENTER)
time.sleep(1)
# Click the first job link; it opens in a new window
web.find_element(By.XPATH, '//*[@id="jobList"]/div[1]/div[1]/div[1]/div[1]/div[1]/a').click()
time.sleep(2)
# The driver does not follow the new window; you must switch to it explicitly
web.switch_to.window(web.window_handles[-1])
job_detail = web.find_element(By.XPATH, '//*[@id="job_detail"]/dd[2]/div').text
print(job_detail)
# Close this window, then move the driver's focus back to the original one
web.close()
time.sleep(1)
web.switch_to.window(web.window_handles[0])
print(web.find_element(By.XPATH, '//*[@id="jobList"]/div[1]/div[1]/div[1]/div[1]/div[1]/a').text)
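
Indexing window_handles with [-1] assumes the newest window is last in the list. A slightly more defensive sketch is to remember the handles before the click and switch to whichever handle is new afterwards. The helper name `switch_to_new_window` is hypothetical; it relies only on the window_handles attribute and the switch_to.window method used above.

```python
def switch_to_new_window(driver, known_handles):
    """Switch to the first window handle that is not in known_handles.

    Returns the handle that was switched to, or raises RuntimeError
    if no new window has appeared.
    """
    for handle in driver.window_handles:
        if handle not in known_handles:
            driver.switch_to.window(handle)
            return handle
    raise RuntimeError("no new window was opened")
```

A typical use would be `before = set(web.window_handles)`, then the click, then `switch_to_new_window(web, before)`.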

For content inside a frame, first locate the corresponding iframe element, then switch into it directly with switch_to.frame.

from selenium import webdriver
from selenium.webdriver.common.by import By
import time

option = webdriver.ChromeOptions()
# Suppress some noisy log output
option.add_experimental_option("excludeSwitches", ['enable-automation', 'enable-logging'])
web = webdriver.Chrome(options=option)

web.get('http://www.wbdy.tv/play/44817_2_1.html')
time.sleep(10)
iframe = web.find_element(By.XPATH, '//*[@id="mplay"]')
web.switch_to.frame(iframe)  # enter the iframe; switch_to.default_content() switches back out
res = web.find_element(By.XPATH, '//*[@id="dplayer"]/div[4]/div[2]/span[1]/span[1]')
print(res.text)  # print the element's text rather than the WebElement object
time.sleep(10)
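
It is easy to forget the switch back to the top-level document after working inside an iframe. One possible pattern is a small context manager that always calls switch_to.default_content() on exit. The name `inside_frame` is hypothetical; it uses only switch_to.frame and switch_to.default_content, the two calls mentioned above.

```python
from contextlib import contextmanager


@contextmanager
def inside_frame(driver, frame_element):
    """Enter the given iframe, and always return to the top-level document."""
    driver.switch_to.frame(frame_element)
    try:
        yield driver
    finally:
        # Runs even if the body raises, so later lookups start from the main page
        driver.switch_to.default_content()
```

Usage would be along the lines of `with inside_frame(web, iframe): ...`, with the element lookups for the player placed in the body.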

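4. Headless browser operations

Endata (艺恩) overseas box office: the dropdown on this page is implemented as an input element rather than a select element, so the usual select-based approach does not work and the options are chosen by clicking instead.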

from selenium import webdriver
from selenium.webdriver.common.by import By
import time

option = webdriver.ChromeOptions()
# Suppress some noisy log output
option.add_experimental_option("excludeSwitches", ['enable-automation', 'enable-logging'])
option.add_argument('--headless')  # headless mode: no browser window, runs in the background
option.add_argument('--disable-gpu')
web = webdriver.Chrome(options=option)

web.get('https://ys.endata.cn/BoxOffice/Overseas')

time.sleep(5)
# Count the options in the dropdown list
nums = len(web.find_elements(By.XPATH, '/html/body/div/div[1]/div[1]/ul/li'))
print(nums)
for num in range(8):
    # Click the input to open the dropdown
    web.find_element(By.XPATH, '//*[@id="app"]/section/main/div/div[1]/div/section/section/div/div[1]/div/div/input').click()
    time.sleep(0.1)
    # Click the (num+1)-th option
    web.find_element(By.XPATH, f'/html/body/div/div[1]/div[1]/ul/li[{num+1}]/span').click()
    time.sleep(5)
    res = web.find_element(By.XPATH, '//*[@id="app"]/section/main/div/div[1]/div/section/section/section/div[1]')
    print(res.text)

print('Done!!')
web.close()

# Get the page source (the HTML after data loading and JS execution)
# web.page_source
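
Because page_source is an ordinary HTML string, it can be handed to any HTML parser once the browser has rendered the page. A minimal sketch using the standard library's html.parser; the sample string stands in for web.page_source, and the names `LinkCollector` and `extract_links` are hypothetical:

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag that has one."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def extract_links(page_source):
    parser = LinkCollector()
    parser.feed(page_source)
    parser.close()
    return parser.links


# Stand-in for web.page_source; with a live driver, pass web.page_source instead
sample_source = '<div id="app"><a href="/film/1">A</a><a>no href</a><a href="/film/2">B</a></div>'
links = extract_links(sample_source)
print(links)
```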

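5. XPath notes

5.1 Deprecated usage

driver.find_element(By.XPATH, "path")
# is equivalent to the deprecated form (removed in Selenium 4.3)
driver.find_element_by_xpath('path')

5.2 Searching by attribute
div can be replaced with * to match any tag.

driver.find_elements(By.XPATH, "//div[@class='app-tree-table']")

5.3 Finding a descendant by attribute under an ancestor selected by attribute
Separate the two with //*/

driver.find_elements(By.XPATH, "//div[@class='app-tree-table']//*/a[@target='_blank']")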
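
The effect of these expressions can be tried without a browser. The sketch below uses the standard library's xml.etree.ElementTree, which implements only a subset of XPath (paths are written relative to the root, starting with a dot) but covers the patterns in 5.2 and 5.3. The sample markup is invented for illustration. Note that //*/a requires at least one element between the ancestor and the link, so a link that is a direct child of the div is matched by //a but not by //*/a.

```python
import xml.etree.ElementTree as ET

# Invented sample markup, for illustration only
SAMPLE = """
<html><body>
  <div class="app-tree-table">
    <a target="_blank" href="/direct">direct child</a>
    <ul>
      <li><a target="_blank" href="/nested">nested</a></li>
      <li><a target="_self" href="/same-tab">same tab</a></li>
    </ul>
  </div>
  <span class="app-tree-table">not a div</span>
</body></html>
""".strip()

root = ET.fromstring(SAMPLE)

# 5.2: match by attribute, first restricted to div, then with * for any tag
div_only = root.findall(".//div[@class='app-tree-table']")
any_tag = root.findall(".//*[@class='app-tree-table']")

# 5.3: links with target=_blank below the div, at any depth
any_depth = root.findall(".//div[@class='app-tree-table']//a[@target='_blank']")
# Same, but with //*/ the link's parent must itself be a descendant of the div
via_star = root.findall(".//div[@class='app-tree-table']//*/a[@target='_blank']")

print([e.tag for e in div_only])
print([e.tag for e in any_tag])
print([a.get("href") for a in any_depth])
print([a.get("href") for a in via_star])
```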

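6. Summary

Selenium simulates the actions a person would perform, but carries them out automatically, much like a script.
Its convenience is that you can get the page source after loading and extract content from it that has already been decrypted.
CAPTCHA handling and some other operations still remain; those will need another day.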

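This article is reprinted content, and we respect the original author's copyright in it. If there are content errors or infringement issues, the original author is welcome to contact us to have the content corrected or the article removed.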
