Table of Contents
- Parsel (from parsel import Selector)
- 1. Setup
- 2. Initialization
- 3. Basic node operations
- 4. Common mixed operations with Parsel
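Parsel (from parsel import Selector)
- With the parsing libraries covered earlier, we saw that lxml extracts page content with XPath and PyQuery does so with CSS selectors. That raises a question: is there a parsing library that supports XPath, CSS selectors, and regular expressions all at once?
- parsel is a third-party Python library that combines CSS selectors, XPath, and regular expressions (re) in one API.
- parsel is developed by the Scrapy team. It is Scrapy's selector component split out into a standalone library, and it makes it easy to parse HTML and XML content and extract the data you need.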
1. Setup
- Install:
pip install parsel
- Official documentation: Parsel official documentation
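2. Initialization
- Whether you use CSS selectors, XPath, or re, you first need to create a parsel.Selector object.
- Once the Selector object exists, you can switch freely between xpath() and css().
In fact, when you select nodes with css(), the CSS expression is ultimately translated into XPath before the query runs.
selector = Selector(html) # initialize the Selector object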
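3. Basic node operations
To turn the result of a css() or xpath() query into a string or a list, use the following methods:
- get(): returns the first matched result as a str, or None if nothing matched
- getall(): returns all matched results as a Python list of str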
from parsel import Selector
html = '''
<html>
<head>
<base href='http://example.com/' />
<title>Example website</title>
</head>
<body>
<div id='images'>
<a href='image1.html'>Name: My image 1 <br /><img src='image1_thumb.jpg' /></a>
<a href='image2.html'>Name: My image 2 <br /><img src='image2_thumb.jpg' /></a>
<a href='image3.html'>Name: My image 3 <br /><img src='image3_thumb.jpg' /></a>
<a href='image4.html'>Name: My image 4 <br /><img src='image4_thumb.jpg' /></a>
<a href='image5.html'>Name: My image 5 <br /><img src='image5_thumb.jpg' /></a>
</div>
</body>
</html>
'''
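selector = Selector(html) # initialize the Selector object
'''Use xpath() to get every a tag under the element whose id is images'''
items = selector.xpath('//div[@id="images"]/a')
texts = items.getall() # getall() simply loops over the selectors and calls get() on each, like the comprehensions below (see the source code)
print(texts)
result_text = [item.xpath('./text()').get() for item in items] # get each node's text content
result_href = [item.xpath('./@href').get() for item in items] # get each node's href attribute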
print(type(items))
print(items)
print(result_text)
print(result_href)
# Output
# ['<a href="image1.html">Name: My image 1 <br><img src="image1_thumb.jpg"></a>', '<a href="image2.html">Name: My image 2 <br><img src="image2_thumb.jpg"></a>', '<a href="image3.html">Name: My image 3 <br><img src="image3_thumb.jpg"></a>', '<a href="image4.html">Name: My image 4 <br><img src="image4_thumb.jpg"></a>', '<a href="image5.html">Name: My image 5 <br><img src="image5_thumb.jpg"></a>']
# <class 'parsel.selector.SelectorList'>
# [<Selector xpath='//div[@id="images"]/a' data='<a href="image1.html">Name: My image ...'>, <Selector xpath='//div[@id="images"]/a' data='<a href="image2.html">Name: My image ...'>, <Selector xpath='//div[@id="images"]/a' data='<a href="image3.html">Name: My image ...'>, <Selector xpath='//div[@id="images"]/a' data='<a href="image4.html">Name: My image ...'>, <Selector xpath='//div[@id="images"]/a' data='<a href="image5.html">Name: My image ...'>]
# ['Name: My image 1 ', 'Name: My image 2 ', 'Name: My image 3 ', 'Name: My image 4 ', 'Name: My image 5 ']
# ['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']
'''Use css() to get every a tag under the element whose id is images'''
print('\n')
items = selector.css('#images > a')
result_text = [item.css('::text').get() for item in items] # get each node's text content
result_href = [item.css('::attr(href)').get() for item in items] # get each node's href attribute
print(type(items))
print(items)
print(result_text)
print(result_href)
# Output
# <class 'parsel.selector.SelectorList'>
# [<Selector xpath="descendant-or-self::*[@id = 'images']/a" data='<a href="image1.html">Name: My image ...'>, <Selector xpath="descendant-or-self::*[@id = 'images']/a" data='<a href="image2.html">Name: My image ...'>, <Selector xpath="descendant-or-self::*[@id = 'images']/a" data='<a href="image3.html">Name: My image ...'>, <Selector xpath="descendant-or-self::*[@id = 'images']/a" data='<a href="image4.html">Name: My image ...'>, <Selector xpath="descendant-or-self::*[@id = 'images']/a" data='<a href="image5.html">Name: My image ...'>]
# ['Name: My image 1 ', 'Name: My image 2 ', 'Name: My image 3 ', 'Name: My image 4 ', 'Name: My image 5 ']
# ['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']
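4. Common mixed operations with Parsel
- In day-to-day work, the approach I use most often is to select the larger blocks of data with CSS selectors first, then use XPath to pick apart each item, and finally add a regular expression when only the numeric value is needed.
- Below is an example of the kind of combination I commonly use in practice.
The demo.html file is as follows: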
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>Common Parsel examples demo</title>
</head>
<body>
<div class="J_linkAge_container">
<ul class="product-list">
<li class="product-item" clstag="channel|keycount|09181429|home_product_16239208399">
<a href="//item.jd.com/16239208399.html" target="_blank">
<img class="p-img"
src="//img10.360buyimg.com/n2/jfs/t19429/345/2114569686/193072/6f671ab/5ae74720N203d91fc.jpg!q70.jpg">
<div class="p-name">创晟 泰国进口金枕头榴莲水果 1个2-2.5kg</div>
<div class="p-price">¥148.80</div>
</a>
</li>
<li class="product-item" clstag="channel|keycount|09181429|home_product_4838701">
<a href="//item.jd.com/4838701.html" target="_blank">
<img class="p-img"
src="//img11.360buyimg.com/n2/jfs/t19960/279/1891741367/255783/dcb7e4cc/5b5ad2b6N5cac8691.jpg!q70.jpg">
<div class="p-name">巴拜苏打泉 天然苏打水无气泡弱碱性水 非饮料 饮用水 420ml*12瓶/箱 整箱装</div>
<div class="p-price">¥69.00</div>
</a>
</li>
<li class="product-item" clstag="channel|keycount|09181429|home_product_915074">
<a href="//item.jd.com/915074.html" target="_blank">
<img class="p-img"
src="//img12.360buyimg.com/n2/jfs/t12712/256/798468754/65144/d672c64c/5a13fb39N531b4d69.jpg!q70.jpg">
<div class="p-name">洁云 雅致生活抽纸 200抽软包面巾纸*8包(新老包装交替发货)</div>
<div class="p-price">¥19.90</div>
</a>
</li>
<li class="product-item" clstag="channel|keycount|09181429|home_product_100002327718">
<a href="//item.jd.com/100002327718.html" target="_blank">
<img class="p-img"
src="//img13.360buyimg.com/n2/jfs/t1/21798/13/5448/389197/5c3ee9dbE915ae684/6577134898a39eed.jpg!q70.jpg">
<div class="p-name">康师傅 方便面 劲爽香辣牛肉面 12入桶装泡面【整箱装】</div>
<div class="p-price">¥37.00</div>
</a>
</li>
<li class="product-item" clstag="channel|keycount|09181429|home_product_6281974">
<a href="//item.jd.com/6281974.html" target="_blank">
<img class="p-img"
src="//img14.360buyimg.com/n2/jfs/t1/15821/40/4823/327447/5c36eb8eE77e95f86/31bdd89c6e17868e.jpg!q70.jpg">
<div class="p-name">宏辉果蔬 烟台红富士苹果 5kg 一级铂金果 单果190-240g 新鲜水果</div>
<div class="p-price">¥119.90</div>
</a>
</li>
</ul>
</div>
</body>
</html>
The example program is as follows:
from parsel import Selector
with open('./demo.html', 'r', encoding='utf-8') as f:
    html = f.read()
selector = Selector(html) # initialize the Selector object
'''Extract the product information'''
shop_items = selector.css('.product-list li')
for shop in shop_items:
    shop_name = shop.xpath('.//div[@class="p-name"]/text()').get() # product name
    shop_href = 'http:' + shop.xpath('./a/@href').get() # product detail link (the href is protocol-relative)
    price = shop.xpath('.//div[@class="p-price"]/text()').get() # price text, including the currency symbol
    shop_price = shop.xpath('.//div[@class="p-price"]/text()').re(r'\d+\.\d+')[0] # numeric part of the price
    print(shop_name)
    print(shop_href)
    print(price)
    print(shop_price)
    print() # blank line between products
# Output
# 创晟 泰国进口金枕头榴莲水果 1个2-2.5kg
# http://item.jd.com/16239208399.html
# ¥148.80
# 148.80
#
# 巴拜苏打泉 天然苏打水无气泡弱碱性水 非饮料 饮用水 420ml*12瓶/箱 整箱装
# http://item.jd.com/4838701.html
# ¥69.00
# 69.00
#
# 洁云 雅致生活抽纸 200抽软包面巾纸*8包(新老包装交替发货)
# http://item.jd.com/915074.html
# ¥19.90
# 19.90
#
# 康师傅 方便面 劲爽香辣牛肉面 12入桶装泡面【整箱装】
# http://item.jd.com/100002327718.html
# ¥37.00
# 37.00
#
# 宏辉果蔬 烟台红富士苹果 5kg 一级铂金果 单果190-240g 新鲜水果
# http://item.jd.com/6281974.html
# ¥119.90
# 119.90