HTML parsing with bs4

from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little
sisters; and their names were
<a href="http://example.com/elsie" class="sister"
id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister"
id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister"
id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
#Create a BeautifulSoup object
#If no parser is specified, bs4 falls back to Python's built-in parser,
#so we explicitly pass lxml as the parser
soup=BeautifulSoup(html_doc,"lxml")
----------

 1. The parsed object's type, and pretty-printing

print(type(soup))  #<class 'bs4.BeautifulSoup'>
print(soup.prettify())  #pretty-print; output below:

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little
sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>

 2. Basic operations

#1. Tag-name access. Limitation: only the first matching tag is returned.
#(1) Get the first p tag
print(soup.p) #<p class="title"><b>The Dormouse's story</b></p>
print(soup.a) #<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
#(2) Get the href attribute of the a tag: attributes are exposed as key-value pairs
print(soup.a["href"]) #http://example.com/elsie
#(3) Get the text inside a tag
print(soup.title.string) #The Dormouse's story
print(soup.a.string)#Elsie
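
Since attributes are exposed as key-value pairs, you can also read them all at once through the tag's attrs property; a minimal sketch against the same soup:

#.attrs returns every attribute of the tag as a dict
#(class is multi-valued in HTML, so bs4 stores it as a list)
print(soup.a.attrs) #roughly {'href': 'http://example.com/elsie', 'class': ['sister'], 'id': 'link1'}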


----------
3. Traversing the document tree

#1. Child nodes: contents, children
print(soup.body.contents) #all of body's child nodes as a list, including the newline text nodes
print(soup.body.children) #<list_iterator object at 0x0000000002ACC668>
#children returns a list iterator, so use a for loop to walk through its contents
for i in soup.body.children:
    print(i)

output:
<p class="title"><b>The Dormouse's story</b></p>


<p class="story">Once upon a time there were three little
sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>


<p class="story">...</p>

#2. Parent nodes: parent (the direct parent), parents (all ancestors)

print(soup.a.parent)  #the <p class="story"> shown below
print(soup.a.parents) #a generator object; see the sketch after the output

<p class="story">Once upon a time there were three little
sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
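
Printing soup.a.parents directly only shows a generator object, so here is a small sketch that walks the ancestor chain instead:

#.parents is a generator; iterate it to climb from the <a> up to the document root
for parent in soup.a.parents:
    print(parent.name) #p, body, html, [document]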

#Sibling nodes
next_siblings 
next_sibling
previous_siblings 
previous_sibling
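
A minimal sketch of sibling navigation on the same soup; the whitespace and punctuation between tags count as text siblings, so next_sibling is often a plain string rather than a tag:

first_a=soup.a
print(repr(first_a.next_sibling)) #',\n', the text node right after the first <a>
print(first_a.next_sibling.next_sibling) #<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
#next_siblings/previous_siblings are generators over everything after/before the node
for sib in first_a.next_siblings:
    print(repr(sib))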
----------
4. Searching the document tree

#Searching the document tree
#(1) Find tags by name
res1=soup.find_all("a")
print(res1) #returns all a tags as a list

output:
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
#(2) Regular expressions
#re.compile(): build a regex object
#find tags whose names contain the letter "d"
import re
res2=soup.find_all(re.compile("d+"))
print(res2)

output:
[<head><title>The Dormouse's story</title></head>, <body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little
sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body>]
#(3) Lists
#find all title tags and all a tags
res3=soup.find_all(["title","a"])
print(res3)

output:
[<title>The Dormouse's story</title>, <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
#(4) Keyword arguments
#find tags whose id attribute equals "link2"
res4=soup.find_all(id="link2")
print(res4)

output:
[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

#(5) Matching by text content
res5=soup.find_all(text="Elsie") #newer bs4 versions prefer string="Elsie"; text= still works
print(res5)

output:
['Elsie']

find_all()
find(): returns only the first matching element
find_parents()               find_parent()
find_next_siblings()         find_next_sibling()
find_previous_siblings()     find_previous_sibling()
find_all_next()              find_next()
find_all_previous()          find_previous()
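
A quick sketch of a few of these against the same soup; find() is simply find_all() limited to the first match:

print(soup.find("a")) #the first <a> only, returned as a tag rather than a list
print(soup.find("a",id="link1").find_next_sibling("a")) #<a ... id="link2">Lacie</a>
print(soup.find("b").find_parent("p")) #the <p class="title"> that wraps the <b>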


----------
5. CSS selectors

To use CSS selectors, just call the select() method and pass in the CSS selector string; the return type is list.

CSS syntax:

 - a bare tag name needs no prefix
 - class names are prefixed with a dot
 - id names are prefixed with #
 - "tag a, tag b" : finds all a nodes and all b nodes (see the sketch at the end of this section)
 - "tag a tag b" : finds all b nodes nested under a nodes
 - get_text() : returns the text content inside a node

#(1) Look up a tag by id
res2=soup.select("#link3")
print(res2)
#[<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

#(2) Look up tags by their class attribute
res3=soup.select(".sister")
print(res3)

#(3) Attribute selector
res4=soup.select("a[href='http://example.com/elsie']")
print(res4)

#(4) Descendant selector
res5=soup.select("p a#link3") #the a tag with id link3 under a p tag
print(res5)
output:
[<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

#(5) Getting tag content
res6=soup.select("p a.sister")
print(res6)
#get the text inside the tag
print(res6[0].get_text())
print(res6[0].string)

output:
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, 
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, 
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
Elsie
Elsie
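
The comma group selector from the syntax list above can be sketched the same way against this soup:

#(6) Group selector: "title,a" returns every title tag and every a tag
res7=soup.select("title,a")
print(res7)
#same list as the find_all(["title","a"]) example in the previous section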

HTML parsing with lxml and XPath. Install command: pip install lxml

Common XPath rules:

 - nodename : selects all children of the named node
 - /        : selects a direct child from the current node
 - //       : selects descendant nodes from the current node
 - .        : selects the current node
 - ..       : selects the parent of the current node
 - @        : selects attributes


Most of lxml's node handling lives in the etree module.

from lxml import etree

html_DOC="""
<html>
    <head>
       <title>春晚</title>
    </head>
    <body>
       <h1 name="title">个人简介</h1>
       <div name="desc">
           <p name="name">姓名:<span>小岳岳</span></p>
           <p name="addr">住址:中国 河南</p>
           <p name="info">代表作:五环之歌</p>
        </div>
"""

#Initialize
html=etree.HTML(html_DOC)
#print(type(html))  #<class 'lxml.etree._Element'> the fetched page must be converted to an etree object before xpath can parse it
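
etree.HTML() also repairs incomplete markup (the sample above is missing its closing </body></html> tags); one way to inspect the repaired tree, assuming the same html object:

#etree.tostring serializes the auto-completed tree back to markup
print(etree.tostring(html,pretty_print=True).decode())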

#(1) Find all p tags under the div tag
print(html.xpath("//div/p"))
#the three p tags come back as a list: [<Element p at 0x28b3ac8>, <Element p at 0x28b3a88>, <Element p at 0x28b3b88>]

#(2) Get the name attribute value of the second p tag
print(html.xpath("//div/p[2]/@name"))
#attributes must be referenced with @: ['addr']

#(3) Get the content of the span tag inside the p tags under div
print(html.xpath("//div/p/span")[0].text)
#"//div/p/span" returns [<Element span at 0x28acb08>]; take the first Element from the list and read its .text: 小岳岳

#(4) Get the name attribute value of the h1 tag under body
print(html.xpath("//body/h1/@name"))
#['title']

#(5) Find all tags that have a name attribute whose value is "desc"
print(html.xpath("//*[@name='desc']"))
#* matches any tag
#the condition name='desc' is a predicate and goes inside []
#attribute values are referenced with @
#[<Element div at 0x288db48>]

#(6) Get all text data inside the first p tag
#string(): returns all text, including text inside descendant tags
#.text: returns only the tag's own text, excluding descendants
print(html.xpath('string(//p[1])'))
print(html.xpath("//p[1]")[0].text) #only the p tag's own leading text: 姓名: