The Art of HTML Parsing | The Awesome Beautiful Soup!

Reposted

wx60dacb4325b51 2021-07-05 14:12:43

Tags: html. Category: Html/CSS, front-end development


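1. Foreword

Today I'd like to walk you through Beautiful Soup, a powerful HTML parsing library that handles HTML without breaking a sweat. How powerful is it? Let's take it step by step!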


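2. What is Beautiful Soup?

“Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.”

That is the official description, of course. As I see it, Beautiful Soup is a library that helps us get data out of a web page's HTML: it parses the HTML and hands the parsed data back to us. Compared with regular expressions, it is usually simpler and easier to use.

Beautiful Soup actually comes in two versions. This article covers version 4; the other is version 3. Why not cover 3? The official documentation says: “Beautiful Soup 3 is no longer being developed, and we recommend Beautiful Soup 4 for all current projects; port your code to BS4.” Since development has stopped, there is little reason to learn version 3.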

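3. Installing Beautiful Soup

If you are using a recent version of Debian or Ubuntu, you can install Beautiful Soup through the system package manager:

$ apt-get install python3-bs4

Beautiful Soup 4 is published through PyPI, so if you cannot install it with the system package manager, you can install it with easy_install or pip. The package name is beautifulsoup4. The package was built to work on both Python 2 and Python 3, although releases after 4.9.3 support Python 3 only.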

$ easy_install beautifulsoup4	
$ pip install beautifulsoup4

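(PyPI also has a package named BeautifulSoup, but that is probably not what you want: it is the Beautiful Soup 3 release. Many projects still use BS3, so the BeautifulSoup package remains available. For a new project, though, you should install beautifulsoup4.)

If you have neither easy_install nor pip, you can download the BS4 source and install it with setup.py:

$ python setup.py install

If none of the above works, Beautiful Soup's license allows you to bundle the BS4 code with your own project, so you can use it without installing anything.

The author developed Beautiful Soup under Python 2.7 and Python 3.2; in theory it should work on all current Python versions (quoted from the official documentation).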
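Whichever installation route you take, a quick sanity check is to import the package and parse a tiny string. This is only a minimal sketch: the version printed depends on your environment, and the markup is made up for the check.

```python
# Minimal install check: import bs4 and parse a one-line document.
import bs4
from bs4 import BeautifulSoup

print(bs4.__version__)  # whatever version is installed on your machine

soup = BeautifulSoup("<p>hello</p>", "html.parser")
print(soup.p.string)  # hello
```

If the import fails, the package was not installed into the interpreter you are running.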

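With Beautiful Soup installed, we may also want to install a parser:

Beautiful Soup supports the HTML parser in Python's standard library as well as several third-party parsers. One of them is lxml. Depending on your operating system, you can install lxml with one of the following:

$ apt-get install python3-lxml
$ easy_install lxml
$ pip install lxml

Another option is html5lib, a pure-Python parser that parses pages the same way a web browser does. You can install html5lib with one of the following:

$ apt-get install python3-html5lib
$ easy_install html5lib
$ pip install html5lib

lxml is the recommended parser because it is faster. On Python 2 versions before 2.7.3 and Python 3 versions before 3.2.2, you must install lxml or html5lib, because the HTML parser built into the standard library of those versions is not reliable enough.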
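The parser is chosen by the second argument to BeautifulSoup. One way to prefer lxml but still run where it is missing is sketched below. The helper name make_soup is hypothetical (it is not part of Beautiful Soup); FeatureNotFound is the exception bs4 raises when the requested parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup):
    """Parse with lxml when available, otherwise with the built-in parser.

    make_soup is an illustrative helper, not a Beautiful Soup API.
    """
    try:
        return BeautifulSoup(markup, "lxml")
    except FeatureNotFound:
        # lxml is not installed; fall back to the standard-library parser.
        return BeautifulSoup(markup, "html.parser")


soup = make_soup("<p>Hello, <b>soup</b></p>")
print(soup.b.string)  # soup
```

Keep in mind that different parsers can build slightly different trees from invalid markup, so it is usually worth settling on one parser per project.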

4. Hands-on practice

With Beautiful Soup installed, let's give it a quick try!

Quick start

First we import the class with from bs4 import BeautifulSoup, then we define a string that contains HTML source.

html_doc = """	
<html><head><title>The Dormouse's story</title></head>	
<body>	
<p class="title"><b>The Dormouse's story</b></p>	
<p class="story">Once upon a time there were three little sisters; and their names were	
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,	
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and	
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;	
and they lived at the bottom of a well.</p>	
<p class="story">...</p>	
"""

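Everything that follows operates on this string. Parsing it with BeautifulSoup gives us a BeautifulSoup object, which can print the document as a conventionally indented structure:

from bs4 import BeautifulSoup

# Parse the HTML and return a BeautifulSoup object
soup = BeautifulSoup(html_doc, "html.parser")
# Print the document with standard indentation
print(soup.prettify())

Output: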

<html>	
 <head>	
  <title>	
   The Dormouse's story	
  </title>	
 </head>	
 <body>	
  <p class="title">	
   <b>	
    The Dormouse's story	
   </b>	
  </p>	
  <p class="story">	
   Once upon a time there were three little sisters; and their names were	
   <a class="sister" href="http://example.com/elsie" id="link1">	
    Elsie	
   </a>	
   ,	
   <a class="sister" href="http://example.com/lacie" id="link2">	
    Lacie	
   </a>	
   and	
   <a class="sister" href="http://example.com/tillie" id="link3">	
    Tillie	
   </a>	
   ;	
   and they lived at the bottom of a well.
  </p>	
  <p class="story">	
   ...	
  </p>	
 </body>	
</html>

Next, here are a few common ways to navigate this structured data:

print(soup.title)	
print(soup.title.name)	
print(soup.title.string)	
print(soup.title.parent.name)	
print(soup.p)	
print(soup.p['class'])	
print(soup.a)	
# find_all returns a list of every match
print(soup.find_all('a'))	
print(soup.find(id="link3"))

Output:

<title>The Dormouse's story</title>	
title	
The Dormouse's story	
head	
<p class="title"><b>The Dormouse's story</b></p>	
['title']	
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>	
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]	
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
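A very common task built from these calls is collecting every link on a page. The sketch below uses only the sisters paragraph from the document above so that it runs on its own; the variable names links and names are just for illustration.

```python
from bs4 import BeautifulSoup

html_doc = """
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# Read the href attribute and the visible text of every <a> tag.
links = [a.get("href") for a in soup.find_all("a")]
names = [a.get_text() for a in soup.find_all("a")]

print(links)
# ['http://example.com/elsie', 'http://example.com/lacie', 'http://example.com/tillie']
print(names)
# ['Elsie', 'Lacie', 'Tillie']
```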

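1. Tag objects

Beautiful Soup transforms a complex HTML document into a complex tree structure in which every node is a Python object. All of these objects fall into four kinds:

Tag, NavigableString, BeautifulSoup, and Comment.

Let's start with the Tag object. A Tag object corresponds to a tag in the original XML or HTML document; it is simply a piece of markup. For example: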

<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>

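The a element above, together with its contents, is a Tag object. How do we extract such objects? The quick start above already showed it: expressions such as soup.title and soup.a return Tag objects. Every Tag object has a name, which you can read through .name.

You can not only read a Tag's name but also change it. For example:

# Rename title to mytitle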
soup.title.name="mytitle"	
print(soup.title)	
print(soup.mytitle)

Output:

None	
<mytitle>The Dormouse's story</mytitle>

Now for a Tag's attributes. Look at this snippet:

<p class="title"><b>The Dormouse's story</b></p>

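This is a fragment of the HTML above. It has a class attribute whose value is title. A Tag's attributes are handled the same way as a dictionary. Because class can hold several values, Beautiful Soup returns it as a list: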

print(soup.p['class'])	
print(soup.p.get('class'))

Output:

['title']	
['title']

We can also use dot notation: .attrs returns all of a Tag's attributes as a dictionary:

print(soup.p.attrs)

Output:

{'class': ['title']}
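Since attributes behave like a dictionary, they can be added, replaced, and deleted as well as read. Here is a small sketch on the same p tag; the id value intro and the extra class big are made up for the example.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("""<p class="title"><b>The Dormouse's story</b></p>""", "html.parser")
p = soup.p

p["id"] = "intro"              # add a new attribute
p["class"] = ["title", "big"]  # replace a multi-valued attribute
print(p.attrs)
# {'class': ['title', 'big'], 'id': 'intro'}

del p["class"]                 # remove an attribute
print(p.get("class"))          # None, because .get() does not raise on a missing key
print(p)
# <p id="intro"><b>The Dormouse's story</b></p>
```

As with a plain dict, p["class"] would now raise KeyError, while p.get("class") returns None.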

2. NavigableString

Sometimes we need the text inside a tag. How do we get it? This is where .string comes in. Here's the code:

print(soup.p.string)

Output:

The Dormouse's story

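Beautiful Soup uses the NavigableString class to wrap the strings inside a Tag. A NavigableString behaves just like a Unicode string, and you can convert one to a plain Unicode string directly with str() (unicode() on Python 2).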
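The conversion can be sketched as follows, on Python 3. The second paragraph in the markup is made up to show one thing worth knowing: .string is None when a tag has more than one child, and get_text() is the usual way to collect all the text in that case.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

soup = BeautifulSoup(
    "<p><b>The Dormouse's story</b></p><p>one <i>two</i></p>",
    "html.parser",
)

text = soup.p.string
print(isinstance(text, NavigableString))  # True

plain = str(text)          # a plain str with no link back to the parse tree
print(type(plain) is str)  # True

second = soup.find_all("p")[1]
print(second.string)       # None: this <p> has two children
print(second.get_text())   # one two
```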

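3. Searching the document tree

Beautiful Soup defines many search methods. The most commonly used is find_all(), so we will use it as our example; the other methods work similarly, and you can generalize from this one.

Let's look at the function's signature in the source code: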

find_all(self, name=None, attrs={}, recursive=True, text=None,	
                 limit=None, **kwargs)

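name: finds all tags whose name is name; text strings are automatically ignored. The name argument can be a string, a regular expression, a list, True, or a function.

For example: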

a = soup.find_all("a")	
print(a)

Output:

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, 	
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, 	
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

As you can see, it returns a list.
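The example above passes a string as name. The other accepted kinds of value (regular expression, list, True, function) can be sketched like this against the same document; has_class_but_no_id is just an illustrative filter function.

```python
import re
from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# Regular expression: tags whose name starts with "b".
by_regex = [t.name for t in soup.find_all(re.compile("^b"))]
print(by_regex)  # ['body', 'b']

# List: tags whose name is any of the listed names.
by_list = [t.name for t in soup.find_all(["a", "b"])]
print(by_list)  # ['b', 'a', 'a', 'a']

# True: every tag, but no text strings.
all_tags = [t.name for t in soup.find_all(True)]
print(len(all_tags))  # 11


# Function: called with each tag, keeps the tag when it returns True.
def has_class_but_no_id(tag):
    return tag.has_attr("class") and not tag.has_attr("id")


by_func = [t.name for t in soup.find_all(has_class_but_no_id)]
print(by_func)  # ['p', 'p', 'p']
```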

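kwargs: in Python, kwargs stands for keyword arguments. If a keyword argument's name is not one of the search parameter names, it is treated as a filter on the tag attribute with that name. The value of such an attribute filter can be a string, a regular expression, a list, or True.

For example:

a = soup.find_all(id='link2')
print(a)

Output: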

[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

Here is another example:

a = soup.find_all(id=True)
print(a)

Output:

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

As you can see, when the value is True, it returns every Tag object that has that attribute.
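The string and True cases are covered above. A regular expression or a list as the attribute value might look like the following sketch, again using only the sisters paragraph so that it runs on its own.

```python
import re
from bs4 import BeautifulSoup

html_doc = """
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# Regular expression: href values that contain "elsie".
by_regex = soup.find_all(href=re.compile("elsie"))
print([a["id"] for a in by_regex])  # ['link1']

# List: id equal to any of the listed values.
by_list = soup.find_all(id=["link1", "link3"])
print([a["id"] for a in by_list])  # ['link1', 'link3']
```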

One more example:

a = soup.find_all("a", class_="sister")	
print(a)

Output:

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

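This one should be easy to follow: it finds the a tags whose class attribute is sister. Note that class must be followed by an underscore (class_), because class is a reserved word in Python!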
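The attrs parameter from the signature gives another way to write the same filter, and it also handles attribute names that cannot be written as Python keywords at all, such as data-* attributes. A small sketch follows; the markup and the data-role attribute are made up for illustration.

```python
from bs4 import BeautifulSoup

markup = (
    '<a class="sister big" data-role="link" href="#">Elsie</a>'
    '<a class="brother" href="#">Tom</a>'
)
soup = BeautifulSoup(markup, "html.parser")

# Keyword form and attrs-dictionary form are equivalent for class.
by_kw = soup.find_all("a", class_="sister")
by_attrs = soup.find_all("a", attrs={"class": "sister"})
print(by_kw == by_attrs)  # True

# "data-role" is not a valid keyword argument name, so it must go through attrs.
by_data = soup.find_all(attrs={"data-role": "link"})
print([a.get_text() for a in by_data])  # ['Elsie']
```

Note that a class filter matches a tag if any one of its classes matches, which is why the tag with class="sister big" is found.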

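text: with the text argument we can search for string content in the document. It accepts the same kinds of values as the name argument. (Since Beautiful Soup 4.4.0 this argument is named string; text is the older name.)

For example: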

a = soup.find_all("a", text="Lacie")	
print(a)

Output:

[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

As you can see, the text argument combines nicely with the other arguments!
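When the string filter is used on its own, without a tag name, find_all() returns the matching strings themselves rather than tags. A minimal sketch, assuming Beautiful Soup 4.4.0 or later so that the argument can be spelled string:

```python
import re
from bs4 import BeautifulSoup

html_doc = """
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
"""
soup = BeautifulSoup(html_doc, "html.parser")

exact = soup.find_all(string="Elsie")             # exact match
pattern = soup.find_all(string=re.compile("^L"))  # strings starting with "L"
several = soup.find_all(string=["Tillie", "Elsie"])  # any of the listed strings

print(exact)    # ['Elsie']
print(pattern)  # ['Lacie']
print(several)  # ['Elsie', 'Tillie'] (document order, not list order)
```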

limit: the limit argument caps the number of results returned. It works just like LIMIT in SQL.
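For completeness, limit might be used like this. The markup is a shortened version of the sisters paragraph; find() is shown alongside because it returns the first match directly instead of a list.

```python
from bs4 import BeautifulSoup

html_doc = """
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# Stop searching once two matches have been found.
first_two = soup.find_all("a", limit=2)
print([a["id"] for a in first_two])  # ['link1', 'link2']

# find() returns the first match itself (or None), not a list.
first = soup.find("a")
print(first["id"])  # link1
```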

recursive: when you call find_all() on a tag, Beautiful Soup examines all of that tag's descendants. If you only want to search the tag's direct children, pass recursive=False.
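The difference is easiest to see on a tag that is nested two levels deep. In this sketch the title tag sits inside head, so it is a descendant of html but not a direct child; the one-line document is a trimmed version of the example above.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<html><head><title>The Dormouse's story</title></head>"
    "<body><p>text</p></body></html>",
    "html.parser",
)

# Default: search all descendants of <html>.
deep = soup.html.find_all("title")
print(deep)  # [<title>The Dormouse's story</title>]

# recursive=False: only direct children of <html> are considered.
shallow = soup.html.find_all("title", recursive=False)
print(shallow)  # []

# The direct children of <html> are <head> and <body>.
children = [t.name for t in soup.html.find_all(True, recursive=False)]
print(children)  # ['head', 'body']
```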
