Python and HTML5 interaction: python html5lib

Reposted

mob6454cc68959c 2024-08-30 22:53:44


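Beautiful Soup, usually referred to as the bs4 library, supports Python 3 and is an excellent third-party library for writing web scrapers. It is simple and smooth to use, and in Chinese it is also known as “美味汤” (“delicious soup”). This article covers the library's most basic usage.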

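Installing Beautiful Soup

Beautiful Soup 4 is published on PyPI, so if you cannot install it through your system package manager, you can install it with easy_install or pip instead. The package is named beautifulsoup4. Older releases of this package work with both Python 2 and Python 3; current releases support Python 3 only.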

$ easy_install beautifulsoup4

$ pip install beautifulsoup4

If you have neither easy_install nor pip, you can also download the BS4 source and install it with setup.py.

$ python setup.py install
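
Whichever installation method you use, a quick import is a reasonable way to confirm that the package is visible to your interpreter. The sketch below assumes only that beautifulsoup4 is installed; it reads the version string that the bs4 package exposes as `__version__`.

```python
# Confirm that Beautiful Soup is importable and report its version.
import bs4

installed_version = bs4.__version__
print("Beautiful Soup version:", installed_version)
```

If the import fails with ModuleNotFoundError, the package was most likely installed for a different Python interpreter.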

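Installing a parser

Beautiful Soup supports the HTML parser in the Python standard library as well as several third-party parsers, one of which is lxml. Depending on your operating system, you can install lxml with one of the following:

$ apt-get install python-lxml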

$ easy_install lxml

$ pip install lxml

Another option is html5lib, a pure-Python parser that parses HTML the same way a web browser does. You can install html5lib with one of the following:

$ apt-get install python-html5lib

$ easy_install html5lib

$ pip install html5lib

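The table below lists the main parsers along with their advantages and disadvantages:

Parser	Usage	Advantages	Disadvantages
Python standard library	`BeautifulSoup(markup, "html.parser")`	Built into Python; moderate speed; lenient with malformed documents	Less lenient in versions before Python 2.7.3 or 3.2.2
lxml HTML parser	`BeautifulSoup(markup, "lxml")`	Very fast; lenient with malformed documents	Requires a C library
lxml XML parser	`BeautifulSoup(markup, "lxml-xml")` `BeautifulSoup(markup, "xml")`	Very fast; the only XML parser supported	Requires a C library
html5lib	`BeautifulSoup(markup, "html5lib")`	Most lenient; parses pages the same way a web browser does; produces valid HTML5	Slow; requires an external Python package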

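lxml is the recommended parser because it is faster. On Python 2 versions before 2.7.3 and Python 3 versions before 3.2.2, you must install lxml or html5lib, because the HTML parser built into the standard library of those versions is not reliable enough.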
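
One way to act on this recommendation without making lxml a hard requirement is to try parsers in order of preference. The sketch below is only an illustration: `make_soup` is a hypothetical helper, not part of bs4. It relies on the `FeatureNotFound` exception that bs4 raises when a requested parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html5lib", "html.parser")):
    """Hypothetical helper: parse with the first parser that is installed."""
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser), parser
        except FeatureNotFound:
            # This parser is not installed; try the next one.
            continue
    raise RuntimeError("no usable parser found")


soup, parser_used = make_soup("<p>Hello</p>")
print(parser_used)    # which parser was actually picked on this machine
print(soup.p.string)  # Hello
```

Because html.parser ships with Python, the loop always has something to fall back on.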

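Note: if an HTML or XML document is malformed, different parsers may return different results. See the section on differences between parsers in the official documentation for details.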
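
To see this effect on your own machine, you can feed the same broken fragment to each parser and compare the results. The following is a minimal sketch: the fragment `<a></p>` is an arbitrary example, and lxml and html5lib are treated as optional, so parsers that are not installed are simply reported as missing.

```python
from bs4 import BeautifulSoup, FeatureNotFound

malformed = "<a></p>"  # a stray closing tag with no matching <p>

results = {}
for parser in ("html.parser", "lxml", "html5lib"):
    try:
        results[parser] = str(BeautifulSoup(malformed, parser))
    except FeatureNotFound:
        results[parser] = None  # this parser is not installed

for parser, rendered in results.items():
    print(parser, "->", rendered if rendered is not None else "(not installed)")

# html.parser drops the stray </p>, so results["html.parser"] is "<a></a>".
```

The other parsers repair the fragment differently, for example by wrapping it in html and body tags, which is why the choice of parser matters for badly formed pages.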

Let us first look at basic bs4 usage without worrying about how to fetch pages from the web. Suppose the HTML we want to scrape is the following:

html = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
</html>"""

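# Import BeautifulSoup from the bs4 package
from bs4 import BeautifulSoup

# Parse the document with the standard library's HTML parser
soup = BeautifulSoup(html, 'html.parser')

# Print the parsed document, indented with one tag per line
print(soup.prettify())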

Output:

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>

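As you can see, bs4 has turned the page into a soup object. In fact, bs4 is a library for parsing, traversing, and maintaining a "tag tree".

Put simply, bs4 parses the HTML source into a structured tree, which makes it easy to work with the nodes, tags, and attributes it contains.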

# The document's <title> tag
print(soup.title)
# <title>The Dormouse's story</title>

# The name of the title tag
print(soup.title.name)
# title

# The string inside the title tag
print(soup.title.string)
# The Dormouse's story

# The name of the title tag's parent
print(soup.title.parent.name)
# head

# The first <p> tag in the document
print(soup.p)
# <p class="title"><b>The Dormouse's story</b></p>

# That <p> tag's class attribute (class can hold several values, so a list is returned)
print(soup.p['class'])
# ['title']

# The first <a> tag
print(soup.a)
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

# All <a> tags
print(soup.find_all('a'))
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

# The tag whose id is "link3"
print(soup.find(id="link3"))
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

Extract the URL of every <a> tag in the document:

for link in soup.find_all('a'):
    print(link.get('href'))
# http://example.com/elsie
# http://example.com/lacie
# http://example.com/tillie
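
A common next step is to keep each link's text together with its URL. The sketch below uses its own small fragment, with one link deliberately left without an href (an assumption made for illustration), to show that `.get('href')` returns None for a missing attribute, whereas indexing with `['href']` would raise a KeyError.

```python
from bs4 import BeautifulSoup

fragment = """<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a class="sister" id="link3">Tillie</a>
</p>"""

soup = BeautifulSoup(fragment, "html.parser")

# Pair each link's visible text with its href; a missing href becomes None.
links = [(a.get_text(), a.get("href")) for a in soup.find_all("a")]
print(links)
# [('Elsie', 'http://example.com/elsie'), ('Lacie', 'http://example.com/lacie'), ('Tillie', None)]
```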

Extract all the text from the document:

print(soup.get_text())

The Dormouse's story

The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie,
Lacie and
Tillie;
and they lived at the bottom of a well.
...

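Kinds of objects

Beautiful Soup converts a complex HTML document into a complex tree structure in which every node is a Python object. All of these objects fall into four kinds: Tag, NavigableString, BeautifulSoup, and Comment.

Tag: essentially the same as a tag in the HTML document, and easy to pick up and use.
NavigableString: a string wrapped inside a tag.
BeautifulSoup: represents the whole document. Most of the time you can treat it as a Tag object; it supports the methods for navigating and searching the tree.
Comment: a special kind of NavigableString. It holds an HTML comment and is written back out in comment syntax (<!-- ... -->).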
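
The four kinds can be seen side by side in a short sketch. The markup here is a made-up fragment chosen so that each kind of object appears once.

```python
from bs4 import BeautifulSoup, Comment, NavigableString, Tag

markup = "<b class='bold'>bold text</b><!-- a comment -->"
soup = BeautifulSoup(markup, "html.parser")

tag = soup.b                # the <b> element
text = tag.string           # the string inside <b>
comment = soup.contents[1]  # the comment that follows <b>

print(type(soup).__name__)     # BeautifulSoup
print(type(tag).__name__)      # Tag
print(type(text).__name__)     # NavigableString
print(type(comment).__name__)  # Comment

# Comment is a subclass of NavigableString.
print(isinstance(comment, NavigableString))  # True
```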

The simplest way to navigate the tree is to use the name of the tag you want:

soup.head
# <head><title>The Dormouse's story</title></head>

soup.title
# <title>The Dormouse's story</title>

You can chain names to reach a tag nested further down. For example, to get the part wrapped in a <b> tag under <body>:

soup.body.b
# <b>The Dormouse's story</b>

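Accessing a tag by name returns only the first match. To get every tag with a given name, use the find_all() method, which returns a list-like ResultSet.

tags = soup.find_all('a')  # a list-like ResultSet
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

# Suppose we want the second <a> tag:
need = tags[1]
# Simple, right?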
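
It helps to know what these calls return when nothing matches. The sketch below, using a made-up list fragment, contrasts find_all() with find() and shows the `limit` argument; treat it as an illustration rather than a complete tour of the search API.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<ul><li>a</li><li>b</li><li>c</li></ul>", "html.parser")

items = soup.find_all("li")
print(len(items))       # 3
print(items[1].string)  # b

# limit stops the search after the given number of matches.
first_two = soup.find_all("li", limit=2)
print(first_two)        # [<li>a</li>, <li>b</li>]

# With no match, find() returns None and find_all() returns an empty list.
missing_one = soup.find("table")
missing_all = soup.find_all("table")
print(missing_one)      # None
print(missing_all)      # []
```

Checking for None before using the result of find() avoids an AttributeError on pages that lack the expected tag.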

Child nodes

A tag's .contents attribute returns its children as a list:

head_tag = soup.head
head_tag
# <head><title>The Dormouse's story</title></head>

head_tag.contents
# [<title>The Dormouse's story</title>]

title_tag = head_tag.contents[0]
print(title_tag)
# <title>The Dormouse's story</title>

title_tag.contents
# ["The Dormouse's story"]

You can also loop over a tag's children with its .children iterator:

for child in title_tag.children:
    print(child)
# The Dormouse's story

This only iterates over direct children. How do we iterate over all descendants?

Descendants: the <head> tag has one child, <title>The Dormouse's story</title>, and <title> itself has a child, the string "The Dormouse's story". That string is therefore also a descendant of <head>.

for child in head_tag.descendants:
    print(child)
# <title>The Dormouse's story</title>
# The Dormouse's story
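
The difference between children and descendants is easier to see with counts. This sketch uses a small made-up fragment: the <div> has two direct children, but six descendants once nested tags and strings are included.

```python
from bs4 import BeautifulSoup, Tag

soup = BeautifulSoup("<div><p>one <b>two</b></p><p>three</p></div>", "html.parser")
div = soup.div

children = list(div.children)
descendants = list(div.descendants)

print(len(children))     # 2  (the two <p> tags)
print(len(descendants))  # 6  (<p>, 'one ', <b>, 'two', <p>, 'three')

# Descendants include both tags and strings; keep only the tags.
descendant_tags = [node.name for node in descendants if isinstance(node, Tag)]
print(descendant_tags)   # ['p', 'b', 'p']
```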

Getting all the text under a tag

If a tag has only one child and that child is a NavigableString, tag.string returns it directly.
If a tag has many children and descendants, each containing a string, we can iterate over all of them with .strings:

for string in soup.strings:
    print(repr(string))
# Output (whitespace-only strings may differ slightly between parsers):
# "The Dormouse's story"
# '\n\n'
# "The Dormouse's story"
# '\n\n'
# 'Once upon a time there were three little sisters; and their names were\n'
# 'Elsie'
# ',\n'
# 'Lacie'
# ' and\n'
# 'Tillie'
# ';\nand they lived at the bottom of a well.'
# '\n\n'
# '...'
# '\n'

These strings can contain a lot of extra spaces and blank lines. Use .stripped_strings to remove them: strings made up entirely of whitespace are skipped, and leading and trailing whitespace is stripped from each remaining string.

for string in soup.stripped_strings:
    print(repr(string))
# "The Dormouse's story"
# "The Dormouse's story"
# 'Once upon a time there were three little sisters; and their names were'
# 'Elsie'
# ','
# 'Lacie'
# 'and'
# 'Tillie'
# ';\nand they lived at the bottom of a well.'
# '...'

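Parent nodes

Continuing with the tree: every tag and every string has a parent, namely the tag that contains it.

.parent

Use the .parent attribute to get an element's parent. In the example document, the <head> tag is the parent of the <title> tag:

title_tag = soup.title
title_tag
# <title>The Dormouse's story</title>
title_tag.parent
# <head><title>The Dormouse's story</title></head>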

The title string itself also has a parent, the <title> tag that contains it:

title_tag.string.parent
# <title>The Dormouse's story</title>

The parent of a top-level tag such as <html> is the BeautifulSoup object itself:

html_tag = soup.html
type(html_tag.parent)
# <class 'bs4.BeautifulSoup'>

The .parent of a BeautifulSoup object is None:

print(soup.parent)
# None

.parents

The .parents attribute lets you iterate over all of an element's ancestors. The example below uses .parents to walk from an <a> tag up to the root of the document.

link = soup.a
link
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

# .parents stops at the BeautifulSoup object, so it never yields None
for parent in link.parents:
    print(parent.name)
# p
# body
# html
# [document]
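
Because .parents yields every ancestor in order, it can be used to build a readable path from the root down to a tag. The sketch below uses a small made-up document; the path format is just one possible choice.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<html><body><p><a href='#'>x</a></p></body></html>", "html.parser"
)

# Ancestors are yielded from the nearest parent up to the document root.
ancestors = [parent.name for parent in soup.a.parents]
print(ancestors)  # ['p', 'body', 'html', '[document]']

# Reverse the list to read the path from the root down.
path = " > ".join(reversed(ancestors))
print(path)       # [document] > html > body > p
```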

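Sibling nodes, and moving backward and forward through the document, work much like child and parent nodes; see the official documentation for details.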
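
As a brief taste of sibling navigation, here is a minimal sketch using the `.next_sibling`, `.previous_sibling`, and `.next_element` attributes on a made-up fragment in which <b> and <c> share the same parent.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<a><b>text1</b><c>text2</c></a>", "html.parser")

after_b = soup.b.next_sibling       # the node that follows <b> under <a>
before_c = soup.c.previous_sibling  # the node that precedes <c> under <a>
before_b = soup.b.previous_sibling  # <b> is the first child, so nothing precedes it
inside_b = soup.b.next_element      # the next node in parse order: the string in <b>

print(after_b)   # <c>text2</c>
print(before_c)  # <b>text1</b>
print(before_b)  # None
print(inside_b)  # text1
```

Siblings stay on the same level of the tree, while .next_element follows the order in which the document was parsed, so it can step into a tag's contents.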
