BeautifulSoup Operations

Reposted

mb630ec035bcfe8 2022-08-31 10:20:37 Blog category: Python and its applications

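Having introduced PyQuery earlier, let's now turn to BeautifulSoup. Beautiful Soup is a third-party web page parsing library for Python (it is not part of the standard library), and its name means exactly what it says: a beautiful soup. Heh, at times it really does live up to the "beautiful".
First, a short introduction:
Beautiful Soup is an HTML/XML parser for Python, designed for quick-turnaround projects such as screen scraping. Three features make it powerful: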

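Beautiful Soup won't choke even if you give it broken markup. It yields a parse tree that makes approximately as much sense as your original document, which is usually good enough to collect the data you need.
Beautiful Soup provides a few simple methods and Pythonic idioms for navigating, searching, and modifying a parse tree: a toolkit for dissecting a document and extracting what you need. You don't have to write a custom parser for each application.
Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8. You don't have to think about encodings unless the document doesn't specify one and Beautiful Soup can't autodetect it; in that case you just specify the original encoding yourself.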
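
To make the first and third points concrete, here is a minimal sketch. It assumes the newer Beautiful Soup 4 package (`bs4`) rather than the version 3 module used in the rest of this article, and the sample markup and the `latin-1` encoding are made up for illustration. How exactly a broken tree gets repaired depends on the parser backend, so the sketch only checks things that do not depend on it.

```python
from bs4 import BeautifulSoup  # assumption: Beautiful Soup 4, not the BS3 module used below

# Hypothetical broken markup: unclosed <p> and <b> tags, no closing </body>.
broken = "<html><head><title>Page title</title></head><body><p>one<p>two <b>bold</html>"

# "html.parser" is the standard-library backend; other backends may shape the tree differently.
soup = BeautifulSoup(broken, "html.parser")

title_text = soup.title.string               # text inside <title>
paragraph_count = len(soup.find_all("p"))    # both <p> tags are found despite missing end tags

# Bytes in a known encoding are decoded to Unicode on the way in...
raw_bytes = "<p>caf\u00e9</p>".encode("latin-1")
byte_soup = BeautifulSoup(raw_bytes, "html.parser", from_encoding="latin-1")
decoded_text = byte_soup.p.string

# ...and can be encoded as UTF-8 on the way out.
utf8_output = byte_soup.encode("utf-8")

print(title_text, paragraph_count, decoded_text, utf8_output)
```

Passing `from_encoding` corresponds to the "specify the encoding yourself" case; when it is omitted, the library tries to detect the encoding on its own.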

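Beautiful Soup parses anything you give it and does the tree traversal for you. You can tell it "find all the links", or "find all the links whose class is externalLink", or "find all the links whose URLs match foo.com", or even "find the table heading that has bold text, then give me that text".
Valuable data that was once locked up in poorly designed websites is now within your reach. Work that would have taken hours takes only minutes with Beautiful Soup.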
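
The four requests quoted above might look like this in code. This is only a sketch under assumptions: it uses Beautiful Soup 4 (`bs4`) with the standard-library `html.parser` backend, and the sample page with its URLs is invented for the example.

```python
import re
from bs4 import BeautifulSoup  # assumption: Beautiful Soup 4

# Hypothetical page used only for this illustration.
html = """
<html><body>
<a href="http://foo.com/a" class="externalLink">Foo</a>
<a href="/local">Local</a>
<a href="http://bar.org/">Bar</a>
<table><tr><th>Plain</th><th><b>Price</b></th></tr></table>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

# "Find all the links"
all_links = [a["href"] for a in soup.find_all("a")]

# "Find all the links whose class is externalLink"
external_links = [a["href"] for a in soup.find_all("a", class_="externalLink")]

# "Find all the links whose URLs match foo.com" (the regex is searched within href)
foo_links = [a["href"] for a in soup.find_all("a", href=re.compile(r"foo\.com"))]

# "Find the table heading that has bold text, then give me that text"
bold_heading = next(th.b.string for th in soup.find_all("th") if th.b is not None)

print(all_links, external_links, foo_links, bold_heading)
```
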
Now let's get started quickly.
First, import the package:

from BeautifulSoup import BeautifulSoup          # For processing HTML
from BeautifulSoup import BeautifulStoneSoup     # For processing XML
import BeautifulSoup                             # To get everything
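
The imports above belong to Beautiful Soup 3, which runs on Python 2. For readers on Python 3, a rough equivalent with Beautiful Soup 4 (installed as `beautifulsoup4`, imported as `bs4`) could look like the sketch below; the one-line sample document is made up for illustration.

```python
# Beautiful Soup 4 is imported from the "bs4" package (pip install beautifulsoup4).
from bs4 import BeautifulSoup

# For HTML, name the parser explicitly; "html.parser" ships with Python.
html_soup = BeautifulSoup("<p>Hello</p>", "html.parser")
print(html_soup.p.string)

# For XML, version 4 has no separate BeautifulStoneSoup class; it expects
# BeautifulSoup(markup, "xml"), which needs the third-party lxml package.
```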

The following code demonstrates the basic usage of Beautiful Soup. You can copy and paste it and run it yourself.

from BeautifulSoup import BeautifulSoup
import re

# Note that neither <p> tag is closed and </body> is missing.
doc = ['<html><head><title>Page title</title></head>',
       '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.',
       '<p id="secondpara" align="blah">This is paragraph <b>two</b>.',
       '</html>']
soup = BeautifulSoup(''.join(doc))

print(soup.prettify())
# <html>
#  <head>
#   <title>
#    Page title
#   </title>
#  </head>
#  <body>
#   <p id="firstpara" align="center">
#    This is paragraph
#    <b>
#     one
#    </b>
#    .
#   </p>
#   <p id="secondpara" align="blah">
#    This is paragraph
#    <b>
#     two
#    </b>
#    .
#   </p>
#  </body>
# </html>

Here are some ways to navigate the parsed document:

soup.contents[0].name
# u'html'

soup.contents[0].contents[0].name
# u'head'

head = soup.contents[0].contents[0]
head.parent.name
# u'html'

head.next
# <title>Page title</title>

head.nextSibling.name
# u'body'

head.nextSibling.contents[0]
# <p id="firstpara" align="center">This is paragraph <b>one</b>.</p>

head.nextSibling.contents[0].nextSibling
# <p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>
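
For comparison, the same navigation can be sketched with Beautiful Soup 4, where `next` is spelled `next_element` and `nextSibling` is spelled `next_sibling`. This is an illustration under assumptions, not the article's own code: it uses `bs4` with `html.parser`, and the sample document has its `<p>` and `<body>` tags closed explicitly so that the tree shape does not depend on how a given parser repairs markup.

```python
from bs4 import BeautifulSoup  # assumption: Beautiful Soup 4

# Same document as in the article, but with the end tags written out.
doc = ('<html><head><title>Page title</title></head>'
       '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.</p>'
       '<p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>'
       '</body></html>')
soup = BeautifulSoup(doc, "html.parser")

html_tag = soup.contents[0]             # top-level <html> tag
head = html_tag.contents[0]             # first child of <html>
parent_name = head.parent.name          # going back up the tree
first_inside = head.next_element        # BS3 spelling: head.next
body = head.next_sibling                # BS3 spelling: head.nextSibling
first_para = body.contents[0]
second_para = first_para.next_sibling

print(html_tag.name, head.name, parent_name, first_inside.name,
      body.name, first_para["id"], second_para["id"])
```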

Next are several ways to search the document for particular tags, or for tags with particular attributes:

titleTag = soup.html.head.title
titleTag
# <title>Page title</title>

titleTag.string
# u'Page title'

# Calling the soup object directly is shorthand for findAll.
len(soup('p'))
# 2

soup.findAll('p', align="center")
# [<p id="firstpara" align="center">This is paragraph <b>one</b>.</p>]

soup.find('p', align="center")
# <p id="firstpara" align="center">This is paragraph <b>one</b>.</p>

soup('p', align="center")[0]['id']
# u'firstpara'

# Attribute values can also be matched with a regular expression.
soup.find('p', align=re.compile('^b.*'))['id']
# u'secondpara'

soup.find('p').b.string
# u'one'

soup('p')[1].b.string
# u'two'
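
The same searches can be sketched with Beautiful Soup 4, where `findAll` is spelled `find_all`. As before, this assumes `bs4` with the `html.parser` backend and uses a copy of the sample document with the end tags written out.

```python
import re
from bs4 import BeautifulSoup  # assumption: Beautiful Soup 4

doc = ('<html><head><title>Page title</title></head>'
       '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.</p>'
       '<p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>'
       '</body></html>')
soup = BeautifulSoup(doc, "html.parser")

title_text = soup.html.head.title.string               # attribute-style navigation
p_count = len(soup("p"))                                # calling soup is shorthand for find_all
centered = soup.find_all("p", align="center")           # list of matching tags
first_id = soup.find("p", align="center")["id"]         # first match only
b_id = soup.find("p", align=re.compile("^b"))["id"]     # align value starting with "b"
bold_texts = [p.b.string for p in soup.find_all("p")]   # text of each paragraph's <b>

print(title_text, p_count, first_id, b_id, bold_texts)
```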

Of course, it is also easy to modify the document:

# Set an attribute and replace the title text.
titleTag['id'] = 'theTitle'
titleTag.contents[0].replaceWith("New title")
soup.html.head
# <head><title id="theTitle">New title</title></head>

# Remove the first paragraph from the tree.
soup.p.extract()
print(soup.prettify())
# <html>
#  <head>
#   <title id="theTitle">
#    New title
#   </title>
#  </head>
#  <body>
#   <p id="secondpara" align="blah">
#    This is paragraph
#    <b>
#     two
#    </b>
#    .
#   </p>
#  </body>
# </html>

# Replace the remaining paragraph with its own <b> tag.
soup.p.replaceWith(soup.b)
print(soup.prettify())
# <html>
#  <head>
#   <title id="theTitle">
#    New title
#   </title>
#  </head>
#  <body>
#   <b>
#    two
#   </b>
#  </body>
# </html>

# Insert text before and after the <b> tag.
soup.body.insert(0, "This page used to have ")
soup.body.insert(2, " <p> tags!")
soup.body
# <body>This page used to have <b>two</b> <p> tags!</body>
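
Here is a sketch of the same modifications with Beautiful Soup 4, where `replaceWith` is spelled `replace_with`. It assumes `bs4` with `html.parser` and a copy of the sample document with the end tags written out. One difference that stands out: version 4 escapes `<` and `>` in inserted text when it writes the tree out, so the literal `<p>` comes back as `&lt;p&gt;`.

```python
from bs4 import BeautifulSoup  # assumption: Beautiful Soup 4

doc = ('<html><head><title>Page title</title></head>'
       '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.</p>'
       '<p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>'
       '</body></html>')
soup = BeautifulSoup(doc, "html.parser")

# Set an attribute and replace the title text.
title_tag = soup.title
title_tag["id"] = "theTitle"
title_tag.string.replace_with("New title")
head_markup = str(soup.head)

# Remove the first paragraph from the tree.
soup.p.extract()
remaining_ids = [p["id"] for p in soup.find_all("p")]

# Replace the remaining paragraph with its own <b> tag.
soup.p.replace_with(soup.b)

# Insert plain text before and after the <b> tag.
soup.body.insert(0, "This page used to have ")
soup.body.insert(2, " <p> tags!")
body_markup = str(soup.body)   # the inserted "<p>" is escaped on output

print(head_markup)
print(remaining_ids)
print(body_markup)
```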