Notes on Python's Beautiful Soup Module (1)

Original post by Cabins, 2012-12-13 16:40:54

© Copyright belongs to the author. This is an original work by Cabins on the 51CTO blog; contact the author for permission before reposting.

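Importing the module

For HTML and XML

from BeautifulSoup import BeautifulSoup       # For HTML (Beautiful Soup 3)
from BeautifulSoup import BeautifulStoneSoup  # For XML (Beautiful Soup 3)
# Beautiful Soup 4 uses a single class instead: from bs4 import BeautifulSoup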

Creating a BeautifulSoup object (using the Baidu home page as an example)

# Only the key code is shown
import urllib.request
page = urllib.request.urlopen('http://www.baidu.com/')
soup = BeautifulSoup(page)

Finding specific elements in the HTML

(1) Finding by tag

# For example
soup.html.head.title # title tag
soup.html.head.title.name # title tag's name
soup.html.head.title.string # title tag's content text
# The shorter form also works
soup.title
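
The tag navigation above can be tried without a network connection. The following is a minimal sketch, assuming Beautiful Soup 4 (the `bs4` package) and Python's built-in `html.parser`; the sample HTML string is made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample document, used instead of fetching a real page.
html = (
    "<html><head><title>Page title</title></head>"
    "<body><p id='firstpara'>This is paragraph one.</p></body></html>"
)
soup = BeautifulSoup(html, "html.parser")

title_tag = soup.html.head.title   # walk down the tree one tag at a time
print(title_tag)                   # the whole <title> element
print(title_tag.name)              # the tag's name
print(title_tag.string)            # the text inside the tag

# soup.title is a shortcut that finds the first <title> anywhere in the tree.
same_tag = soup.title
```

Attribute-style access such as `soup.title` returns only the first matching tag, so it suits elements that occur once per document.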

(2) Finding by HTML content (text)

import re 
soup.findAll(text = re.compile('para')) 
soup.findAll(text = re.compile('para'))[0].parent 
soup.findAll(text = re.compile('para'))[0].parent.contents
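
To see what these three calls return, here is a small sketch under the assumption that Beautiful Soup 4 is installed. It uses the bs4 spelling `find_all(string=...)` rather than the `findAll(text=...)` form shown above, and the sample HTML is invented for the example.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample document with two paragraphs.
html = (
    '<html><body><p id="firstpara">This is paragraph <b>one</b>.</p>'
    '<p id="secondpara">This is paragraph <b>two</b>.</p></body></html>'
)
soup = BeautifulSoup(html, "html.parser")

# Every text node whose content matches the regular expression.
matches = soup.find_all(string=re.compile("para"))

# A text node knows which tag contains it ...
parent = matches[0].parent
# ... and that tag lists all of its direct children, text and tags alike.
children = parent.contents

print(matches)
print(parent["id"])
print(children)
```

The search returns text nodes, not tags, which is why `.parent` is needed to get back to the enclosing element.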

(3) Finding by attribute (such as id)

soup.findAll(id = re.compile('para$'))
soup.findAll(attrs = {'id': re.compile('para$')})
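
The two forms are meant to be equivalent: a keyword argument filters on one attribute, while `attrs` takes a dictionary. A minimal sketch, assuming Beautiful Soup 4 and an invented sample document:

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample document; only two ids end in "para".
html = (
    '<div id="paragraphs">'
    '<p id="firstpara">one</p>'
    '<p id="secondpara">two</p>'
    "</div>"
)
soup = BeautifulSoup(html, "html.parser")

pattern = re.compile("para$")  # the id must END with "para"

by_keyword = soup.find_all(id=pattern)
by_attrs = soup.find_all(attrs={"id": pattern})

ids = [tag["id"] for tag in by_keyword]
print(ids)
```

The dictionary form is the one to reach for when the attribute name is not a valid Python keyword argument, for example `class` or a name containing a hyphen.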

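Encoding handling

[1] Beautiful Soup uses Unicode internally: it detects the document's encoding automatically and converts the document to Unicode.

[2] Encodings are tried in the following order:

1. The fromEncoding argument passed when creating the soup (from_encoding in bs4)
2. The encoding declared in the XML/HTML document itself
3. The encoding suggested by the first few bytes of the file
4. If chardet is installed, the encoding that chardet detects
5. UTF-8
6. Windows-1252
……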
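
The idea of trying candidate encodings in a fixed order can be illustrated with the standard library alone. This is a simplified sketch, not Beautiful Soup's actual implementation: the helper `decode_with_fallback` is hypothetical, and it covers only the explicitly supplied encoding, UTF-8, and Windows-1252, skipping the in-document declaration, byte sniffing, and chardet steps.

```python
def decode_with_fallback(data, declared=None):
    """Return (text, encoding_used), trying candidate encodings in order.

    Hypothetical helper for illustration only.
    """
    candidates = ([declared] if declared else []) + ["utf-8", "windows-1252"]
    for encoding in candidates:
        try:
            return data.decode(encoding), encoding
        except (UnicodeDecodeError, LookupError):
            # Wrong or unknown encoding: move on to the next candidate.
            continue
    # Last resort: keep going, replacing bytes that cannot be decoded.
    return data.decode("windows-1252", errors="replace"), "windows-1252"


print(decode_with_fallback("中文".encode("gbk"), declared="gbk"))
print(decode_with_fallback("héllo".encode("utf-8")))
print(decode_with_fallback(b"caf\xe9"))  # not valid UTF-8, so the next candidate is tried
```

An explicitly supplied encoding comes first because it is the only candidate based on outside knowledge; the later candidates are guesses from the bytes themselves.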

[3] Output is UTF-8 by default.

[4] soup.original_encoding gives the encoding that Beautiful Soup detected.

soup.original_encoding # bs4 syntax 
soup.originalEncoding # bs3 syntax

One More Thing - Tips

Listing the a tags that have an href attribute

for link in soup.findAll('a', href = True): 
    print(link['href'])  # Py3K syntax
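
Passing `href=True` keeps only the tags on which the attribute is present, whatever its value. A short sketch, assuming Beautiful Soup 4 (where `find_all` is the current spelling of `findAll`) and a made-up HTML fragment:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: the middle <a> is a named anchor without href.
html = (
    '<p><a href="http://example.com/a">A</a> '
    '<a name="anchor">no href</a> '
    '<a href="/b">B</a></p>'
)
soup = BeautifulSoup(html, "html.parser")

# href=True matches any <a> that has an href attribute at all.
links = [link["href"] for link in soup.find_all("a", href=True)]
print(links)
```

Filtering this way avoids the `KeyError` that `link['href']` would raise on an anchor without the attribute.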
