Using Beautiful Soup

Reposted

mb5fdb128f2dba9 2019-03-14 18:30:00


Beautiful Soup usage:

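(1) So far, whenever we scraped a page, we used regular expressions to pull out the information we wanted. That approach is cumbersome: one mistake anywhere in the pattern and nothing matches. Beautiful Soup offers another way to do the extraction.
(2) Beautiful Soup is a Python library for parsing HTML and XML, and it makes extracting data from web pages straightforward. Install it with pip: pip3 install beautifulsoup4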
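
To make the contrast in point (1) concrete, here is a minimal sketch that extracts the same value with a regular expression and with Beautiful Soup. The sample markup and the variable names are hypothetical, made up for this illustration, and the built-in 'html.parser' is used so the snippet runs without lxml installed.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample page: note the extra class attribute on the <p> tag.
html = '<html><head><title>Home</title></head><body><p id="username" class="x">alice</p></body></html>'

# Regex approach: the pattern is tied to the exact markup, so the extra
# attribute is enough to make the match fail.
match = re.search(r'<p id="username">(.*?)</p>', html)
regex_result = match.group(1) if match else None

# Beautiful Soup approach: look the tag up by its id attribute, regardless
# of what other attributes it carries.
soup = BeautifulSoup(html, 'html.parser')
tag = soup.find('p', id='username')
soup_result = tag.get_text() if tag else None

print(regex_result)  # None
print(soup_result)   # alice
```

The regex could of course be loosened to tolerate extra attributes, but every such case has to be anticipated by hand, whereas the parser handles them uniformly.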

import re
from bs4 import BeautifulSoup

html = '''
    <!DOCTYPE html>
    <html lang="en">
    <head>
        <meta charset="UTF-8">
        <title>首页</title>
    </head>
    <body>
        <p id="username">This is your username</p>
        <p id="password">This is your password</p>
    </body>
    </html>
'''

soup = BeautifulSoup(html, 'lxml')    # Create a BeautifulSoup object from the HTML text; 'lxml' selects lxml's HTML parser (install it separately with: pip3 install lxml)

result = soup.title           # The <title> tag: <title>首页</title>
result = soup.title.name      # The name of the <title> tag: title
result = soup.title.string    # The text inside the <title> tag: 首页
result = soup.head.title      # The <title> tag nested under <head>: <title>首页</title>
result = soup.p.attrs         # All attributes of the first <p> tag: {'id': 'username'}
result = soup.p.attrs['id']   # The value of the id attribute of the first <p> tag: username
result = soup.body.contents   # Direct children of <body> as a list, including the whitespace strings between tags, e.g. ['\n', <p id="username">This is your username</p>, '\n', <p id="password">This is your password</p>, '\n'] (the exact whitespace depends on the indentation of the source)
result = soup.body.children   # The same direct children as .contents, but returned as an iterator, so loop over it with for to get the items
result = soup.p.parent        # The parent of the first <p> tag: <body>...</body>

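result = soup.find_all(name='head')                # Find all tags by tag name: [<head><meta charset="utf-8"/><title>首页</title></head>]
result = soup.find_all(attrs={'id': 'username'})   # Find all tags by attribute value: [<p id="username">This is your username</p>]
result = soup.find_all(string=re.compile('your'))  # Find by text content; returns the matching strings rather than tags: ['This is your username', 'This is your password'] (older releases spell this argument text=)
result = soup.find(name='p')                       # find_all() returns every match; find() takes the same arguments but returns only the first: <p id="username">This is your username</p>
result = soup.select('title, #username')           # select() takes a CSS selector: [<title>首页</title>, <p id="username">This is your username</p>]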
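
The calls above differ mainly in what they return, which matters when nothing matches. The following is a small self-contained sketch of that behavior, not part of the original listing: it assumes a trimmed copy of the sample page with an English title, uses the built-in 'html.parser' instead of lxml so it runs without extra packages, and the variable names are only illustrative.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Trimmed copy of the sample page (assumption: same structure as above).
html = '''
<html>
<head><title>Home</title></head>
<body>
<p id="username">This is your username</p>
<p id="password">This is your password</p>
</body>
</html>
'''
soup = BeautifulSoup(html, 'html.parser')

# find() returns a single Tag or None; find_all() always returns a list.
first_p = soup.find('p')
all_p = soup.find_all('p')
missing = soup.find('table')          # no <table> in the page
missing_all = soup.find_all('table')  # empty list, not None

# select() returns a list of matches; select_one() returns the first match.
ids = [p['id'] for p in soup.select('body > p')]
username_text = soup.select_one('#username').get_text()

# .children also yields the whitespace strings between tags,
# so filter on Tag to keep only the element nodes.
child_tags = [child.name for child in soup.body.children if isinstance(child, Tag)]

print(first_p['id'], len(all_p), missing, missing_all)
print(ids, username_text, child_tags)
```

Because find() and select_one() return None when nothing matches, it is worth checking the result before reading attributes or text from it.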