[Python Library] Using bs4

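Original post

Python爬虫案例 2022-12-28 17:13:20 Category: Python爬虫案例

© Copyright belongs to the author. This is an original work by Python爬虫案例 on the 51CTO blog; contact the author for permission before reposting, otherwise legal liability will be pursued.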

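Like lxml, BeautifulSoup is a tool for working with HTML/XML: its main job is to parse HTML/XML documents and extract data from them. It relies on a parser backend such as lxml or Python's built-in html.parser to do the low-level parsing.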

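BeautifulSoup is a third-party library, so it has to be installed before use. The package is published on PyPI as beautifulsoup4 and imported as bs4; the demo below also uses the lxml parser, so install that as well. Open a command prompt (on Windows, run cmd) and enter:

pip install beautifulsoup4 lxml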

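It makes extracting specific content from a web page quick and simple. You hand it the page source as a string, it builds an object from that string, and you then call methods on that object to pull out the data you need.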
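As a minimal sketch of that workflow, the snippet below parses a small string and reads two values from it. The HTML string and the variable names are illustrative assumptions, and Python's built-in `html.parser` backend is used here so that no extra parser is required.

```python
from bs4 import BeautifulSoup

# Hypothetical page source; in a real crawler this would come from an HTTP response.
page = '<html><body><h1 id="top">Hello</h1><a href="/next">Next page</a></body></html>'

# Step 1: turn the string into a BeautifulSoup object.
soup = BeautifulSoup(page, "html.parser")

# Step 2: use the object's methods to pull out the parts you need.
heading = soup.find("h1").get_text()
link = soup.find("a").get("href")

print(heading)  # Hello
print(link)     # /next
```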

For concrete usage, see the demo code below:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-


from bs4 import BeautifulSoup
import re


def bs_method():
    html = """
         <html><head><title>The Dormouse's story</title></head>
         <body>
         <p class="title" name="dromouse"><b>The Dormouse's story</b></p>
         <p class="story">Once upon a time there were three little sisters; and their names were
         <a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
         <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
         <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
         and they lived at the bottom of a well.</p>
         <p class="story">...</p>
         """


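    # 1. Parse the HTML string into a BeautifulSoup object
    soup = BeautifulSoup(html, "lxml")
    # 2. find --> returns only the first element that matches the condition
    p = soup.find("p")                        # by tag name
    p = soup.find(attrs={"class": "title"})   # by attribute
    p = soup.find(string="...")               # by text content (`string` replaces the older `text` argument)
    p = soup.find(re.compile("^b"))           # by regex on the tag name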


    # 3. find_all --> returns a list of every match in the whole document
    p = soup.find_all("p")
    # print(len(p))


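    # 4. select --> returns a list; searches the whole document with CSS selectors
    a = soup.select("#link2")               # ID selector
    a = soup.select("a")                    # tag selector
    a = soup.select(".sister")              # class selector
    a = soup.select("p #link2")             # descendant (hierarchy) selector
    a = soup.select("title,a")              # union (group) selector
    p = soup.select('p[class="story"]')[1]  # attribute selector; [1] takes the second match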


    # Get the text wrapped by the tag
    p_content = p.get_text()
    # Get an attribute: `class` is multi-valued, so get() returns a list
    p_class = p.get("class")
    print(p_class[0])


if __name__ == '__main__':
    bs_method()
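
The demo above reassigns `p` and `a` without printing them, so it can be hard to see what each call returns. The following sketch prints a few of those results. It reuses the demo's HTML with the closing `</body></html>` tags added, and uses the built-in `html.parser` backend instead of lxml; the variable names are illustrative assumptions, not part of the original demo.

```python
import re
from bs4 import BeautifulSoup

# Same document as the demo above, with closing tags added.
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# find() returns a single Tag, or None when nothing matches.
first_p = soup.find("p")
missing = soup.find("table")

# A compiled regex is matched against tag names; "^b" matches both <body> and <b>.
b_names = [tag.name for tag in soup.find_all(re.compile("^b"))]

# find_all() and select() return list-like results, which may be empty.
all_p = soup.find_all("p")
sisters = soup.select("a.sister")

# get() returns a list for the multi-valued "class" attribute and a plain string otherwise.
link2 = soup.select("#link2")[0]  # soup.select_one("#link2") would also work

print(first_p.get_text())        # The Dormouse's story
print(missing)                   # None
print(b_names)                   # ['body', 'b']
print(len(all_p), len(sisters))  # 3 3
print(link2.get("class"))        # ['sister']
print(link2.get("href"))         # http://example.com/lacie
```

Because `find()` can return None, it is usually worth checking the result before calling `get_text()` or `get()` on it when scraping pages whose structure is not guaranteed.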