The Role of Tag in Python (python tag)

Reposted

epeppanda 2024-01-03 13:20:12

Tags: the role of tag in Python, python, networking, html, a tag. Category: Python, back-end development

1 Parsers

2 Kinds of objects

(1) Tag

(2) BeautifulSoup

(3) Other types

3 Extracting information

(1) Searching the document tree

(2) CSS selectors

(3) Combining with urlopen

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")  # html: the HTML document as a string

1 Parsers

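As a minimal sketch of choosing a parser: the second argument to BeautifulSoup names the parser. "html.parser" ships with Python's standard library, while "lxml" (used later in this article) is a third-party package that has to be installed separately. The sample document below is made up for illustration and is not from the original article.

```python
from bs4 import BeautifulSoup

# Hypothetical sample document, made up for illustration.
html = "<html><head><title>Demo</title></head><body><p>Hello</p></body></html>"

# "html.parser" is the parser bundled with the standard library.
soup = BeautifulSoup(html, "html.parser")

print(soup.title.string)  # Demo
print(soup.p.string)      # Hello
```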

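2 Kinds of objects

(1) Tag

A Tag object is a node of the document, such as an a tag or a p tag in HTML.

Tag: a tag, delimited by <> and </>, e.g. soup.a

Name: the tag's name, e.g. "p" for <p>; soup.a.name

Attributes: the tag's attributes, as a dictionary; soup.a.attrs

NavigableString: the non-attribute text inside a tag; soup.a.string

Comment: a comment inside a tag's text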

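# Tag objects
def getHTMLTag(soup):
    print("1.1 title tag: " + str(soup.title))
    print("1.2 a tag: " + str(soup.a))
    print("  1.2.1 name of the a tag: " + str(soup.a.name))
    print("  1.2.2 attributes of the a tag: " + str(soup.a.attrs))
    # soup.a['onclick'] raises KeyError if the attribute is missing; .get() returns None instead
    print("       1) onclick attribute of the a tag: " + str(soup.a['onclick']) + " " + str(soup.a.get('onclick')))
    print("2.1 text inside the tag: " + str(soup.a.string) + " " + str(type(soup.a.string)))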
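The properties listed above can be tried on a small document. This is only a sketch: the HTML string and its attribute values are invented for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample document, made up for illustration.
html = '<p class="intro"><a href="/home" id="link1">Home</a></p>'
soup = BeautifulSoup(html, "html.parser")

a = soup.a                      # the first <a> tag in the document
print(a.name)                   # a
print(a.attrs)                  # {'href': '/home', 'id': 'link1'}
print(a["href"])                # /home (raises KeyError if the attribute is missing)
print(a.get("onclick"))         # None (.get() returns None instead of raising)
print(type(a.string).__name__)  # NavigableString
```

Dictionary-style access is convenient when the attribute is known to exist; .get() is the safer choice when it may be absent.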

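(2) BeautifulSoup

The BeautifulSoup object represents the whole document and is the root of the tree. It behaves like a special Tag and supports most of the methods for navigating and searching the tree, but because it does not correspond to a real HTML tag it has no attributes of its own, and its .name is the special value "[document]".

Attributes for moving between nodes:

.contents: a list of all direct children

.children: an iterator over the direct children

.descendants: an iterator over all descendants

.parent: the parent node

.parents: an iterator over all ancestors

.next_sibling: the next node at the same level

.previous_sibling: the previous node at the same level

.next_siblings: all following nodes at the same level

.previous_siblings: all preceding nodes at the same level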

def getHTMLBeautifulSoup(soup):
    print("3. Using the BeautifulSoup object.")
    print("Ancestors of the body tag")
    for parent in soup.body.parents:
        if parent is not None:
            print(parent.name)
    print("Siblings before body")
    for pre in soup.body.previous_siblings:
        # Text nodes have no tag name, so only tags are printed
        if pre.name is not None:
            print(pre.name)
    print("All descendants of body")
    for des in soup.body.descendants:
        if des.name is not None:
            print(des.name)
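A short sketch of the navigation attributes on a tiny document may help. The HTML string is invented for illustration, and isinstance(..., Tag) is used here as one explicit way of skipping text nodes.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical sample document, made up for illustration (no whitespace between tags).
html = "<html><head><title>T</title></head><body><p>one</p><p>two</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

body = soup.body
child_names = [child.name for child in body.contents]
ancestor_names = [parent.name for parent in body.parents]
previous_names = [sib.name for sib in body.previous_siblings]
# .descendants also yields text nodes, so keep only Tag objects here.
descendant_tags = [d.name for d in body.descendants if isinstance(d, Tag)]

print(child_names)                # ['p', 'p']
print(ancestor_names)             # ['html', '[document]']
print(previous_names)             # ['head']
print(body.p.next_sibling.string) # two
print(descendant_tags)            # ['p', 'p']
```

The last ancestor is the BeautifulSoup object itself, which is why its name shows up as "[document]".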

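(3) Other types

NavigableString represents the text inside a tag; it is not a tag itself (some functions operate on and produce NavigableString objects rather than Tag objects).

Comment represents the comments (<!-- ... -->) in an HTML document; it is a special kind of NavigableString.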
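The difference between the two types can be checked directly. The following is a sketch on an invented document; the variable names text and note are illustrative only.

```python
from bs4 import BeautifulSoup, Comment, NavigableString

# Hypothetical sample document, made up for illustration.
html = "<p>plain text</p><b><!-- a comment --></b>"
soup = BeautifulSoup(html, "html.parser")

text = soup.p.string  # ordinary text inside <p>
note = soup.b.string  # the only child of <b> is a comment

print(type(text).__name__)                # NavigableString
print(type(note).__name__)                # Comment
print(isinstance(note, NavigableString))  # True
print(repr(str(note)))                    # ' a comment '
```

Because .string returns a comment's text without the <!-- --> markers, checking isinstance(node, Comment) is a simple way to keep comments out of extracted text.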

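3 Extracting information

(1) Searching the document tree

The .find_all() and .find() methods

soup.find_all(name, attrs, recursive, string, limit, **kwargs)  # returns a list of every match; .find() takes the same filters and returns only the first match, or None

Parameters:

name: filter on the tag name.
attrs: filter on attribute values. A plain string passed in this position is matched against the CSS class.
           e.g.   soup.find_all("p", "course")   # p tags whose class is "course"
                  soup.find_all(id="link1")      # tags whose id is "link1" (a keyword filter)

recursive: whether to search all descendants (default True); with False only direct children are searched.
string: filter on the text between <>...</>; matching strings are returned rather than tags.
           e.g.   soup.find_all(string="course")

**kwargs: keyword arguments are treated as filters on tag attributes.
limit: the maximum number of results to return.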
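The parameters above can be exercised on a small document. This is a sketch only; the HTML string, the class name "course" and the id "link1" mirror the article's examples but the document itself is invented.

```python
from bs4 import BeautifulSoup

# Hypothetical sample document, made up for illustration.
html = """
<div>
  <p class="course">Python</p>
  <p class="course">course</p>
  <p>Other <a id="link1" href="/a">A</a></p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

all_p = soup.find_all("p")                    # filter by tag name
course_p = soup.find_all("p", "course")       # tag name plus CSS class
link = soup.find_all(id="link1")              # keyword filter on an attribute
texts = soup.find_all(string="course")        # exact match on text, returns strings
first_only = soup.find_all("p", limit=1)      # stop after one result
direct_a = soup.div.find_all("a", recursive=False)  # direct children of <div> only

print(len(all_p), len(course_p))  # 3 2
print(link[0]["href"])            # /a
print(texts)                      # ['course']
print(len(first_only))            # 1
print(len(direct_a))              # 0
print(soup.find("table"))         # None
```

The <a> tag is nested inside a <p>, so the non-recursive search from <div> finds nothing, while .find() on a missing tag returns None rather than an empty list.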

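(2) CSS selectors

By tag name: soup.select('title')

By class name: soup.select('.sister')

By id: soup.select('#link1')

Combined: soup.select('p #link1') finds the element whose id is link1 inside a p tag; the two parts must be separated by a space.

By attribute: soup.select('a[class="sister"]') The attribute belongs to the same node as the tag, so there must be no space between them; otherwise nothing matches.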
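The selectors above can be compared side by side on one document. This is a sketch: the HTML string is invented, reusing the class "sister" and id "link1" from the examples.

```python
from bs4 import BeautifulSoup

# Hypothetical sample document, made up for illustration.
html = """
<html><head><title>Story</title></head><body>
<p class="story">
  <a class="sister" id="link1" href="/elsie">Elsie</a>
  <a class="sister" id="link2" href="/lacie">Lacie</a>
</p>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

by_tag = soup.select("title")
by_class = soup.select(".sister")
by_id = soup.select("#link1")
nested = soup.select("p #link1")               # id link1 somewhere inside a <p>
by_attr = soup.select('a[class="sister"]')     # <a> tags that carry the attribute
with_space = soup.select('a [class="sister"]') # descendants of <a>: no match here

print(by_tag[0].string)   # Story
print(len(by_class))      # 2
print(by_id[0]["href"])   # /elsie
print(nested[0].string)   # Elsie
print(len(by_attr))       # 2
print(with_space)         # []
```

The last two selectors differ only by a space: with the space, the attribute test applies to elements nested inside an a tag, and this document has none.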

(3) Combining with urlopen

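from urllib.request import urlopen
from urllib.error import HTTPError, URLError
from bs4 import BeautifulSoup

def getTitle(url):
    # Return the first <h2> inside <body>, or None if the page cannot be fetched or parsed.
    try:
        html = urlopen(url)
    except (HTTPError, URLError):
        # The server returned an error status or could not be reached.
        return None
    try:
        # The "lxml" parser requires the third-party lxml package.
        bsObj = BeautifulSoup(html.read(), "lxml")
        title = bsObj.body.h2
    except AttributeError:
        # Raised if bsObj.body is None.
        return None
    return title

title = getTitle("http://www.zuihaodaxue.com/BCSR/wangluokongjiananquan2018.html")
if title is None:
    print("Title could not be found")
else:
    print(title)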
