Parsing href in HTML with Python

Reposted

晨曦微露s 2024-10-31 10:29:02

Tags: parsing href in HTML with Python, python, xml, Code, XML. Category: Python, backend development

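Contents

1. Ways to parse XML in Python

1.1 DOM approach

File parsing
Creating and modifying

1.2 SAX approach
1.3 etree.Element approach

File parsing

Basic parsing
Using XPath
Namespaces

Creating and modifying

2. Working with XML files in Python

2.1 Creating an XML file
2.2 Node operations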

XML (eXtensible Markup Language) is a markup language designed for transporting and storing data.

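1. Ways to parse XML in Python

Python offers three main ways to process XML files:

The xml.dom module
DOM: the Document Object Model. When parsing, the entire XML document is read into memory at once and turned into a tree; the XML is then manipulated through operations on that tree.
Drawback: high memory usage.
The xml.sax module
SAX: the Simple API for XML, an event-driven API. As the parser reads the XML it fires a series of events and calls user-defined callback functions to process the document.
Advantage: event-driven, so the whole file does not need to be loaded into memory.
The xml.etree.ElementTree module
Advantage: uses less memory than DOM, with performance close to SAX.
The Python standard library has shipped two implementations of ElementTree (ET):

xml.etree.ElementTree (recommended);
xml.etree.cElementTree (deprecated since Python 3.3 and removed in Python 3.9).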
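
To make the difference between the three approaches concrete, here is a minimal sketch that extracts the same text with each API. The SAMPLE document and the BookHandler class are hypothetical and exist only for this illustration; they are not part of the original article.

```python
import xml.sax
from xml.dom import minidom
import xml.etree.ElementTree as ET

# Hypothetical sample document, used only for this illustration.
SAMPLE = "<library><book id='1'>A</book><book id='2'>B</book></library>"

# DOM: the whole document becomes a tree of Node objects in memory.
dom = minidom.parseString(SAMPLE)
dom_titles = [node.firstChild.nodeValue for node in dom.getElementsByTagName("book")]


# SAX: the parser fires callbacks; only what the handler stores is kept.
class BookHandler(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.titles = []
        self._in_book = False
        self._chunks = []

    def startElement(self, name, attrs):
        if name == "book":
            self._in_book = True
            self._chunks = []

    def characters(self, content):
        # Text may arrive in several pieces, so collect and join later.
        if self._in_book:
            self._chunks.append(content)

    def endElement(self, name):
        if name == "book":
            self.titles.append("".join(self._chunks))
            self._in_book = False


handler = BookHandler()
xml.sax.parseString(SAMPLE.encode("utf-8"), handler)
sax_titles = handler.titles

# ElementTree: a lighter tree of Element objects.
root = ET.fromstring(SAMPLE)
et_titles = [book.text for book in root.findall("book")]

print(dom_titles, sax_titles, et_titles)
```

All three produce the same list; the difference lies in how much of the document is held in memory and how much bookkeeping the caller has to do.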

XML file A

<?xml version="1.0" encoding="UTF-8"?>
<Model xmlns:a="attribute" xmlns:c="collection" xmlns:o="object">
<c:Tables>
<o:Table Id="o3888">
<a:ObjectID>9C9E4066-4C71-4A3E-8B82-667D001B64F8</a:ObjectID>
<a:Name>缴销申请</a:Name>
<a:Code>JC_ZJK_LC_JXSQ</a:Code>
<a:CreationDate>1589337305</a:CreationDate>
<a:Creator>pws</a:Creator>
<a:ModificationDate>1593477162</a:ModificationDate>
<a:Modifier>pws</a:Modifier>
<a:Comment>流程缴销申请</a:Comment>
<a:TotalSavingCurrency/>
<a:Description>1.数据源：

2.处理规则：</a:Description>
<c:ExtendedCollections>
<o:ExtendedCollection Id="o4280">
<a:ObjectID>68AD7C9A-04CF-4911-9816-5465B7EB14A3</a:ObjectID>
<a:Name>Related Columns</a:Name>
<a:ExtendedBaseCollection.CollectionName>Related Columns</a:ExtendedBaseCollection.CollectionName>
<a:CreationDate>1619489206</a:CreationDate>
<a:Creator>pws</a:Creator>
<a:ModificationDate>1619489206</a:ModificationDate>
<a:Modifier>pws</a:Modifier>
</o:ExtendedCollection>
</c:ExtendedCollections>
<c:Columns>
<o:Column Id="o4281">
<a:ObjectID>45FFDDB7-0BE3-4B50-925E-7E76B8B9C315</a:ObjectID>
<a:Name>缴销申请UUID</a:Name>
<a:Code>JXSQUUID</a:Code>
<a:CreationDate>1589337305</a:CreationDate>
<a:Creator>pws</a:Creator>
<a:ModificationDate>1594004217</a:ModificationDate>
<a:Modifier>pws</a:Modifier>
<a:Comment>缴销申请UUID</a:Comment>
<a:DataType>VARCHAR(37)</a:DataType>
<a:Length>37</a:Length>
<a:Column.Mandatory>1</a:Column.Mandatory>
</o:Column>
<o:Column Id="o4286">
<a:ObjectID>379E9E07-B8A0-49EB-8A1A-C295A9857A0D</a:ObjectID>
<a:Name>受理日期</a:Name>
<a:Code>SLRQ</a:Code>
<a:CreationDate>1589337305</a:CreationDate>
<a:Creator>pws</a:Creator>
<a:ModificationDate>1593477162</a:ModificationDate>
<a:Modifier>pws</a:Modifier>
<a:Comment>受理日期</a:Comment>
<a:DataType>DATE</a:DataType>
</o:Column>
<o:Column Id="o4294">
<a:ObjectID>7600D929-2493-40B6-A009-BDE4119E23A6</a:ObjectID>
<a:Name>数据同步时间</a:Name>
<a:Code>SJTB_SJ</a:Code>
<a:CreationDate>1589337305</a:CreationDate>
<a:Creator>pws</a:Creator>
<a:ModificationDate>1593477162</a:ModificationDate>
<a:Modifier>pws</a:Modifier>
<a:Comment>数据同步时间</a:Comment>
<a:DataType>TIMESTAMP(6)</a:DataType>
<a:Length>6</a:Length>
</o:Column>
<o:Column Id="o4302">
<a:ObjectID>3DDCB4AE-165D-4C49-A5EB-C3B4293A3589</a:ObjectID>
<a:Name>数据集成批次号</a:Name>
<a:Code>SJJCPCH</a:Code>
<a:CreationDate>1589337305</a:CreationDate>
<a:Creator>pws</a:Creator>
<a:ModificationDate>1594004217</a:ModificationDate>
<a:Modifier>pws</a:Modifier>
<a:Comment>数据集成批次号</a:Comment>
<a:DataType>VARCHAR(20)</a:DataType>
<a:Length>20</a:Length>
</o:Column>
</c:Columns>
<c:Keys>
<o:Key Id="o4303">
<a:ObjectID>92207BC1-E331-4969-B43E-6F3231D24203</a:ObjectID>
<a:Name>主键_缴销申请</a:Name>
<a:Code>PK_JC_ZJK_LC_JXSQ</a:Code>
<a:CreationDate>1589337305</a:CreationDate>
<a:Creator>pws</a:Creator>
<a:ModificationDate>1593477162</a:ModificationDate>
<a:Modifier>pws</a:Modifier>
<a:Comment>主键_流程缴销申请</a:Comment>
<c:Key.Columns>
<o:Column Ref="o4281"/>
</c:Key.Columns>
</o:Key>
</c:Keys>
<c:PrimaryKey>
<o:Key Ref="o4303"/>
</c:PrimaryKey>
</o:Table>
</c:Tables>
</Model>

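Structure

An XML file consists of a root node and child nodes. Each node has a name and text, and may carry attributes; each attribute has a name and a value.

1.1 DOM approach

In Python, xml.dom.minidom is used to parse and manipulate XML files.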

Node.childNodes
Node.nodeName
Node.nodeValue
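Node.hasAttributes() Note: returns True/False;
Node.hasChildNodes() Note: returns True/False;

Document object

Document.documentElement Note: returns the single root element of the document;
Document.createElement(tagName) Note: creates and returns a new element node (use Document.createElementNS(namespaceURI, tagName) for an element with a namespace). The new node is not attached to the document when it is created; it must be inserted explicitly with insertBefore() or appendChild();

File parsing

1) Build the DOM object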

from xml.dom import minidom

domTree = minidom.parse("OnlineMovie.xml")
rootElement = domTree.documentElement
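
As a rough illustration of the Node members listed above, the sketch below walks a small DOM tree. The inline SAMPLE string is a hypothetical stand-in for OnlineMovie.xml (whose contents are not shown here), and describe is a helper invented for this example.

```python
from xml.dom import Node, minidom

# Hypothetical stand-in for OnlineMovie.xml.
SAMPLE = '<collection shelf="New"><movie title="Demo"><year>2003</year></movie></collection>'


def describe(node, depth=0):
    """Return one line per node, indented by its depth in the tree."""
    indent = "  " * depth
    if node.nodeType == Node.ELEMENT_NODE:
        line = "{}element {} hasAttributes={}".format(
            indent, node.nodeName, node.hasAttributes())
    else:
        # Text nodes are named "#text" and carry their content in nodeValue.
        line = "{}{} {!r}".format(indent, node.nodeName, node.nodeValue)
    lines = [line]
    if node.hasChildNodes():
        for child in node.childNodes:
            lines.extend(describe(child, depth + 1))
    return lines


dom_tree = minidom.parseString(SAMPLE)
root_element = dom_tree.documentElement
dom_lines = describe(root_element)
print("\n".join(dom_lines))
```

Note that in the DOM the text "2003" is a separate child node of year, not a property of the element itself.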

Creating and modifying

1) Creation interface
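
The original gives no code for this step. One possible sketch, using only the standard minidom calls createElement, createTextNode, setAttribute and appendChild, is shown below; the element and attribute names are made up for illustration.

```python
from xml.dom import minidom

# Start from an empty document and attach a root element.
doc = minidom.Document()
root = doc.createElement("collection")  # hypothetical element name
root.setAttribute("shelf", "New Arrivals")
doc.appendChild(root)

# A new node is not part of the tree until it is appended explicitly.
movie = doc.createElement("movie")
movie.setAttribute("title", "Demo")
root.appendChild(movie)

year = doc.createElement("year")
year.appendChild(doc.createTextNode("2003"))
movie.appendChild(year)

# Modify existing content: replace the text and overwrite an attribute.
year.firstChild.data = "2004"
movie.setAttribute("title", "Demo (updated)")

xml_text = root.toxml()
print(doc.toprettyxml(indent="  "))
```

To save the result, the string returned by toprettyxml() or toxml() can be written to a file opened with the matching encoding.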

1.2 SAX approach

Not covered here.

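1.3 etree.Element approach

Official Python documentation: https://docs.python.org/zh-cn/3/library/xml.etree.elementtree.html?

• element.tag Note: the element's tag name;
• element.text Note: the element's text (the text before its first child element), or None;
• element.attrib Note: the element's attributes, as a dict;
• len(element) Note: the number of direct child elements;
• element.get(key) Note: the value of the attribute named key, or None if the element has no such attribute;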
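
The members listed above can be tried on a small inline document. The SAMPLE string below is hypothetical and serves only as an illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical element with two attributes and two children.
SAMPLE = '<movie title="Demo" year="2003"><type>Action</type><stars>8</stars></movie>'
movie = ET.fromstring(SAMPLE)

print(movie.tag)                   # tag name
print(movie.attrib)                # all attributes as a dict
print(len(movie))                  # number of direct children
print(movie.get("title"))          # one attribute value
print(movie.get("missing"))        # None when the attribute is absent
print(movie.get("missing", "n/a")) # optional default value
print(movie[0].tag, movie[0].text) # children are reachable by index
```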

File parsing

Basic parsing

1) Build the tree object

import xml.etree.ElementTree as et

domTree=et.parse("OnlineMovie.xml")
rootElement=domTree.getroot()

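2) Traverse the nodes

Function implementation

def parseXmlElement(element):
    # Print this element, then recurse into each of its child elements.
    print("tag:", element.tag, "-->text:", element.text, "; attrib:", element.attrib)
    if len(element) > 0:
        for x in element:
            parseXmlElement(x)

Note: the function visits every element recursively, descending into any element that has at least one child;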

element.iter()

element.iter()

element.iter(tag="")

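Notes:
1) iter() with no argument walks the current element and all of its descendants in document order; iter(tag) walks the same subtree but yields only the elements whose tag equals tag;
2) when a tag is given, the children of a matching element are not yielded unless they carry that tag themselves;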

for element in rootElement.iter():
    print("tag:", element.tag, "-->text:", element.text, "; attrib:", element.attrib)
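
A small sketch of the difference between iter() and iter(tag), run on a hypothetical inline document (the element names are assumptions made for this example):

```python
import xml.etree.ElementTree as ET

# Hypothetical document: one movie directly under the root, one nested deeper.
SAMPLE = (
    "<collection>"
    "<movie title='A'><year>2003</year></movie>"
    "<group><movie title='B'><year>1989</year></movie></group>"
    "</collection>"
)
root = ET.fromstring(SAMPLE)

# No argument: the element itself plus every descendant, in document order.
all_tags = [element.tag for element in root.iter()]

# With a tag: the same walk, filtered to matching elements at any depth.
movie_titles = [element.get("title") for element in root.iter("movie")]

print(all_tags)
print(movie_titles)
```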

element.findall(path)

for element in rootElement.findall(path="movie"):
    print("tag:", element.tag, "-->text:", element.text, "; attrib:", element.attrib)

Note: with a bare tag name, element.findall returns only the direct children of the current element that have that tag;

for element in rootElement.findall(path="movie"):
    for x in element.iter():
        print("tag:", x.tag, "-->text:", x.text, "; attrib:", x.attrib)

Note: element.iter() recursively walks the current element and all of its descendants;
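
The reach of findall depends on the path. The sketch below contrasts a bare tag name with a ".//" path and an attribute predicate, using a hypothetical inline document.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: one movie directly under the root, one nested deeper.
SAMPLE = (
    "<collection>"
    "<movie title='A'><year>2003</year></movie>"
    "<group><movie title='B'><year>1989</year></movie></group>"
    "</collection>"
)
root = ET.fromstring(SAMPLE)

# Bare tag name: direct children only.
direct = [m.get("title") for m in root.findall("movie")]

# ".//" prefix: matching elements at any depth below the current element.
nested = [m.get("title") for m in root.findall(".//movie")]

# Predicate: filter on an attribute value.
by_attr = [m.get("title") for m in root.findall(".//movie[@title='B']")]

print(direct, nested, by_attr)
```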

element.find(path)

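movie = rootElement.find(path="movie")
if movie is not None:
    # Iterate over the children of the first <movie> element.
    for element in movie:
        print("tag:", element.tag, "-->text:", element.text, "; attrib:", element.attrib)

Note: element.find returns the first direct child with the given tag, or None when there is no match;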
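
Because find returns None for a missing element, results should be compared with None before use; an Element without children is falsy, so a plain truth test is not a reliable check. The sketch below, on a hypothetical inline document, also shows findtext, which fetches the text of the first match in one call.

```python
import xml.etree.ElementTree as ET

# Hypothetical document for illustration.
root = ET.fromstring(
    "<collection><movie title='A'><year>2003</year></movie></collection>"
)

first_movie = root.find("movie")  # first direct child named "movie"
missing = root.find("book")       # no such child, so None

# findtext returns the text of the first match, or the default if none.
year_text = root.findtext("movie/year")
rating_text = root.findtext("movie/rating", default="n/a")

if first_movie is not None:
    print(first_movie.get("title"), year_text, rating_text)
print(missing)
```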

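Using XPath

The two forms below are equivalent ways of writing a namespace-qualified path:
Form A: write the namespace URI in braces in front of each tag name:

rootElement.findall(".//{collection}Views/{object}View")

Form B: use prefixes in the path and pass a prefix-to-URI mapping through namespaces:

ns = {"o": "object", "c": "collection", "a": "attribute"}
rootElement.findall(".//c:Tables/o:Table", namespaces=ns)

In form A, the path of form B would be written ".//{collection}Tables/{object}Table".

Code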

import xml.etree.ElementTree as et

document = et.parse(source="专题模型.pdm")
ns = {"o": "object", "c": "collection", "a": "attribute"}
rootElement = document.getroot()
tableList = rootElement.findall(".//c:Tables/o:Table",namespaces=ns)
viewList = rootElement.findall(".//{collection}Views/{object}View")
objectList = tableList + viewList
for element in objectList:
    objId = element.find("{attribute}ObjectID")
    Code = element.find("{attribute}Code")
    Name = element.find("{attribute}Name")

    # print("element.tag={},element.text={}".format(objId.tag,objId.text))
    print("element.tag={},element.text={}".format(Code.tag, Code.text))
    print("element.tag={},element.text={}".format(Name.tag, Name.text))
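
The code above assumes that every table has Code and Name children; find returns None otherwise and the .text access fails. A more defensive sketch is shown below. It runs on a shortened inline excerpt of XML file A rather than on the .pdm file, and the helper text_of is a name invented for this example.

```python
import xml.etree.ElementTree as ET

# Shortened excerpt of XML file A, inlined so the example is self-contained.
PDM_EXCERPT = """<Model xmlns:a="attribute" xmlns:c="collection" xmlns:o="object">
<c:Tables>
<o:Table Id="o3888">
<a:Name>缴销申请</a:Name>
<a:Code>JC_ZJK_LC_JXSQ</a:Code>
</o:Table>
</c:Tables>
</Model>"""

NS = {"o": "object", "c": "collection", "a": "attribute"}
root = ET.fromstring(PDM_EXCERPT)


def text_of(element, path, default=""):
    """Text of the first match for a prefixed path, or default if absent."""
    return element.findtext(path, default=default, namespaces=NS)


tables = root.findall(".//c:Tables/o:Table", namespaces=NS)
views = root.findall(".//c:Views/o:View", namespaces=NS)  # none in this excerpt

rows = [
    (text_of(obj, "a:Code"), text_of(obj, "a:Name"), text_of(obj, "a:ObjectID", "<none>"))
    for obj in tables + views
]
for code, name, object_id in rows:
    print(code, name, object_id)
```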

Namespaces

<Model xmlns:a="attribute" xmlns:c="collection" xmlns:o="object"> This declaration specifies that
 the prefix a: maps to the namespace attribute;
 the prefix c: maps to the namespace collection;
 the prefix o: maps to the namespace object;

When looking up elements with XPath-style paths, the path can be nested over several levels:

rootElement.findall(".//{object}Column")
rootElement.findall(".//{collection}Columns/{object}Column")
rootElement.findall(".//{object}Table/{collection}Columns/{object}Column")
rootElement.findall(".//{collection}Tables/{object}Table/{collection}Columns/{object}Column")
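
How specific the path is changes what is matched. The sketch below runs on a shortened inline excerpt of XML file A: the shortest path also picks up the column reference under Key.Columns, while the longer paths return only the columns defined under Columns.

```python
import xml.etree.ElementTree as ET

# Shortened excerpt of XML file A, inlined so the example is self-contained.
PDM_EXCERPT = """<Model xmlns:a="attribute" xmlns:c="collection" xmlns:o="object">
<c:Tables>
<o:Table Id="o3888">
<c:Columns>
<o:Column Id="o4281"><a:Code>JXSQUUID</a:Code></o:Column>
<o:Column Id="o4286"><a:Code>SLRQ</a:Code></o:Column>
</c:Columns>
<c:Keys>
<o:Key Id="o4303">
<c:Key.Columns><o:Column Ref="o4281"/></c:Key.Columns>
</o:Key>
</c:Keys>
</o:Table>
</c:Tables>
</Model>"""

root = ET.fromstring(PDM_EXCERPT)

# Every o:Column anywhere, including the reference inside c:Key.Columns.
any_column = root.findall(".//{object}Column")

# Only o:Column elements that sit directly under c:Columns of an o:Table.
table_columns = root.findall(".//{object}Table/{collection}Columns/{object}Column")
full_path = root.findall(
    ".//{collection}Tables/{object}Table/{collection}Columns/{object}Column"
)

column_codes = [column.findtext("{attribute}Code") for column in table_columns]
print(len(any_column), len(table_columns), column_codes)

# Parsed tags carry the namespace in braces; the prefix itself is not kept.
print(table_columns[0].tag)
```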