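Recently I needed to parse HTML files. Many libraries are available for this, such as dom4j and jsoup. Underneath, they all depend on a DOM tree: each one converts the input into a tree represented by a Document object. This article focuses on how to use dom4j.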
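Introduction to dom4j
dom4j is an open-source Java library for parsing XML documents. Its API aims to be flexible, fast, and memory-efficient. It is designed for the Java platform and works with standard Java collections such as List. It can be used together with DOM, SAX, XPath, and XSLT, and it is designed to keep memory usage low when processing large XML documents.
dom4j has several core types:
- Document: represents the entire XML document. The Document object is what is usually called the DOM tree.
- Element: represents an XML element. An Element has methods for working with its child elements, text, attributes, and namespaces.
- Attribute: represents an attribute of an element. It has methods to get and set the attribute's value, and it knows its parent node and its attribute type.
- Node: represents an element, an attribute, or a processing instruction.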
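To make the roles of these types concrete, here is a small sketch that parses an XML string and prints its element tree. It deliberately uses the JDK's built-in org.w3c.dom API instead of dom4j, so that it runs without extra dependencies; the Document, Element, and Node interfaces in org.w3c.dom play roles analogous to the dom4j types listed above. The class name DomTreeSketch and the sample XML are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class DomTreeSketch {
    /** Parses the XML and returns one line per element: indentation, tag name, attributes. */
    static String outline(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        StringBuilder out = new StringBuilder();
        walk(doc.getDocumentElement(), 0, out);
        return out.toString();
    }

    /** Depth-first walk over element nodes only; text nodes are skipped. */
    private static void walk(Element e, int depth, StringBuilder out) {
        for (int i = 0; i < depth; i++) {
            out.append("  ");
        }
        out.append(e.getTagName());
        NamedNodeMap attrs = e.getAttributes();
        for (int i = 0; i < attrs.getLength(); i++) {
            Node a = attrs.item(i);
            out.append(" ").append(a.getNodeName()).append("=").append(a.getNodeValue());
        }
        out.append("\n");
        NodeList children = e.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            if (children.item(i) instanceof Element) {
                walk((Element) children.item(i), depth + 1, out);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        // Prints:
        // beans
        //   bean id=a
        //     property name=x
        System.out.print(outline(
                "<beans><bean id=\"a\"><property name=\"x\"/></bean></beans>"));
    }
}
```

The root element sits at depth 0, and each nested element is one level deeper, which is the tree shape that the parsing code in this article walks by hand.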
Using dom4j
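// Path of the file to parse
File file = new File("spring.xml");
// Create the reader
SAXReader xmlReader = new SAXReader();
// Read the XML file and return a Document object (throws DocumentException on failure)
Document document = xmlReader.read(file);
// Get the root element of the document
Element root = document.getRootElement();
// Get the first child element named "bean" under the root (null if there is none)
Element pa = root.element("bean");
// Print the value of this element's "class" attribute
System.out.println(pa.attributeValue("class"));
// Print the text inside the element
System.out.println(pa.getText());
// Get all direct child elements of the root, ready to be iterated over
List<Element> eles = root.elements();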
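Example: parsing an HTML file with dom4j
Parse the HTML file and extract six values from each row of the table: spread index, video cover image URL, blogger name, publish time, like count, and comment count.
HTML file: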
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8"/>
<title>Title</title>
</head>
<body>
<div class="search-details-wrapper">
<div id="article" class="mp-article-list">
<div class="page-load" id="js-load-more-loading" style="display:none;"><i class="fa fa-spinner fa-spin"></i>正在加载
</div>
<div class="table-thead s-thead">
<table class="table">
<colgroup>
<col width="135"/>
<col/>
<col width="120"/>
<col width="100"/>
<col width="100"/>
<col width="186"/>
</colgroup>
<thead>
<tr>
<th>
传播指数
<div class="tooltip-box tooltip-box-c">
<div class="v-tooltip">
<a href="https://dy.feigua.cnhelp/detail/2/423.html"
target="_blank">
<i class="fa fa-question-circle-o"
aria-hidden="true"></i>
</a>
<span class="tooltiptext" style="z-index: 9999;">
由帐号近期的粉丝总量、当前视频的互动数据、音乐使用数等维度加权计算获得<br/>
数值越高,上抖音热门的几率越大
<a href="https://dy.feigua.cnhelp/detail/2/423.html"
target="_blank">【点击查看更多】</a>
</span>
</div>
</div>
</th>
<th>视频内容</th>
<th>播主</th>
<th>点赞数</th>
<th>评论数</th>
<th class="text-center">操作</th>
</tr>
</thead>
</table>
</div>
<div class="table-thead s-thead">
<table class="table">
<colgroup>
<col width="135"/>
<col/>
<col width="120"/>
<col width="100"/>
<col width="100"/>
<col width="186"/>
</colgroup>
<tbody class="js-awemelist" id="js-awemelist">
<tr data-id="10520945" data-awemeid="6909005165450300683" class="js-slider-aweme">
<td>
<span class="risk-index ">5.3</span>
</td>
<td>
<div class="media-list">
<div class="item-media">
<img src="http://img2.feigua.cn/img/tos-cn-p-0015/a61de251b9004ca0852312868ab1dc2b~c5_300x400.jpeg?from=2563711402_large-thumb"/>
</div>
<div class="item-inner">
<div class="item-title">
<a href="javascript:;" class="js-open-aweme-pop" data-id="10520945"
data-awemeid="6909005165450300683" data-active="detail">
街霸2长春石坤神操作走升龙#街霸直播 #怀旧游戏 #拳皇 #主机单机直播 #街机游戏 #经典游戏
</a>
</div>
<div class="item-times">
<lable>视频时长:</lable>
10秒
</div>
</div>
</div>
</td>
<td>
<div class="">
<div class="item-inner">
<a href="#/Blogger/Detail?id=10520945&amp;timestamp=1608710433&amp;signature=738c92a90995b5204137988810e7e20e"
target="_blank" class="">街霸2长春石坤-王者归来</a>
<div class="item-sub-title">23 小时前</div>
</div>
</div>
</td>
<td>35</td>
<td>6</td>
<td>
<div class="mp-article-source">
<a href="javascript:;" data-id="10520945" data-awemeid="6909005165450300683"
data-active="detail" class="source-details js-open-aweme-pop" data-toggle="tooltip"
data-placement="top" title="指数分析">
<i class="icon-details"></i>
</a>
<a href="javascript:;" data-id="10520945" data-awemeid="6909005165450300683"
data-active="fans" class="fans-analysis js-open-aweme-pop" data-toggle="tooltip"
data-placement="top" title="视频观众分析"><i class="icon-fans" aria-hidden="true"></i></a>
<a href="https://www.douyin.com/share/video/6909005165450300683/?mid=6909005248086821640"
target="_blank" class="source-play" data-toggle="tooltip" data-placement="top"
title="播放">
<i class="icon-play" aria-hidden="true"></i>
</a>
<a href="javascript:;" class="source-collection js-list-sync-aweme" data-id="1480450117"
data-toggle="tooltip" data-placement="top" title="收藏">
<i class="icon-star1"></i>
</a>
</div>
</td>
</tr>
<tr data-id="13621266" data-awemeid="6909036610629700879" class="js-slider-aweme">
<td>
<span class="risk-index ">0.1</span>
</td>
<td>
<div class="media-list">
<div class="item-media">
<img src="http://img2.feigua.cn/img/tos-cn-p-0015/f513a503f78849f696a9a063de72886f~c5_300x400.jpeg?from=2563711402_large-thumb"/>
</div>
<div class="item-inner">
<div class="item-title">
<a href="javascript:;" class="js-open-aweme-pop" data-id="13621266"
data-awemeid="6909036610629700879" data-active="detail">
#街霸2 #街霸女玩家大宝 双击评论加关注,每天晚上8点直播,感谢支持
</a>
</div>
<div class="item-times">
<lable>视频时长:</lable>
49秒
</div>
</div>
</div>
</td>
<td>
<div class="">
<div class="item-inner">
<a href="#/Blogger/Detail?id=13621266&amp;timestamp=1608710433&amp;signature=fcd82fe01f24667bf8d6d1b2e389f962"
target="_blank" class="">街霸2女主播大宝</a>
<div class="item-sub-title">21 小时前</div>
</div>
</div>
</td>
<td>33</td>
<td>4</td>
<td>
<div class="mp-article-source">
<a href="javascript:;" data-id="13621266" data-awemeid="6909036610629700879"
data-active="detail" class="source-details js-open-aweme-pop" data-toggle="tooltip"
data-placement="top" title="指数分析">
<i class="icon-details"></i>
</a>
<a href="javascript:;" data-id="13621266" data-awemeid="6909036610629700879"
data-active="fans" class="fans-analysis js-open-aweme-pop" data-toggle="tooltip"
data-placement="top" title="视频观众分析"><i class="icon-fans" aria-hidden="true"></i></a>
<a href="https://www.douyin.com/share/video/6909036610629700879/?mid=6909036690301127437"
target="_blank" class="source-play" data-toggle="tooltip" data-placement="top"
title="播放">
<i class="icon-play" aria-hidden="true"></i>
</a>
<a href="javascript:;" class="source-collection js-list-sync-aweme" data-id="1481361695"
data-toggle="tooltip" data-placement="top" title="收藏">
<i class="icon-star1"></i>
</a>
</div>
</td>
</tr>
<tr data-id="13621266" data-awemeid="6909025484391206159" class="js-slider-aweme">
<td>
<span class="risk-index ">0.1</span>
</td>
<td>
<div class="media-list">
<div class="item-media">
<img src="http://img2.feigua.cn/img/tos-cn-p-0015/a5cc846395cb4547b291431b49d24848~c5_300x400.jpeg?from=2563711402_large-thumb"/>
</div>
<div class="item-inner">
<div class="item-title">
<a href="javascript:;" class="js-open-aweme-pop" data-id="13621266"
data-awemeid="6909025484391206159" data-active="detail">
#热门 #街霸2 #街霸女玩家大宝 #求关注 #街机摇杆 需要摇杆加大宝vx:1220486797
</a>
</div>
<div class="item-times">
<lable>视频时长:</lable>
2分8秒
</div>
</div>
</div>
</td>
<td>
<div class="">
<div class="item-inner">
<a href="#/Blogger/Detail?id=13621266&amp;timestamp=1608710433&amp;signature=fcd82fe01f24667bf8d6d1b2e389f962"
target="_blank" class="">街霸2女主播大宝</a>
<div class="item-sub-title">22 小时前</div>
</div>
</div>
</td>
<td>9</td>
<td>5</td>
<td>
<div class="mp-article-source">
<a href="javascript:;" data-id="13621266" data-awemeid="6909025484391206159"
data-active="detail" class="source-details js-open-aweme-pop" data-toggle="tooltip"
data-placement="top" title="指数分析">
<i class="icon-details"></i>
</a>
<a href="javascript:;" data-id="13621266" data-awemeid="6909025484391206159"
data-active="fans" class="fans-analysis js-open-aweme-pop" data-toggle="tooltip"
data-placement="top" title="视频观众分析"><i class="icon-fans" aria-hidden="true"></i></a>
<a href="https://www.douyin.com/share/video/6909025484391206159/?mid=6909025602734656263"
target="_blank" class="source-play" data-toggle="tooltip" data-placement="top"
title="播放">
<i class="icon-play" aria-hidden="true"></i>
</a>
<a href="javascript:;" class="source-collection js-list-sync-aweme" data-id="1481361704"
data-toggle="tooltip" data-placement="top" title="收藏">
<i class="icon-star1"></i>
</a>
</div>
</td>
</tr>
<tr data-id="7757187" data-awemeid="6909004554730343693" class="js-slider-aweme">
<td>
<span class="risk-index ">0.1</span>
</td>
<td>
<div class="media-list">
<div class="item-media">
<img src="http://img2.feigua.cn/img/tos-cn-p-0015/8e479503a85c4f26ac05bb693f75f719~c5_300x400.jpeg?from=2563711402_large-thumb"/>
</div>
<div class="item-inner">
<div class="item-title">
<a href="javascript:;" class="js-open-aweme-pop" data-id="7757187"
data-awemeid="6909004554730343693" data-active="detail">
#永不过时的游戏大全 #街霸2 达尔希姆吐火时说的啥至今没听清楚😄
</a>
</div>
<div class="item-times">
<lable>视频时长:</lable>
59秒
</div>
</div>
</div>
</td>
<td>
<div class="">
<div class="item-inner">
<a href="#/Blogger/Detail?id=7757187&amp;timestamp=1608710433&amp;signature=0f050e57c87c2c2b30c1201d601027c4"
target="_blank" class="">街机游戏协会会长</a>
<div class="item-sub-title">23 小时前</div>
</div>
</div>
</td>
<td>3</td>
<td>0</td>
<td>
<div class="mp-article-source">
<a href="javascript:;" data-id="7757187" data-awemeid="6909004554730343693"
data-active="detail" class="source-details js-open-aweme-pop" data-toggle="tooltip"
data-placement="top" title="指数分析">
<i class="icon-details"></i>
</a>
<a href="javascript:;" data-id="7757187" data-awemeid="6909004554730343693"
data-active="fans" class="fans-analysis js-open-aweme-pop" data-toggle="tooltip"
data-placement="top" title="视频观众分析"><i class="icon-fans" aria-hidden="true"></i></a>
<a href="https://www.douyin.com/share/video/6909004554730343693/?mid=6909004646233492232"
target="_blank" class="source-play" data-toggle="tooltip" data-placement="top"
title="播放">
<i class="icon-play" aria-hidden="true"></i>
</a>
<a href="javascript:;" class="source-collection js-list-sync-aweme" data-id="1480331586"
data-toggle="tooltip" data-placement="top" title="收藏">
<i class="icon-star1"></i>
</a>
</div>
</td>
</tr>
<tr data-id="9154258" data-awemeid="6909033118515186956" class="js-slider-aweme">
<td>
<span class="risk-index ">0.1</span>
</td>
<td>
<div class="media-list">
<div class="item-media">
<img src="http://img2.feigua.cn/img/tos-cn-p-0015/fe9bff15c42c43f08cfe95c23d3abed9_1608634679~c5_300x400.jpg?from=2563711402_large-thumb"/>
</div>
<div class="item-inner">
<div class="item-title">
<a href="javascript:;" class="js-open-aweme-pop" data-id="9154258"
data-awemeid="6909033118515186956" data-active="detail">
我最后的三强赛直播精选片段 背水一战#街霸2 #背水一战 #永不过时的游戏大全
</a>
</div>
<div class="item-times">
<lable>视频时长:</lable>
3分54秒
</div>
</div>
</div>
</td>
<td>
<div class="">
<div class="item-inner">
<a href="#/Blogger/Detail?id=9154258&amp;timestamp=1608710433&amp;signature=0bf9fd894837b32e1f014e2df54d2a37"
target="_blank" class="">拳王专业户</a>
<div class="item-sub-title">21 小时前</div>
</div>
</div>
</td>
<td>18</td>
<td>1</td>
<td>
<div class="mp-article-source">
<a href="javascript:;" data-id="9154258" data-awemeid="6909033118515186956"
data-active="detail" class="source-details js-open-aweme-pop" data-toggle="tooltip"
data-placement="top" title="指数分析">
<i class="icon-details"></i>
</a>
<a href="javascript:;" data-id="9154258" data-awemeid="6909033118515186956"
data-active="fans" class="fans-analysis js-open-aweme-pop" data-toggle="tooltip"
data-placement="top" title="视频观众分析"><i class="icon-fans" aria-hidden="true"></i></a>
<a href="https://www.douyin.com/share/video/6909033118515186956/?mid=6896326384122087437"
target="_blank" class="source-play" data-toggle="tooltip" data-placement="top"
title="播放">
<i class="icon-play" aria-hidden="true"></i>
</a>
<a href="javascript:;" class="source-collection js-list-sync-aweme" data-id="1480526757"
data-toggle="tooltip" data-placement="top" title="收藏">
<i class="icon-star1"></i>
</a>
</div>
</td>
</tr>
<tr data-id="11947435" data-awemeid="6909314189307677966" class="js-slider-aweme">
<td>
<span class="risk-index ">0.1</span>
</td>
<td>
<div class="media-list">
<div class="item-media">
<img src="http://img2.feigua.cn/img/tos-cn-i-0004/ed52b168b6c24b45a6a3f77417b6a87e~c5_300x400.jpeg?from=2563711402_large-thumb"/>
</div>
<div class="item-inner">
<div class="item-title">
<a href="javascript:;" class="js-open-aweme-pop" data-id="11947435"
data-awemeid="6909314189307677966" data-active="detail">
街霸2:两位"枪王"带来精彩兵警对局!温州20强PK键盘实力派大狗熊 #街霸2 #天津西风
</a>
</div>
<div class="item-times">
<lable>视频时长:</lable>
2分12秒
</div>
</div>
</div>
</td>
<td>
<div class="">
<div class="item-inner">
<a href="#/Blogger/Detail?id=11947435&amp;timestamp=1608710433&amp;signature=33811074707bc32a4645e6afd51dbf13"
target="_blank" class="">街霸2西风</a>
<div class="item-sub-title">3 小时前</div>
</div>
</div>
</td>
<td>8</td>
<td>0</td>
<td>
<div class="mp-article-source">
<a href="javascript:;" data-id="11947435" data-awemeid="6909314189307677966"
data-active="detail" class="source-details js-open-aweme-pop" data-toggle="tooltip"
data-placement="top" title="指数分析">
<i class="icon-details"></i>
</a>
<a href="javascript:;" data-id="11947435" data-awemeid="6909314189307677966"
data-active="fans" class="fans-analysis js-open-aweme-pop" data-toggle="tooltip"
data-placement="top" title="视频观众分析"><i class="icon-fans" aria-hidden="true"></i></a>
<a href="https://www.douyin.com/share/video/6909314189307677966/?mid=6909314357843299085"
target="_blank" class="source-play" data-toggle="tooltip" data-placement="top"
title="播放">
<i class="icon-play" aria-hidden="true"></i>
</a>
<a href="javascript:;" class="source-collection js-list-sync-aweme" data-id="1481467896"
data-toggle="tooltip" data-placement="top" title="收藏">
<i class="icon-star1"></i>
</a>
</div>
</td>
</tr>
<tr data-id="11947435" data-awemeid="6909307306790391053" class="js-slider-aweme">
<td>
<span class="risk-index ">0.1</span>
</td>
<td>
<div class="media-list">
<div class="item-media">
<img src="http://img2.feigua.cn/img/tos-cn-i-0004/782186e6392d44b8a99e2a6de3c84061~c5_300x400.jpeg?from=2563711402_large-thumb"/>
</div>
<div class="item-inner">
<div class="item-title">
<a href="javascript:;" class="js-open-aweme-pop" data-id="11947435"
data-awemeid="6909307306790391053" data-active="detail">
街霸2:顶级红人狮子对局!傲视张无忌PK晓峰CE #西风 #街霸 #天津西风 #街霸2
</a>
</div>
<div class="item-times">
<lable>视频时长:</lable>
3分43秒
</div>
</div>
</div>
</td>
<td>
<div class="">
<div class="item-inner">
<a href="#/Blogger/Detail?id=11947435&amp;timestamp=1608710433&amp;signature=33811074707bc32a4645e6afd51dbf13"
target="_blank" class="">街霸2西风</a>
<div class="item-sub-title">3 小时前</div>
</div>
</div>
</td>
<td>20</td>
<td>0</td>
<td>
<div class="mp-article-source">
<a href="javascript:;" data-id="11947435" data-awemeid="6909307306790391053"
data-active="detail" class="source-details js-open-aweme-pop" data-toggle="tooltip"
data-placement="top" title="指数分析">
<i class="icon-details"></i>
</a>
<a href="javascript:;" data-id="11947435" data-awemeid="6909307306790391053"
data-active="fans" class="fans-analysis js-open-aweme-pop" data-toggle="tooltip"
data-placement="top" title="视频观众分析"><i class="icon-fans" aria-hidden="true"></i></a>
<a href="https://www.douyin.com/share/video/6909307306790391053/?mid=6909307475753814792"
target="_blank" class="source-play" data-toggle="tooltip" data-placement="top"
title="播放">
<i class="icon-play" aria-hidden="true"></i>
</a>
<a href="javascript:;" class="source-collection js-list-sync-aweme" data-id="1481467897"
data-toggle="tooltip" data-placement="top" title="收藏">
<i class="icon-star1"></i>
</a>
</div>
</td>
</tr>
</tbody>
</table>
</div>
<div class="page-load" id="js-pager-end" style="display:none;">没有更多了~</div>
<div id="js-pager-limit"></div>
</div>
</div>
</body>
</html>
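By default, dom4j's SAXReader hands the input to a SAX parser, so a page like the one above can only be read if it is well-formed XML: every tag closed, attribute values quoted, and each & inside a URL written as &amp;. As a rough pre-check, the sketch below runs a fragment through the JDK's own SAX parser and reports whether it parses. This is not a dom4j API; the class and method names are made up for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class WellFormedCheck {
    /** Returns true if the markup parses as well-formed XML, false otherwise. */
    static boolean isWellFormed(String markup) {
        try {
            SAXParserFactory.newInstance()
                    .newSAXParser()
                    // DefaultHandler ignores all content and rethrows fatal errors
                    .parse(new InputSource(new StringReader(markup)), new DefaultHandler());
            return true;
        } catch (SAXException e) {
            // Well-formedness violations are reported as fatal SAX errors
            return false;
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Escaped ampersand in the attribute value: true
        System.out.println(isWellFormed("<td><a href=\"#/x?id=1&amp;t=2\">ok</a></td>"));
        // Bare ampersand in the attribute value: false
        System.out.println(isWellFormed("<td><a href=\"#/x?id=1&t=2\">bad</a></td>"));
        // Unclosed <br> tag, common in real-world HTML: false
        System.out.println(isWellFormed("<p>line<br></p>"));
    }
}
```

Pages saved straight from a browser often fail this check, in which case a lenient HTML parser such as jsoup is usually the more practical choice.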
The parsing code is as follows:
package test;

import org.dom4j.Document;
import org.dom4j.Element;
import org.dom4j.io.SAXReader;

import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class Demo {
    /**
     * Finds the first direct child element whose attribute has the given value.
     *
     * @param e     parent element to search in
     * @param attr  attribute name
     * @param value expected attribute value
     * @return the matching child element, or null if none matches
     */
    private Element getElementByAttribute(Element e, String attr, String value) {
        List<Element> elements = e.elements();
        for (Element ele : elements) {
            if (ele.attribute(attr) != null && ele.attributeValue(attr).equals(value)) {
                return ele;
            }
        }
        return null;
    }

    /**
     * Extracts the video information from the table.
     *
     * @param file the HTML file to parse
     * @return one array of video information per table row
     */
    public List<String[]> getMovieInfos(File file) {
        ArrayList<String[]> ret = new ArrayList<>();
        SAXReader xmlReader = new SAXReader();
        try {
            // Navigate from the root down to the <tbody> that holds the data rows
            Document document = xmlReader.read(file);
            // First child of <body>: div.search-details-wrapper
            Element page = document.getRootElement().element("body").elements().get(0);
            Element article = getElementByAttribute(page, "id", "article");
            // Third child of #article: the div wrapping the data table
            Element element = article.elements().get(2);
            Element tbody = element.element("table").element("tbody");
            List<Element> elements = tbody.elements();
            for (int i = 0; i < elements.size(); i++) {
                // Extract the spread index, cover image URL, blogger name,
                // publish time, like count, and comment count
                String[] movieInfos = new String[6];
                Element tr = elements.get(i);
                String spread = tr.elements().get(0).element("span").getText().trim();
                String img = tr.elements().get(1).element("div").element("div").element("img").attributeValue("src").trim();
                String author = tr.elements().get(2).element("div").element("div").element("a").getText().trim();
                String time = tr.elements().get(2).element("div").element("div").element("div").getText().trim();
                String like = tr.elements().get(3).getText().trim();
                String comment = tr.elements().get(4).getText().trim();
                movieInfos[0] = spread;
                movieInfos[1] = img;
                movieInfos[2] = author;
                movieInfos[3] = time;
                movieInfos[4] = like;
                movieInfos[5] = comment;
                ret.add(movieInfos);
            }
            return ret;
        } catch (Exception e) {
            e.printStackTrace();
        }
        // Parsing failed: return an empty list
        return new ArrayList<>();
    }

    public static void main(String[] args) {
        Demo demo = new Demo();
        List<String[]> movieInfos = demo.getMovieInfos(new File("test.html"));
        for (String[] movieInfo : movieInfos) {
            System.out.println(Arrays.toString(movieInfo));
        }
    }
}
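Returning each row as a String[] keeps the code short, but callers have to remember which index holds which value. One possible refinement is to wrap each row in a small value class with named, typed fields. The sketch below is an assumption and not part of the original code: the class VideoInfo and its field names are hypothetical, and it relies on the six-element order that getMovieInfos fills in above (spread index, image URL, author, publish time, likes, comments).

```java
class VideoInfo {
    final double spreadIndex;
    final String coverUrl;
    final String author;
    final String published;
    final int likes;
    final int comments;

    VideoInfo(double spreadIndex, String coverUrl, String author,
              String published, int likes, int comments) {
        this.spreadIndex = spreadIndex;
        this.coverUrl = coverUrl;
        this.author = author;
        this.published = published;
        this.likes = likes;
        this.comments = comments;
    }

    /** Converts one six-element row into a typed object, parsing the numeric fields. */
    static VideoInfo fromRow(String[] row) {
        if (row == null || row.length != 6) {
            throw new IllegalArgumentException("expected a row with exactly 6 values");
        }
        return new VideoInfo(
                Double.parseDouble(row[0]),
                row[1],
                row[2],
                row[3],
                Integer.parseInt(row[4]),
                Integer.parseInt(row[5]));
    }

    @Override
    public String toString() {
        return author + " (" + published + "): index=" + spreadIndex
                + ", likes=" + likes + ", comments=" + comments;
    }

    public static void main(String[] args) {
        // Hypothetical row in the same order that getMovieInfos uses
        String[] row = {"5.3", "http://example.com/cover.jpeg", "someAuthor", "23 hours ago", "35", "6"};
        System.out.println(VideoInfo.fromRow(row));
    }
}
```

The Demo class above does not depend on this sketch; run unchanged, it prints each table row as a plain array.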
Execution result: