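I recently needed to parse HTML files. Several Java libraries can do this, for example dom4j and jsoup. Under the hood they all rely on a DOM tree: the markup is converted into a tree that is exposed as a Document object. Note that dom4j is an XML library, so it can only read HTML that is also well-formed XML (XHTML). This article covers how to use dom4j.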

Introduction to dom4j

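dom4j is an open-source Java library for parsing XML documents. Its API is flexible, fast, and memory-efficient. It is designed specifically for Java and uses standard Java collections such as List instead of its own container types. It works with DOM, SAX, XPath, and XSLT, and it can process large XML documents with a low memory footprint.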

dom4j has a few core types (all of them interfaces in the org.dom4j package):

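  • Document: represents the whole XML document. A Document object is what is usually called the DOM tree.
  • Element: represents an XML element. It has methods for working with its child elements, text, attributes, and namespaces.
  • Attribute: represents an attribute of an element. It has methods to get and set the attribute value, and it knows its parent element and its attribute type.
  • Node: the common base type for elements, attributes, processing instructions, and the other nodes in the tree.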
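To make the roles of these types concrete, here is a minimal sketch. It assumes dom4j is on the classpath; the sample XML string and the class name CoreTypesSketch are made up for illustration.

```java
import org.dom4j.Attribute;
import org.dom4j.Document;
import org.dom4j.DocumentException;
import org.dom4j.DocumentHelper;
import org.dom4j.Element;
import org.dom4j.Node;

public class CoreTypesSketch {

    // Hypothetical sample document, invented for this illustration.
    public static final String XML =
            "<beans><bean id=\"userService\" class=\"demo.UserService\">hello</bean></beans>";

    /** Describes the first <bean> child as "name|attribute=value|text". */
    public static String describeFirstBean(String xml) throws DocumentException {
        Document document = DocumentHelper.parseText(xml); // the whole tree
        Element root = document.getRootElement();          // <beans>
        Element bean = root.element("bean");               // first child named "bean"
        Attribute cls = bean.attribute("class");           // the attribute as an object
        return bean.getName() + "|" + cls.getName() + "=" + cls.getValue() + "|" + bean.getText();
    }

    /** Shows that an Attribute can be handled through the Node base type. */
    public static boolean attributeIsNode(String xml) throws DocumentException {
        Node node = DocumentHelper.parseText(xml).getRootElement().element("bean").attribute("id");
        return node.getNodeType() == Node.ATTRIBUTE_NODE;
    }

    public static void main(String[] args) throws DocumentException {
        System.out.println(describeFirstBean(XML)); // prints bean|class=demo.UserService|hello
        System.out.println(attributeIsNode(XML));   // prints true
    }
}
```

Document is the entry point, Element and Attribute carry the data, and Node is the type to use when code should not care which kind of node it receives.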

How to use dom4j

// Path of the file to parse
File file = new File("spring.xml");

// Create the reader
SAXReader xmlReader = new SAXReader();

// Read the XML file into a Document (throws DocumentException on failure)
Document document = xmlReader.read(file);

// Get the root element of the document
Element root = document.getRootElement();

// Get the first <bean> child element of the root
Element pa = root.element("bean");

// Print the value of its class attribute
System.out.println(pa.attributeValue("class"));

// Print the text inside the element
System.out.println(pa.getText());

// Get all direct child elements of the root, ready to be iterated
List<Element> eles = root.elements();
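The fragments above can be assembled into a runnable program roughly as follows. This is only a sketch: the two-bean file content, the temp-file handling, and the names BeanListing, beanClasses, and demo are assumptions for illustration, not part of the original spring.xml.

```java
import org.dom4j.Document;
import org.dom4j.Element;
import org.dom4j.io.SAXReader;

import java.io.File;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class BeanListing {

    /** Maps each <bean> id to its class attribute, in document order. */
    public static Map<String, String> beanClasses(File xmlFile) throws Exception {
        Document document = new SAXReader().read(xmlFile);
        Element root = document.getRootElement();
        // elements(name) returns every direct child with that name,
        // whereas element(name) returns only the first one.
        List<Element> beans = root.elements("bean");
        Map<String, String> result = new LinkedHashMap<>();
        for (Element bean : beans) {
            result.put(bean.attributeValue("id"), bean.attributeValue("class"));
        }
        return result;
    }

    /** Writes a made-up Spring-style file to a temp location and lists its beans. */
    public static Map<String, String> demo() throws Exception {
        String xml = "<beans>"
                + "<bean id=\"a\" class=\"demo.A\"/>"
                + "<bean id=\"b\" class=\"demo.B\"/>"
                + "</beans>";
        File tmp = File.createTempFile("spring-sketch", ".xml");
        try {
            Files.write(tmp.toPath(), xml.getBytes(StandardCharsets.UTF_8));
            return beanClasses(tmp);
        } finally {
            tmp.delete(); // best-effort cleanup of the temp file
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(demo()); // prints {a=demo.A, b=demo.B}
    }
}
```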

Example: parsing an HTML file with dom4j

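The goal is to parse an HTML file and pull the per-video fields out of its table: propagation index, video thumbnail, blogger, publish time, like count, and comment count.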
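Because dom4j reads its input through an XML parser, this only works when the page is well-formed XML: every tag closed (for example `<br/>` rather than `<br>`) and every `&` in an attribute written as `&amp;`. A quick way to check a page before handing it to dom4j is sketched below. It uses the JDK's built-in SAX parser rather than dom4j, and the class and method names are made up for illustration.

```java
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import java.io.IOException;
import java.io.StringReader;

public class WellFormedCheck {

    /** Returns true if the markup parses as well-formed XML, false otherwise. */
    public static boolean isWellFormed(String markup) {
        try {
            SAXParserFactory.newInstance()
                    .newSAXParser()
                    // DefaultHandler ignores all content and rethrows fatal errors
                    .parse(new InputSource(new StringReader(markup)), new DefaultHandler());
            return true;
        } catch (SAXException e) {
            return false; // the parser rejected the markup
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("<p>a<br/>b</p>"));           // prints true
        System.out.println(isWellFormed("<p>a<br>b</p>"));            // prints false
        System.out.println(isWellFormed("<a href=\"x?a=1&b=2\"/>"));  // prints false
    }
}
```

Pages that fail this check need to be cleaned up first, or parsed with an HTML-tolerant library such as jsoup.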

The HTML file:

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8"/>
    <title>Title</title>
</head>
<body>
<div class="search-details-wrapper">
    <div id="article" class="mp-article-list">
            <div class="page-load" id="js-load-more-loading" style="display:none;"><i class="fa fa-spinner fa-spin"></i>正在加载
            </div>
            <div class="table-thead  s-thead">
                <table class="table">
                    <colgroup>
                        <col width="135"/>
                        <col/>
                        <col width="120"/>
                        <col width="100"/>
                        <col width="100"/>
                        <col width="186"/>
                    </colgroup>
                    <thead>
                    <tr>
                        <th>
                            传播指数

                            <div class="tooltip-box tooltip-box-c">
                                <div class="v-tooltip">
                                    <a href="https://dy.feigua.cnhelp/detail/2/423.html"
                                       target="_blank">
                                        <i class="fa fa-question-circle-o"
                                           aria-hidden="true"></i>
                                    </a>
                                    <span class="tooltiptext" style="z-index: 9999;">
                                                由帐号近期的粉丝总量、当前视频的互动数据、音乐使用数等维度加权计算获得<br/>

                                                数值越高,上抖音热门的几率越大
                                                <a href="https://dy.feigua.cnhelp/detail/2/423.html"
                                                   target="_blank">【点击查看更多】</a>

                                            </span>
                                </div>
                            </div>
                        </th>
                        <th>视频内容</th>
                        <th>播主</th>
                        <th>点赞数</th>
                        <th>评论数</th>
                        <th class="text-center">操作</th>
                    </tr>
                    </thead>
                </table>
            </div>
            <div class="table-thead  s-thead">
                <table class="table">
                    <colgroup>
                        <col width="135"/>
                        <col/>
                        <col width="120"/>
                        <col width="100"/>
                        <col width="100"/>
                        <col width="186"/>
                    </colgroup>
                    <tbody class="js-awemelist" id="js-awemelist">


                    <tr data-id="10520945" data-awemeid="6909005165450300683" class="js-slider-aweme">
                        <td>
                            <span class="risk-index ">5.3</span>
                        </td>
                        <td>
                            <div class="media-list">
                                <div class="item-media">
                                    <img src="http://img2.feigua.cn/img/tos-cn-p-0015/a61de251b9004ca0852312868ab1dc2b~c5_300x400.jpeg?from=2563711402_large-thumb"/>
                                </div>
                                <div class="item-inner">
                                    <div class="item-title">
                                        <a href="javascript:;" class="js-open-aweme-pop" data-id="10520945"
                                           data-awemeid="6909005165450300683" data-active="detail">
                                            街霸2长春石坤神操作走升龙#街霸直播 #怀旧游戏 #拳皇 #主机单机直播 #街机游戏 #经典游戏

                                        </a>
                                    </div>

                                    <div class="item-times">
                                        <lable>视频时长:</lable>
                                        10秒
                                    </div>
                                </div>
                            </div>
                        </td>
                        <td>
                            <div class="">
                                <div class="item-inner">
                                    <a href="#/Blogger/Detail?id=10520945&amp;timestamp=1608710433&amp;signature=738c92a90995b5204137988810e7e20e"
                                           target="_blank" class="">街霸2长春石坤-王者归来</a>
                                    <div class="item-sub-title">23 小时前</div>
                                </div>
                            </div>
                        </td>
                        <td>35</td>
                        <td>6</td>
                        <td>
                            <div class="mp-article-source">
                                <a href="javascript:;" data-id="10520945" data-awemeid="6909005165450300683"
                                   data-active="detail" class="source-details js-open-aweme-pop" data-toggle="tooltip"
                                   data-placement="top" title="指数分析">
                                    <i class="icon-details"></i>
                                </a>
                                <a href="javascript:;" data-id="10520945" data-awemeid="6909005165450300683"
                                   data-active="fans" class="fans-analysis js-open-aweme-pop" data-toggle="tooltip"
                                   data-placement="top" title="视频观众分析"><i class="icon-fans" aria-hidden="true"></i></a>
                                <a href="https://www.douyin.com/share/video/6909005165450300683/?mid=6909005248086821640"
                                   target="_blank" class="source-play" data-toggle="tooltip" data-placement="top"
                                   title="播放">
                                    <i class="icon-play" aria-hidden="true"></i>
                                </a>
                                <a href="javascript:;" class="source-collection js-list-sync-aweme" data-id="1480450117"
                                   data-toggle="tooltip" data-placement="top" title="收藏">
                                    <i class="icon-star1"></i>
                                </a>

                            </div>
                        </td>
                    </tr>
                    <tr data-id="13621266" data-awemeid="6909036610629700879" class="js-slider-aweme">
                        <td>
                            <span class="risk-index ">0.1</span>
                        </td>
                        <td>
                            <div class="media-list">
                                <div class="item-media">

                                        <img src="http://img2.feigua.cn/img/tos-cn-p-0015/f513a503f78849f696a9a063de72886f~c5_300x400.jpeg?from=2563711402_large-thumb"/>
                                </div>
                                <div class="item-inner">
                                    <div class="item-title">
                                        <a href="javascript:;" class="js-open-aweme-pop" data-id="13621266"
                                           data-awemeid="6909036610629700879" data-active="detail">
                                            #街霸2 #街霸女玩家大宝 双击评论加关注,每天晚上8点直播,感谢支持

                                        </a>
                                    </div>

                                    <div class="item-times">
                                        <lable>视频时长:</lable>
                                        49秒
                                    </div>
                                </div>
                            </div>
                        </td>
                        <td>
                            <div class="">
                                <div class="item-inner">
                                    <a href="#/Blogger/Detail?id=13621266&amp;timestamp=1608710433&amp;signature=fcd82fe01f24667bf8d6d1b2e389f962"
                                           target="_blank" class="">街霸2女主播大宝</a>
                                    <div class="item-sub-title">21 小时前</div>
                                </div>
                            </div>
                        </td>
                        <td>33</td>
                        <td>4</td>
                        <td>
                            <div class="mp-article-source">
                                <a href="javascript:;" data-id="13621266" data-awemeid="6909036610629700879"
                                   data-active="detail" class="source-details js-open-aweme-pop" data-toggle="tooltip"
                                   data-placement="top" title="指数分析">
                                    <i class="icon-details"></i>
                                </a>
                                <a href="javascript:;" data-id="13621266" data-awemeid="6909036610629700879"
                                   data-active="fans" class="fans-analysis js-open-aweme-pop" data-toggle="tooltip"
                                   data-placement="top" title="视频观众分析"><i class="icon-fans" aria-hidden="true"></i></a>
                                <a href="https://www.douyin.com/share/video/6909036610629700879/?mid=6909036690301127437"
                                   target="_blank" class="source-play" data-toggle="tooltip" data-placement="top"
                                   title="播放">
                                    <i class="icon-play" aria-hidden="true"></i>
                                </a>
                                <a href="javascript:;" class="source-collection js-list-sync-aweme" data-id="1481361695"
                                   data-toggle="tooltip" data-placement="top" title="收藏">
                                    <i class="icon-star1"></i>
                                </a>

                            </div>
                        </td>
                    </tr>
                    <tr data-id="13621266" data-awemeid="6909025484391206159" class="js-slider-aweme">
                        <td>
                            <span class="risk-index ">0.1</span>
                        </td>
                        <td>
                            <div class="media-list">
                                <div class="item-media">
                                    <img src="http://img2.feigua.cn/img/tos-cn-p-0015/a5cc846395cb4547b291431b49d24848~c5_300x400.jpeg?from=2563711402_large-thumb"/>

                                </div>
                                <div class="item-inner">
                                    <div class="item-title">
                                        <a href="javascript:;" class="js-open-aweme-pop" data-id="13621266"
                                           data-awemeid="6909025484391206159" data-active="detail">
                                            #热门 #街霸2 #街霸女玩家大宝 #求关注 #街机摇杆 需要摇杆加大宝vx:1220486797

                                        </a>
                                    </div>

                                    <div class="item-times">
                                        <lable>视频时长:</lable>
                                        2分8秒
                                    </div>
                                </div>
                            </div>
                        </td>
                        <td>
                            <div class="">
                                <div class="item-inner">
                                    <a href="#/Blogger/Detail?id=13621266&amp;timestamp=1608710433&amp;signature=fcd82fe01f24667bf8d6d1b2e389f962"
                                           target="_blank" class="">街霸2女主播大宝</a>

                                    <div class="item-sub-title">22 小时前</div>
                                </div>
                            </div>
                        </td>
                        <td>9</td>
                        <td>5</td>
                        <td>
                            <div class="mp-article-source">
                                <a href="javascript:;" data-id="13621266" data-awemeid="6909025484391206159"
                                   data-active="detail" class="source-details js-open-aweme-pop" data-toggle="tooltip"
                                   data-placement="top" title="指数分析">
                                    <i class="icon-details"></i>
                                </a>
                                <a href="javascript:;" data-id="13621266" data-awemeid="6909025484391206159"
                                   data-active="fans" class="fans-analysis js-open-aweme-pop" data-toggle="tooltip"
                                   data-placement="top" title="视频观众分析"><i class="icon-fans" aria-hidden="true"></i></a>
                                <a href="https://www.douyin.com/share/video/6909025484391206159/?mid=6909025602734656263"
                                   target="_blank" class="source-play" data-toggle="tooltip" data-placement="top"
                                   title="播放">
                                    <i class="icon-play" aria-hidden="true"></i>
                                </a>
                                <a href="javascript:;" class="source-collection js-list-sync-aweme" data-id="1481361704"
                                   data-toggle="tooltip" data-placement="top" title="收藏">
                                    <i class="icon-star1"></i>
                                </a>

                            </div>
                        </td>
                    </tr>
                    <tr data-id="7757187" data-awemeid="6909004554730343693" class="js-slider-aweme">
                        <td>
                            <span class="risk-index ">0.1</span>
                        </td>
                        <td>
                            <div class="media-list">
                                <div class="item-media">
                                    <img src="http://img2.feigua.cn/img/tos-cn-p-0015/8e479503a85c4f26ac05bb693f75f719~c5_300x400.jpeg?from=2563711402_large-thumb"/>

                                </div>
                                <div class="item-inner">
                                    <div class="item-title">
                                        <a href="javascript:;" class="js-open-aweme-pop" data-id="7757187"
                                           data-awemeid="6909004554730343693" data-active="detail">
                                            #永不过时的游戏大全 #街霸2 达尔希姆吐火时说的啥至今没听清楚😄

                                        </a>
                                    </div>

                                    <div class="item-times">
                                        <lable>视频时长:</lable>
                                        59秒
                                    </div>
                                </div>
                            </div>
                        </td>
                        <td>
                            <div class="">
                                <div class="item-inner">
                                    <a href="#/Blogger/Detail?id=7757187&amp;timestamp=1608710433&amp;signature=0f050e57c87c2c2b30c1201d601027c4"
                                           target="_blank" class="">街机游戏协会会长</a>
                                    <div class="item-sub-title">23 小时前</div>
                                </div>
                            </div>
                        </td>
                        <td>3</td>
                        <td>0</td>
                        <td>
                            <div class="mp-article-source">
                                <a href="javascript:;" data-id="7757187" data-awemeid="6909004554730343693"
                                   data-active="detail" class="source-details js-open-aweme-pop" data-toggle="tooltip"
                                   data-placement="top" title="指数分析">
                                    <i class="icon-details"></i>
                                </a>
                                <a href="javascript:;" data-id="7757187" data-awemeid="6909004554730343693"
                                   data-active="fans" class="fans-analysis js-open-aweme-pop" data-toggle="tooltip"
                                   data-placement="top" title="视频观众分析"><i class="icon-fans" aria-hidden="true"></i></a>
                                <a href="https://www.douyin.com/share/video/6909004554730343693/?mid=6909004646233492232"
                                   target="_blank" class="source-play" data-toggle="tooltip" data-placement="top"
                                   title="播放">
                                    <i class="icon-play" aria-hidden="true"></i>
                                </a>
                                <a href="javascript:;" class="source-collection js-list-sync-aweme" data-id="1480331586"
                                   data-toggle="tooltip" data-placement="top" title="收藏">
                                    <i class="icon-star1"></i>
                                </a>

                            </div>
                        </td>
                    </tr>
                    <tr data-id="9154258" data-awemeid="6909033118515186956" class="js-slider-aweme">
                        <td>
                            <span class="risk-index ">0.1</span>
                        </td>
                        <td>
                            <div class="media-list">
                                <div class="item-media">
                                    <img src="http://img2.feigua.cn/img/tos-cn-p-0015/fe9bff15c42c43f08cfe95c23d3abed9_1608634679~c5_300x400.jpg?from=2563711402_large-thumb"/>

                                </div>
                                <div class="item-inner">
                                    <div class="item-title">
                                        <a href="javascript:;" class="js-open-aweme-pop" data-id="9154258"
                                           data-awemeid="6909033118515186956" data-active="detail">
                                            我最后的三强赛直播精选片段 背水一战#街霸2 #背水一战 #永不过时的游戏大全

                                        </a>
                                    </div>

                                    <div class="item-times">
                                        <lable>视频时长:</lable>
                                        3分54秒
                                    </div>
                                </div>
                            </div>
                        </td>
                        <td>
                            <div class="">
                                <div class="item-inner">
                                    <a href="#/Blogger/Detail?id=9154258&amp;timestamp=1608710433&amp;signature=0bf9fd894837b32e1f014e2df54d2a37"
                                           target="_blank" class="">拳王专业户</a>
                                    <div class="item-sub-title">21 小时前</div>
                                </div>
                            </div>
                        </td>
                        <td>18</td>
                        <td>1</td>
                        <td>
                            <div class="mp-article-source">
                                <a href="javascript:;" data-id="9154258" data-awemeid="6909033118515186956"
                                   data-active="detail" class="source-details js-open-aweme-pop" data-toggle="tooltip"
                                   data-placement="top" title="指数分析">
                                    <i class="icon-details"></i>
                                </a>
                                <a href="javascript:;" data-id="9154258" data-awemeid="6909033118515186956"
                                   data-active="fans" class="fans-analysis js-open-aweme-pop" data-toggle="tooltip"
                                   data-placement="top" title="视频观众分析"><i class="icon-fans" aria-hidden="true"></i></a>
                                <a href="https://www.douyin.com/share/video/6909033118515186956/?mid=6896326384122087437"
                                   target="_blank" class="source-play" data-toggle="tooltip" data-placement="top"
                                   title="播放">
                                    <i class="icon-play" aria-hidden="true"></i>
                                </a>
                                <a href="javascript:;" class="source-collection js-list-sync-aweme" data-id="1480526757"
                                   data-toggle="tooltip" data-placement="top" title="收藏">
                                    <i class="icon-star1"></i>
                                </a>

                            </div>
                        </td>
                    </tr>
                    <tr data-id="11947435" data-awemeid="6909314189307677966" class="js-slider-aweme">
                        <td>
                            <span class="risk-index ">0.1</span>
                        </td>
                        <td>
                            <div class="media-list">
                                <div class="item-media">
                                    <img src="http://img2.feigua.cn/img/tos-cn-i-0004/ed52b168b6c24b45a6a3f77417b6a87e~c5_300x400.jpeg?from=2563711402_large-thumb"/>

                                </div>
                                <div class="item-inner">
                                    <div class="item-title">
                                        <a href="javascript:;" class="js-open-aweme-pop" data-id="11947435"
                                           data-awemeid="6909314189307677966" data-active="detail">
                                            街霸2:两位"枪王"带来精彩兵警对局!温州20强PK键盘实力派大狗熊 #街霸2 #天津西风

                                        </a>
                                    </div>

                                    <div class="item-times">
                                        <lable>视频时长:</lable>
                                        2分12秒
                                    </div>
                                </div>
                            </div>
                        </td>
                        <td>
                            <div class="">
                                <div class="item-inner">
                                    <a href="#/Blogger/Detail?id=11947435&amp;timestamp=1608710433&amp;signature=33811074707bc32a4645e6afd51dbf13"
                                           target="_blank" class="">街霸2西风</a>
                                    <div class="item-sub-title">3 小时前</div>
                                </div>
                            </div>
                        </td>
                        <td>8</td>
                        <td>0</td>
                        <td>
                            <div class="mp-article-source">
                                <a href="javascript:;" data-id="11947435" data-awemeid="6909314189307677966"
                                   data-active="detail" class="source-details js-open-aweme-pop" data-toggle="tooltip"
                                   data-placement="top" title="指数分析">
                                    <i class="icon-details"></i>
                                </a>
                                <a href="javascript:;" data-id="11947435" data-awemeid="6909314189307677966"
                                   data-active="fans" class="fans-analysis js-open-aweme-pop" data-toggle="tooltip"
                                   data-placement="top" title="视频观众分析"><i class="icon-fans" aria-hidden="true"></i></a>
                                <a href="https://www.douyin.com/share/video/6909314189307677966/?mid=6909314357843299085"
                                   target="_blank" class="source-play" data-toggle="tooltip" data-placement="top"
                                   title="播放">
                                    <i class="icon-play" aria-hidden="true"></i>
                                </a>
                                <a href="javascript:;" class="source-collection js-list-sync-aweme" data-id="1481467896"
                                   data-toggle="tooltip" data-placement="top" title="收藏">
                                    <i class="icon-star1"></i>
                                </a>

                            </div>
                        </td>
                    </tr>
                    <tr data-id="11947435" data-awemeid="6909307306790391053" class="js-slider-aweme">
                        <td>
                            <span class="risk-index ">0.1</span>
                        </td>
                        <td>
                            <div class="media-list">
                                <div class="item-media">
                                    <img src="http://img2.feigua.cn/img/tos-cn-i-0004/782186e6392d44b8a99e2a6de3c84061~c5_300x400.jpeg?from=2563711402_large-thumb"/>
                                </div>
                                <div class="item-inner">
                                    <div class="item-title">
                                        <a href="javascript:;" class="js-open-aweme-pop" data-id="11947435"
                                           data-awemeid="6909307306790391053" data-active="detail">
                                            街霸2:顶级红人狮子对局!傲视张无忌PK晓峰CE #西风 #街霸 #天津西风 #街霸2

                                        </a>
                                    </div>

                                    <div class="item-times">
                                        <lable>视频时长:</lable>
                                        3分43秒
                                    </div>
                                </div>
                            </div>
                        </td>
                        <td>
                            <div class="">
                                <div class="item-inner">
                                    <a href="#/Blogger/Detail?id=11947435&amp;timestamp=1608710433&amp;signature=33811074707bc32a4645e6afd51dbf13"
                                           target="_blank" class="">街霸2西风</a>
                                    <div class="item-sub-title">3 小时前</div>
                                </div>
                            </div>
                        </td>
                        <td>20</td>
                        <td>0</td>
                        <td>
                            <div class="mp-article-source">
                                <a href="javascript:;" data-id="11947435" data-awemeid="6909307306790391053"
                                   data-active="detail" class="source-details js-open-aweme-pop" data-toggle="tooltip"
                                   data-placement="top" title="指数分析">
                                    <i class="icon-details"></i>
                                </a>
                                <a href="javascript:;" data-id="11947435" data-awemeid="6909307306790391053"
                                   data-active="fans" class="fans-analysis js-open-aweme-pop" data-toggle="tooltip"
                                   data-placement="top" title="视频观众分析"><i class="icon-fans" aria-hidden="true"></i></a>
                                <a href="https://www.douyin.com/share/video/6909307306790391053/?mid=6909307475753814792"
                                   target="_blank" class="source-play" data-toggle="tooltip" data-placement="top"
                                   title="播放">
                                    <i class="icon-play" aria-hidden="true"></i>
                                </a>
                                <a href="javascript:;" class="source-collection js-list-sync-aweme" data-id="1481467897"
                                   data-toggle="tooltip" data-placement="top" title="收藏">
                                    <i class="icon-star1"></i>
                                </a>

                            </div>
                        </td>
                    </tr>


                    </tbody>
                </table>
            </div>
            <div class="page-load" id="js-pager-end" style="display:none;">没有更多了~</div>
            <div id="js-pager-limit"></div>
        </div>
</div>

</body>
</html>

The parsing code is as follows:

package test;

import org.dom4j.Document;
import org.dom4j.Element;
import org.dom4j.io.SAXReader;

import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class Demo {

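    /**
     * Finds the first direct child element whose attribute has the given value.
     *
     * @param e     parent element to search
     * @param attr  attribute name
     * @param value expected attribute value
     * @return the matching child element, or null if none matches
     */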
    private Element getElementByAttribute(Element e, String attr, String value) {
        List<Element> elements = e.elements();
        for (Element ele : elements) {
            if (ele.attribute(attr) != null && ele.attributeValue(attr).equals(value)) {
                return ele;
            }
        }
        return null;
    }

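    /**
     * Extracts the video information from the file.
     *
     * @param file the HTML file to parse
     * @return one String array per table row; an empty list if parsing fails
     */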
    public List<String[]> getMovieInfos(File file) {
        ArrayList<String[]> ret = new ArrayList<>();
        SAXReader xmlReader = new SAXReader();
        try {
            // Locate the <tbody> that holds the data rows
            Document document = xmlReader.read(file);
            Element page = document.getRootElement().element("body").elements().get(0);
            Element article = getElementByAttribute(page, "id", "article");
            Element element = article.elements().get(2);
            Element tbody = element.element("table").element("tbody");
            List<Element> elements = tbody.elements();
         
            for (int i = 0; i < elements.size(); i++) {
                // Extract propagation index, thumbnail URL, blogger, publish time, like count, comment count
                String[] movieInfos = new String[6];
                Element tr = elements.get(i);
                String spread = tr.elements().get(0).element("span").getText().trim();
                String img = tr.elements().get(1).element("div").element("div").element("img").attributeValue("src").trim();
                String author = tr.elements().get(2).element("div").element("div").element("a").getText().trim();
                String time = tr.elements().get(2).element("div").element("div").element("div").getText().trim();
                String like = tr.elements().get(3).getText().trim();
                String comment = tr.elements().get(4).getText().trim();
                movieInfos[0] = spread;
                movieInfos[1] = img;
                movieInfos[2] = author;
                movieInfos[3] = time;
                movieInfos[4] = like;
                movieInfos[5] = comment;
                ret.add(movieInfos);
            }
            return ret;
        } catch (Exception e) {
            e.printStackTrace();
        }
        return new ArrayList<>();
    }

    public static void main(String[] args) {
        Demo demo = new Demo();
        List<String[]> movieInfos = demo.getMovieInfos(new File("test.html"));
        for (String[] movieInfo : movieInfos) {
            System.out.println(Arrays.toString(movieInfo));
        }
    }
}
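The code above reaches the table body by position (`elements().get(0)`, `elements().get(2)`), which breaks as soon as the page layout shifts, and its attribute lookup only inspects direct children. One possible alternative, sketched here under the assumption that the `js-awemelist` id from the sample page stays stable, is a recursive search over all descendants. The class name, method names, and the trimmed-down markup in the demo are hypothetical.

```java
import org.dom4j.DocumentHelper;
import org.dom4j.Element;

import java.util.List;

public class DeepFinder {

    /** Depth-first search for the first descendant of e whose attribute attr equals value. */
    public static Element findDescendantByAttribute(Element e, String attr, String value) {
        List<Element> children = e.elements();
        for (Element child : children) {
            // attributeValue returns null when the attribute is absent,
            // so calling equals on value avoids a NullPointerException
            if (value.equals(child.attributeValue(attr))) {
                return child;
            }
            Element hit = findDescendantByAttribute(child, attr, value);
            if (hit != null) {
                return hit;
            }
        }
        return null; // nothing matched below e
    }

    /** Counts the rows of a reduced, made-up version of the sample page. */
    public static int demoRowCount() throws Exception {
        String xml = "<html><body><div><div id=\"article\"><div><table>"
                + "<tbody id=\"js-awemelist\"><tr/><tr/></tbody>"
                + "</table></div></div></div></body></html>";
        Element root = DocumentHelper.parseText(xml).getRootElement();
        Element tbody = findDescendantByAttribute(root, "id", "js-awemelist");
        return tbody.elements("tr").size();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(demoRowCount()); // prints 2
    }
}
```

With a helper like this, the row loop in getMovieInfos could start from the located tbody without depending on how many wrapper divs precede it.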

Output:

[Image: screenshot of the program output]