Extracting the content of mixed-content (complex) elements with dom4j. Let's start with an example:
- <?xml version="1.0" encoding="utf-8"?>
- <resources>
- <!-- About -->
- <string name="ABOUT_TERMS_OF_SERVICE_LINK"><a href="http://www.webex.com/terms-of-service.html">Terms of Service</a></string>
- <string name="ABOUT_PRIVACY_STATEMENT_LINK"><a href="http://www.cisco.com/web/siteassets/legal/privacy.html">Privacy Statement</a></string>
- <string name="ABOUT_THIRD_PARTY_LINK"><a href="http://www.webex.com/legal/license.html">Third Party Licenses and Notices (including free/open source software)</a></string>
- <!-- Term of use -->
- <string name="TERMSOFUSE_LINK">I have reviewed and agree to the <a href="http://m.webex.com">Terms of Service</a></string>
- <string name="TERMSOFUSE_TITLE">Cisco WebEx Meetings</string>
- <string name="TERMSOFUSE_BUTTON_OK">I accept</string>
- <string name="TERMSOFUSE_BUTTON_CANCEL">I do not accept</string>
- </resources>
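Suppose we need to extract, from <string name="ABOUT_THIRD_PARTY_LINK"><a href="http://www.webex.com/legal/license.html">Third Party Licenses and Notices (including free/open source software)</a></string>, the inner content <a href="http://www.webex.com/legal/license.html">Third Party Licenses and Notices (including free/open source software)</a>. If you rely only on the getXX methods that dom4j provides, the result will most likely be disappointing: getText(), for instance, returns only the element's own text and does not recurse into child elements, so the <a> markup is lost. The answer lies in the superinterfaces of Element. Of its three superinterfaces, Branch, Cloneable and Node, Branch declares a content() method that returns a List of Node objects: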
public List content()
Returns the content nodes of this branch as a backed List so that the content of this branch may be modified directly using the List interface. The List is backed by the Branch so that changes to the list are reflected in the branch and vice versa.
We can therefore use it to extract the node content. The code is as follows:
- import java.util.HashMap;
- import java.util.Iterator;
- import java.util.List;
- import java.util.Map;
- import org.dom4j.Document;
- import org.dom4j.Element;
- import org.dom4j.Node;
-
- /* Maps each child of the root element to its content, keeping nested markup. */
- public Map<String, String> getTagOfEnglishStrFromXml(Document doc) {
-     Map<String, String> map = new HashMap<String, String>();
-     Element root = doc.getRootElement();
-     if (null == root || !root.hasContent()) {
-         return null;
-     }
-     String rootName = root.getName();
-     List children = root.elements(); /* fetch the child elements once */
-     int childNum = children.size();
-     if (childNum < 1) {
-         return null;
-     }
-     int elementSequence = 0;
-     for (int cindex = 0; cindex < childNum; cindex++) { /* for each string element */
-         ++elementSequence;
-         Element stringElem = (Element) children.get(cindex);
-         String tagName = "";
-         /* produce the tag name by rules */
-         int attrCount = stringElem.attributeCount();
-         if (attrCount < 1) {
-             /* no attributes: fall back to root_element_sequence */
-             tagName = rootName + "_" + stringElem.getName() + "_" + String.valueOf(elementSequence);
-         } else {
-             /* otherwise concatenate all attribute values */
-             for (int i = 0; i < attrCount; i++) {
-                 tagName += stringElem.attribute(i).getValue();
-             }
-         }
-         String englishStr = "";
-         if (stringElem.isTextOnly()) {
-             englishStr = stringElem.getText();
-         } else {
-             /* mixed content: walk the child nodes in document order */
-             List list = stringElem.content();
-             Iterator iterator = list.iterator();
-             while (iterator.hasNext()) {
-                 Node node = (Node) iterator.next();
-                 switch (node.getNodeType()) {
-                     case Node.ELEMENT_NODE: englishStr += node.asXML(); /* keep the markup */
-                         break;
-                     case Node.TEXT_NODE: englishStr += node.getText();
-                         break;
-                     default: /* other node types (e.g. comments) are skipped */
-                         break;
-                 }
-             }
-         }
-         map.put(tagName, englishStr);
-     }
-     return map;
- }
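The key-naming rule inside the loop can be looked at in isolation. The following is a minimal sketch of that rule only, with no dom4j dependency; the class name TagNameRule and the method buildTagName are hypothetical names introduced for illustration, and the attribute values are passed in as a plain list instead of being read from an Element.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical helper that mirrors the naming rule used in the loop above.
class TagNameRule {

    /**
     * No attributes: rootName_elementName_sequence.
     * Otherwise: all attribute values concatenated in order, without a separator.
     */
    static String buildTagName(String rootName, String elementName,
                               int sequence, List<String> attributeValues) {
        if (attributeValues.isEmpty()) {
            return rootName + "_" + elementName + "_" + sequence;
        }
        StringBuilder name = new StringBuilder();
        for (String value : attributeValues) {
            name.append(value);
        }
        return name.toString();
    }

    public static void main(String[] args) {
        // A <string> element without attributes, third child of <resources>.
        System.out.println(buildTagName("resources", "string", 3,
                Collections.<String>emptyList()));
        // A <string> element whose only attribute is name="ABOUT_THIRD_PARTY_LINK".
        System.out.println(buildTagName("resources", "string", 3,
                Arrays.asList("ABOUT_THIRD_PARTY_LINK")));
    }
}
```

Since the values are joined without a separator, two elements with different attributes can in principle end up with the same key, in which case the later map.put overwrites the earlier entry. For the sample file, where every element has a single unique name attribute, this does not happen.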
Pay particular attention to this part of the code:
- String englishStr = "";
- if (stringElem.isTextOnly()) {
-     englishStr = stringElem.getText();
- } else {
-     /* mixed content: walk the child nodes in document order */
-     List list = stringElem.content();
-     Iterator iterator = list.iterator();
-     while (iterator.hasNext()) {
-         Node node = (Node) iterator.next();
-         switch (node.getNodeType()) {
-             case Node.ELEMENT_NODE: englishStr += node.asXML(); /* keep the markup */
-                 break;
-             case Node.TEXT_NODE: englishStr += node.getText();
-                 break;
-             default: /* other node types (e.g. comments) are skipped */
-                 break;
-         }
-     }
- }
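To try the same walk without dom4j on the classpath, here is a rough equivalent written against the JDK's built-in W3C DOM API. Note that this swaps in a different library: getChildNodes() plays the role of content(), and an identity Transformer stands in for asXML(). The class name MixedContentDemo and the methods parseRoot and innerXml are made up for this sketch.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Sketch using the JDK's W3C DOM, not dom4j.
class MixedContentDemo {

    /** Parses an XML string and returns its root element. */
    static Element parseRoot(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }

    /** Rebuilds an element's inner content: text as-is, child elements as markup. */
    static String innerXml(Element element) {
        try {
            Transformer serializer = TransformerFactory.newInstance().newTransformer();
            serializer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringBuilder out = new StringBuilder();
            NodeList children = element.getChildNodes();
            for (int i = 0; i < children.getLength(); i++) {
                Node node = children.item(i);
                switch (node.getNodeType()) {
                    case Node.ELEMENT_NODE: {
                        // Serialize the child element together with its tags.
                        StringWriter markup = new StringWriter();
                        serializer.transform(new DOMSource(node), new StreamResult(markup));
                        out.append(markup.toString());
                        break;
                    }
                    case Node.TEXT_NODE:
                        out.append(node.getNodeValue());
                        break;
                    default:
                        // Other node types (e.g. comments) are skipped.
                        break;
                }
            }
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("could not serialize element", e);
        }
    }

    public static void main(String[] args) {
        Element link = parseRoot("<string name=\"TERMSOFUSE_LINK\">I have reviewed and agree to the "
                + "<a href=\"http://m.webex.com\">Terms of Service</a></string>");
        System.out.println(link.getTextContent()); // text only, the <a> tag is gone
        System.out.println(innerXml(link));        // text plus the <a> element as markup
    }
}
```

The shape is the same as in the dom4j version: iterate over the child nodes in document order and decide per node type whether to emit text or serialized markup.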