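Aspose.Words is an advanced Word document processing API for performing a wide range of document management and manipulation tasks. The API supports generating, modifying, converting, rendering, and printing documents without using Microsoft Word directly in cross-platform applications. It also supports all popular word-processing file formats and can export or convert Word documents to fixed-layout file formats and the most commonly used image/multimedia formats. This article shows how to extract text from Word documents in Java.

Extracting text from Word documents comes up in many scenarios: analyzing text, for example, or pulling out specific parts of documents and combining them into a single document. In this article, you will learn how to extract text from Word documents programmatically in Java. We will also cover how to dynamically extract the content between specific elements such as paragraphs and tables.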

Download the latest Aspose.Words

https://www.evget.com/product/564

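Getting the Java Library for Extracting Text from Word Documents

Aspose.Words for Java is a powerful library that lets you create MS Word documents from scratch. It can also manipulate existing Word documents: encryption, conversion, text extraction, and more. We will use this library to extract text from Word DOCX or DOC documents. You can download the API's JAR or install it with the following Maven configuration.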

<repository>
    <id>AsposeJavaAPI</id>
    <name>Aspose Java API</name>
    <url>https://repository.aspose.com/repo/</url>
</repository>
<dependency>
    <groupId>com.aspose</groupId>
    <artifactId>aspose-words</artifactId>
    <version>22.6</version>
    <type>pom</type>
</dependency>

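Extracting Text from Word DOC/DOCX in Java

An MS Word document is made up of various elements, including paragraphs, tables, and images. The requirements for text extraction therefore vary from one scenario to another. For example, you may need to extract the text between paragraphs, bookmarks, comments, and so on.

Every element in a Word DOC/DOCX is represented as a node, so processing a document means working with nodes. Let's look at how to extract text from Word documents in different scenarios.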
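
To make the node idea concrete, here is a minimal sketch of a document as a tree of nodes. SimpleNode and NodeTreeDemo are hypothetical stand-ins written for this illustration, not Aspose.Words classes; the real library exposes a much richer node hierarchy. The sketch only shows that a body holds paragraphs, paragraphs hold runs, and collecting text means visiting the leaves in document order.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a document node; NOT the Aspose.Words Node class.
class SimpleNode {
    final String type;                              // e.g. "BODY", "PARAGRAPH", "RUN"
    final String text;                              // only leaf nodes carry text
    final List<SimpleNode> children = new ArrayList<>();
    SimpleNode parent;

    SimpleNode(String type, String text) {
        this.type = type;
        this.text = text;
    }

    // Appends a child and returns this node so calls can be chained.
    SimpleNode add(SimpleNode child) {
        child.parent = this;
        children.add(child);
        return this;
    }

    // Concatenates the text of all leaf nodes in document order.
    String collectText() {
        if (children.isEmpty())
            return text == null ? "" : text;
        StringBuilder sb = new StringBuilder();
        for (SimpleNode child : children)
            sb.append(child.collectText());
        return sb.toString();
    }
}

class NodeTreeDemo {
    // Builds a body with two paragraphs, each made of runs.
    static SimpleNode sampleBody() {
        SimpleNode p1 = new SimpleNode("PARAGRAPH", null)
                .add(new SimpleNode("RUN", "Hello, "))
                .add(new SimpleNode("RUN", "world."));
        SimpleNode p2 = new SimpleNode("PARAGRAPH", null)
                .add(new SimpleNode("RUN", " Second paragraph."));
        return new SimpleNode("BODY", null).add(p1).add(p2);
    }

    public static void main(String[] args) {
        System.out.println(sampleBody().collectText());
    }
}
```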

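Extracting Text from a Word DOC in Java

In this section, we will implement a Java text extractor for Word documents. The extraction workflow is as follows:

  • First, we define the nodes to include in the extraction.
  • Next, we extract the content between the specified nodes (with or without the start and end nodes themselves).
  • Finally, we use clones of the extracted nodes, for example to create a new Word document that contains the extracted content.

Now let's write a method named extractContent, to which we pass the nodes and a few other parameters to perform the extraction. The method walks the document and clones nodes. It takes the following parameters.

  1. startNode and endNode: the start and end points of the content extraction. These can be block-level nodes (Paragraph, Table) or inline-level nodes (for example Run, FieldStart, BookmarkStart).
       • To pass a field, pass the corresponding FieldStart object.
       • To pass a bookmark, pass the BookmarkStart and BookmarkEnd nodes.
       • For comments, use the CommentRangeStart and CommentRangeEnd nodes.
  2. isInclusive: defines whether the markers are included in the extraction. If this option is set to false and the same node or consecutive nodes are passed, an empty list is returned.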
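
The effect of isInclusive is easiest to see on a flat list. The sketch below is an illustration only: RangeSketch.between is a hypothetical helper that treats the body's block-level children as a plain list and ignores the splitting of inline markers that the real method has to perform.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class RangeSketch {
    // Returns the items between startIndex and endIndex of a flat list that
    // stands in for the block-level children of a document body.
    static <T> List<T> between(List<T> blocks, int startIndex, int endIndex, boolean isInclusive) {
        if (startIndex < 0 || endIndex >= blocks.size() || startIndex > endIndex)
            throw new IllegalArgumentException("Invalid marker positions");

        int from = isInclusive ? startIndex : startIndex + 1;
        int to = isInclusive ? endIndex + 1 : endIndex;   // exclusive upper bound

        // Same or consecutive markers with isInclusive == false leave nothing in between.
        if (from >= to)
            return new ArrayList<>();
        return new ArrayList<>(blocks.subList(from, to));
    }

    public static void main(String[] args) {
        List<String> blocks = Arrays.asList("P0", "P1", "P2", "P3");
        System.out.println(between(blocks, 1, 3, true));
        System.out.println(between(blocks, 1, 3, false));
        System.out.println(between(blocks, 1, 2, false));
    }
}
```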

Below is the complete implementation of the extractContent method, which extracts the content between the nodes passed to it.

// For complete examples and data files, please go to https://github.com/aspose-words/Aspose.Words-for-Java
// Requires: import com.aspose.words.*; import java.util.ArrayList;
public static ArrayList<Node> extractContent(Node startNode, Node endNode, boolean isInclusive) throws Exception {
    // First check that the nodes passed to this method are valid for use.
    verifyParameterNodes(startNode, endNode);

    // Create a list to store the extracted nodes.
    ArrayList<Node> nodes = new ArrayList<Node>();

    // Keep a record of the original nodes passed to this method so we can split marker nodes if needed.
    Node originalStartNode = startNode;
    Node originalEndNode = endNode;

    // Extract content based on block-level nodes (paragraphs and tables). Walk up through the
    // parent nodes until we reach the direct children of the body.
    // The first and last nodes are split later if the marker nodes are inline.
    while (startNode.getParentNode().getNodeType() != NodeType.BODY)
        startNode = startNode.getParentNode();

    while (endNode.getParentNode().getNodeType() != NodeType.BODY)
        endNode = endNode.getParentNode();

    boolean isExtracting = true;
    boolean isStartingNode = true;
    boolean isEndingNode;
    // The current node we are extracting from the document.
    Node currNode = startNode;

    // Begin extracting content. Process all block-level nodes and split the first and last nodes
    // when needed so that paragraph formatting is retained.
    // The method is a little more complex than a regular extractor because it also has to handle
    // inline nodes, fields, bookmarks, and so on.
    while (isExtracting) {
        // Clone the current node and its children to obtain a copy.
        CompositeNode cloneNode = null;
        if (currNode.isComposite()) {
            cloneNode = (CompositeNode) currNode.deepClone(true);
        } else if (currNode.getNodeType() == NodeType.BOOKMARK_END) {
            // A BookmarkEnd that sits directly in the body is not composite. Wrap its clone
            // in a new paragraph so it can be handled like the other block-level nodes.
            Paragraph paragraph = new Paragraph(currNode.getDocument());
            paragraph.getChildNodes().add(currNode.deepClone(true));
            cloneNode = (CompositeNode) paragraph.deepClone(true);
        }

        isEndingNode = currNode.equals(endNode);

        if (isStartingNode || isEndingNode) {
            // We need to process each marker separately, so pass it off to a separate method.
            if (isStartingNode) {
                processMarker(cloneNode, nodes, originalStartNode, isInclusive, isStartingNode, isEndingNode);
                isStartingNode = false;
            }

            // This conditional must be separate because the block-level start and end markers may be the same node.
            if (isEndingNode) {
                processMarker(cloneNode, nodes, originalEndNode, isInclusive, isStartingNode, isEndingNode);
                isExtracting = false;
            }
        } else if (cloneNode != null) {
            // The node is not a start or end marker, so simply add the copy to the list.
            // Non-composite nodes without a clone are skipped instead of adding null.
            nodes.add(cloneNode);
        }

        // Move to the next node and extract it. If the next node is null, the rest of the
        // content is found in a different section.
        if (currNode.getNextSibling() == null && isExtracting) {
            // Move to the next section.
            Section nextSection = (Section) currNode.getAncestor(NodeType.SECTION).getNextSibling();
            currNode = nextSection.getBody().getFirstChild();
        } else {
            // Move to the next node in the body.
            currNode = currNode.getNextSibling();
        }
    }

    // Return the nodes between the node markers.
    return nodes;
}
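
The two while loops near the top of extractContent climb from each marker to the direct child of the body that contains it. The following sketch reproduces that climb on a toy tree. ToyNode and BodyLevelSketch are hypothetical names used only for this illustration and are not part of Aspose.Words.

```java
// Hypothetical stand-in for a node that knows its type and parent.
class ToyNode {
    final String type;
    final ToyNode parent;

    ToyNode(String type, ToyNode parent) {
        this.type = type;
        this.parent = parent;
    }
}

class BodyLevelSketch {
    // Walks up from node until the parent is the body, then returns that
    // body-level ancestor (or node itself if it already sits in the body).
    static ToyNode bodyLevelAncestor(ToyNode node) {
        ToyNode current = node;
        while (current.parent != null && !"BODY".equals(current.parent.type))
            current = current.parent;
        if (current.parent == null)
            throw new IllegalArgumentException("Node is not inside a body");
        return current;
    }

    public static void main(String[] args) {
        ToyNode body = new ToyNode("BODY", null);
        ToyNode table = new ToyNode("TABLE", body);
        ToyNode row = new ToyNode("ROW", table);
        ToyNode cell = new ToyNode("CELL", row);
        ToyNode para = new ToyNode("PARAGRAPH", cell);
        ToyNode run = new ToyNode("RUN", para);
        // A run nested inside a table cell resolves to the table itself.
        System.out.println(bodyLevelAncestor(run).type);
    }
}
```

This is why an inline marker deep inside a table still causes the whole table to be cloned first; the clone is trimmed afterwards in the marker-processing step.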

The extractContent method also needs a few helper methods to complete the text extraction, shown below.

/**
 * Checks the input parameters are correct and can be used. Throws an exception
 * if there is any problem.
 */
private static void verifyParameterNodes(Node startNode, Node endNode) throws Exception {
    // The order in which these checks are done is important.
    if (startNode == null)
        throw new IllegalArgumentException("Start node cannot be null");
    if (endNode == null)
        throw new IllegalArgumentException("End node cannot be null");

    if (!startNode.getDocument().equals(endNode.getDocument()))
        throw new IllegalArgumentException("Start node and end node must belong to the same document");

    if (startNode.getAncestor(NodeType.BODY) == null || endNode.getAncestor(NodeType.BODY) == null)
        throw new IllegalArgumentException("Start node and end node must be a child or descendant of a body");

    // Check the end node is after the start node in the DOM tree.
    // First check if they are in different sections, then if they're not, check
    // their position in the body of the same section they are in.
    Section startSection = (Section) startNode.getAncestor(NodeType.SECTION);
    Section endSection = (Section) endNode.getAncestor(NodeType.SECTION);

    int startIndex = startSection.getParentNode().indexOf(startSection);
    int endIndex = endSection.getParentNode().indexOf(endSection);

    if (startIndex == endIndex) {
        // Note: indexOf returns -1 for nodes that are not direct children of the body,
        // so this comparison is only meaningful for block-level markers.
        if (startSection.getBody().indexOf(startNode) > endSection.getBody().indexOf(endNode))
            throw new IllegalArgumentException("The end node must be after the start node in the body");
    } else if (startIndex > endIndex)
        throw new IllegalArgumentException("The section of end node must be after the section start node");
}

/**
 * Checks if a node passed is an inline node.
 */
private static boolean isInline(Node node) throws Exception {
    // Test if the node is a descendant of a Paragraph or Table node and is not itself
    // a paragraph or a table. (A paragraph inside a comment, which is in turn a
    // descendant of a paragraph, is possible.)
    return ((node.getAncestor(NodeType.PARAGRAPH) != null || node.getAncestor(NodeType.TABLE) != null)
            && !(node.getNodeType() == NodeType.PARAGRAPH || node.getNodeType() == NodeType.TABLE));
}

/**
 * Removes the content before or after the marker in the cloned node depending
 * on the type of marker.
 */
private static void processMarker(CompositeNode cloneNode, ArrayList<Node> nodes, Node node, boolean isInclusive,
        boolean isStartMarker, boolean isEndMarker) throws Exception {
    // If we are dealing with a block-level node, just see if it should be included
    // and add it to the list.
    if (!isInline(node)) {
        // Don't add the node twice if the markers are the same node.
        if (!(isStartMarker && isEndMarker)) {
            if (isInclusive)
                nodes.add(cloneNode);
        }
        return;
    }

    // If a marker is a FieldStart node, check if it's to be included or not.
    // We assume for simplicity that the FieldStart and FieldEnd appear in the same
    // paragraph.
    if (node.getNodeType() == NodeType.FIELD_START) {
        // If the marker is a start node and is not to be included, then skip to the end of
        // the field.
        // If the marker is an end node and it is to be included, then move to the end
        // field so the field will not be removed.
        if ((isStartMarker && !isInclusive) || (!isStartMarker && isInclusive)) {
            while (node.getNextSibling() != null && node.getNodeType() != NodeType.FIELD_END)
                node = node.getNextSibling();
        }
    }

    // If either marker is part of a comment, then to include the comment itself we
    // need to move the pointer forward to the Comment node found after the
    // CommentRangeEnd node.
    if (node.getNodeType() == NodeType.COMMENT_RANGE_END) {
        while (node.getNextSibling() != null && node.getNodeType() != NodeType.COMMENT)
            node = node.getNextSibling();
    }

    // Find the corresponding node in our cloned node by index and return it.
    // If the start and end node are the same, some child nodes might already have
    // been removed. Subtract the difference to get the right index.
    int indexDiff = node.getParentNode().getChildNodes().getCount() - cloneNode.getChildNodes().getCount();

    // Child node count identical.
    if (indexDiff == 0)
        node = cloneNode.getChildNodes().get(node.getParentNode().indexOf(node));
    else
        node = cloneNode.getChildNodes().get(node.getParentNode().indexOf(node) - indexDiff);

    // Remove the nodes up to/from the marker.
    boolean isSkip;
    boolean isProcessing = true;
    boolean isRemoving = isStartMarker;
    Node nextNode = cloneNode.getFirstChild();

    while (isProcessing && nextNode != null) {
        Node currentNode = nextNode;
        isSkip = false;

        if (currentNode.equals(node)) {
            if (isStartMarker) {
                isProcessing = false;
                if (isInclusive)
                    isRemoving = false;
            } else {
                isRemoving = true;
                if (isInclusive)
                    isSkip = true;
            }
        }

        nextNode = nextNode.getNextSibling();
        if (isRemoving && !isSkip)
            currentNode.remove();
    }

    // After processing, the composite node may have become empty. If it has, don't
    // include it.
    if (!(isStartMarker && isEndMarker)) {
        if (cloneNode.hasChildNodes())
            nodes.add(cloneNode);
    }
}

public static Document generateDocument(Document srcDoc, ArrayList<Node> nodes) throws Exception {
    // Create a blank document.
    Document dstDoc = new Document();
    // Remove the first paragraph from the empty document.
    dstDoc.getFirstSection().getBody().removeAllChildren();

    // Import each node from the list into the new document. Keep the original
    // formatting of the node.
    NodeImporter importer = new NodeImporter(srcDoc, dstDoc, ImportFormatMode.KEEP_SOURCE_FORMATTING);

    for (Node node : nodes) {
        Node importNode = importer.importNode(node, true);
        dstDoc.getFirstSection().getBody().appendChild(importNode);
    }

    // Return the generated document.
    return dstDoc;
}
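
The removal loop at the end of processMarker is the least obvious part of the helpers. As a rough sketch of what it keeps, the following models a cloned paragraph as a list of run texts and trims it around a marker position. TrimSketch is a hypothetical helper written for this illustration; it leaves out the field and comment handling of the real method.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class TrimSketch {
    // Returns the runs that survive trimming around the marker at markerIndex.
    // Start marker: everything before it is dropped. End marker: everything after
    // it is dropped. The marker itself survives only when isInclusive is true.
    static List<String> trim(List<String> runs, int markerIndex, boolean isStartMarker, boolean isInclusive) {
        List<String> kept = new ArrayList<>();
        for (int i = 0; i < runs.size(); i++) {
            boolean keep;
            if (isStartMarker)
                keep = isInclusive ? i >= markerIndex : i > markerIndex;
            else
                keep = isInclusive ? i <= markerIndex : i < markerIndex;
            if (keep)
                kept.add(runs.get(i));
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> runs = Arrays.asList("A", "B", "C", "D");
        System.out.println(trim(runs, 1, true, true));
        System.out.println(trim(runs, 1, false, false));
    }
}
```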

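We are now ready to use these methods to extract text from a Word document.

Extracting Text Between Paragraphs in a Word DOC in Java

Let's see how to extract the content between two paragraphs of a Word document. The steps in Java are as follows.

  • First, load the Word document using the Document class.
  • Get references to the start and end paragraphs into two objects using the Document.getFirstSection().getChild(NodeType.PARAGRAPH, int, boolean) method.
  • Call the extractContent(startPara, endPara, true) method to extract the nodes into an object.
  • Call the generateDocument(Document, extractedNodes) helper method to create a document containing the extracted content.
  • Finally, save the returned document using the Document.save(String) method.

The following code sample shows how to extract the text from the 7th through the 11th paragraph of a Word document in Java.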

// Load document
Document doc = new Document("TestFile.doc");

// Gather the nodes. The getChild method uses a 0-based index.
Paragraph startPara = (Paragraph) doc.getFirstSection().getChild(NodeType.PARAGRAPH, 6, true);
Paragraph endPara = (Paragraph) doc.getFirstSection().getChild(NodeType.PARAGRAPH, 10, true);
// Extract the content between these nodes in the document. Include these
// markers in the extraction.
ArrayList extractedNodes = extractContent(startPara, endPara, true);

// Insert the content into a new separate document and save it to disk.
Document dstDoc = generateDocument(doc, extractedNodes);
dstDoc.save("output.doc");

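Extracting Text from a DOC in Java - Between Different Types of Nodes

You can also extract the content between nodes of different types. To demonstrate, let's extract the content between a paragraph and a table and save the result to a Word document. The steps for extracting text between different nodes of a Word document in Java are as follows.

  • Load the Word document using the Document class.
  • Get references to the start node and the end node into two objects using the Document.getLastSection().getChild(NodeType, int, boolean) method.
  • Call the extractContent(startPara, endTable, true) method to extract the nodes into an object.
  • Insert the extracted nodes back into the document after the end table (the sample code does this instead of calling generateDocument).
  • Save the document using the Document.save(String) method.

The following code sample shows how to extract the text between a paragraph and a table in Java.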

// Requires: import java.util.Collections;
// Load the document
Document doc = new Document("TestFile.doc");

// Get references to the starting paragraph and the ending table
Paragraph startPara = (Paragraph) doc.getLastSection().getChild(NodeType.PARAGRAPH, 2, true);
Table endTable = (Table) doc.getLastSection().getChild(NodeType.TABLE, 0, true);

// Extract the content between these nodes in the document. Include these markers in the extraction.
ArrayList extractedNodes = extractContent(startPara, endTable, true);

// Reverse the list to make inserting the content back into the document easier.
Collections.reverse(extractedNodes);

while (extractedNodes.size() > 0) {
    // Insert the first node of the reversed list (the last extracted node) right after the table.
    endTable.getParentNode().insertAfter((Node) extractedNodes.get(0), endTable);
    // Remove this node from the list after insertion.
    extractedNodes.remove(0);
}

// Save the modified document to disk.
doc.save("output.doc");
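
The sample above reverses the list before inserting every node at the same position, directly after the table. The sketch below shows, on plain strings, why that keeps the extracted content in its original order. InsertAfterSketch is a hypothetical helper for this illustration and does not use Aspose.Words.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class InsertAfterSketch {
    // Inserts all extracted items directly after anchor, preserving their order.
    static List<String> insertAllAfter(List<String> blocks, String anchor, List<String> extracted) {
        List<String> result = new ArrayList<>(blocks);
        int anchorIndex = result.indexOf(anchor);
        if (anchorIndex < 0)
            throw new IllegalArgumentException("Anchor not found");

        // Work on a reversed copy so the caller's list is left untouched.
        List<String> reversed = new ArrayList<>(extracted);
        Collections.reverse(reversed);

        // Each insert lands right after the anchor and pushes earlier inserts back,
        // so the last item inserted (the first extracted) ends up first.
        for (String item : reversed)
            result.add(anchorIndex + 1, item);
        return result;
    }

    public static void main(String[] args) {
        List<String> blocks = Arrays.asList("P0", "TABLE", "P1");
        System.out.println(insertAllAfter(blocks, "TABLE", Arrays.asList("X", "Y", "Z")));
    }
}
```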

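Extracting Text from a DOCX in Java - Between Paragraphs Based on Styles

Now let's see how to extract the content between paragraphs based on their styles. To demonstrate, we will extract the content between the first "Heading 1" and the first "Heading 3" in a Word document. The following steps show how to do this in Java.

  • First, load the Word document using the Document class.
  • Then, collect the matching paragraphs into an object using the paragraphsByStyleName(Document, "Heading 1") helper method (the helper itself is not listed in this article).
  • Collect the paragraphs into another object using the paragraphsByStyleName(Document, "Heading 3") helper method.
  • Call the extractContent(startPara, endPara, false) method, passing the first element of each of the two paragraph lists as the first and second parameters.
  • Call the generateDocument(Document, extractedNodes) helper method to create a document containing the extracted content.
  • Finally, save the returned document using the Document.save(String) method.

The following code sample shows how to extract the content between paragraphs based on styles.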

// Load document
Document doc = new Document("TestFile.doc");

// Gather a list of the paragraphs using the respective heading styles.
ArrayList parasStyleHeading1 = paragraphsByStyleName(doc, "Heading 1");
ArrayList parasStyleHeading3 = paragraphsByStyleName(doc, "Heading 3");

// Use the first instance of the paragraphs with those styles.
Node startPara1 = (Node) parasStyleHeading1.get(0);
Node endPara1 = (Node) parasStyleHeading3.get(0);

// Extract the content between these nodes in the document. Don't include these markers in the extraction.
ArrayList extractedNodes = extractContent(startPara1, endPara1, false);

// Insert the content into a new separate document and save it to disk.
Document dstDoc = generateDocument(doc, extractedNodes);
dstDoc.save("output.doc");
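
The paragraphsByStyleName helper called in the sample above is not listed in this article. As a rough sketch of the idea, assuming it simply collects every paragraph whose style name matches, the filter could look like the following. StyledParagraph and StyleFilterSketch are hypothetical stand-ins; a real helper would read the paragraphs and their style names from the loaded document, and that wiring is not shown here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a paragraph with a named style; not an Aspose.Words class.
class StyledParagraph {
    final String styleName;
    final String text;

    StyledParagraph(String styleName, String text) {
        this.styleName = styleName;
        this.text = text;
    }
}

class StyleFilterSketch {
    // Returns, in document order, the paragraphs whose style name equals styleName.
    static List<StyledParagraph> paragraphsByStyleName(List<StyledParagraph> paragraphs, String styleName) {
        List<StyledParagraph> matches = new ArrayList<>();
        for (StyledParagraph paragraph : paragraphs) {
            if (paragraph.styleName.equals(styleName))
                matches.add(paragraph);
        }
        return matches;
    }

    public static void main(String[] args) {
        List<StyledParagraph> paragraphs = Arrays.asList(
                new StyledParagraph("Heading 1", "Introduction"),
                new StyledParagraph("Normal", "Some body text."),
                new StyledParagraph("Heading 3", "Details"),
                new StyledParagraph("Heading 1", "Summary"));
        // The first match plays the role of the start or end marker.
        System.out.println(paragraphsByStyleName(paragraphs, "Heading 1").get(0).text);
    }
}
```

If no paragraph uses the requested style, the returned list is empty, so calling get(0) on it as the sample does would fail; checking the list size first is a sensible precaution.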

That is how to extract text from Word documents in Java.