Extract Word document comments and the text they comment on
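I need to extract the comments from a Word document together with the text they comment on. Below is my current solution, but it does not work as expected.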
import static com.aspose.words.NodeType.COMMENT;
import static com.aspose.words.NodeType.PARAGRAPH;

import com.aspose.words.Comment;
import com.aspose.words.Document;
import com.aspose.words.NodeCollection;
import com.aspose.words.Paragraph;
import com.aspose.words.Run;
import java.util.ArrayList;
import java.util.List;

public class Main {
    public static void main(String[] args) throws Exception {
        var document = new Document("sample.docx");
        NodeCollection<Paragraph> paragraphs = document.getChildNodes(PARAGRAPH, true);
        List<MyComment> myComments = new ArrayList<>();
        for (Paragraph paragraph : paragraphs) {
            var comments = getComments(paragraph);
            int commentIndex = 0;
            if (comments.isEmpty()) continue;
            for (Run run : paragraph.getRuns()) {
                var runText = run.getText();
                for (int i = commentIndex; i < comments.size(); i++) {
                    Comment comment = comments.get(i);
                    String commentText = comment.getText();
                    // Matches a comment only to the single run directly before it.
                    if (paragraph.getText().contains(runText + commentText)) {
                        myComments.add(new MyComment(runText, commentText));
                        commentIndex++;
                        break;
                    }
                }
            }
        }
        myComments.forEach(System.out::println);
    }

    private static List<Comment> getComments(Paragraph paragraph) {
        @SuppressWarnings("unchecked")
        NodeCollection<Comment> comments = paragraph.getChildNodes(COMMENT, false);
        List<Comment> commentList = new ArrayList<>();
        comments.forEach(commentList::add);
        return commentList;
    }

    static class MyComment {
        String text;
        String commentText;

        public MyComment(String text, String commentText) {
            this.text = text;
            this.commentText = commentText;
        }

        @Override
        public String toString() {
            return text + "-->" + commentText;
        }
    }
}
The content of sample.docx is:
The output (incorrect) is:
factors-->This is word comment
%–10% of cancers are caused by inherited genetic defects from a person's parents.-->Second paragraph comment
The expected output is:
factors-->This is word comment
These factors act, at least partly, by changing the genes of a cell. Typically, many genetic changes are required before cancer develops. Approximately 5%–10% of cancers are caused by inherited genetic defects from a person's parents.-->Second paragraph comment
These factors act, at least partly, by changing the genes of a cell. Typically, many genetic changes are required before cancer develops. Approximately 5%–10% of cancers are caused by inherited genetic defects from a person's parents.-->First paragraph comment
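Please help me find a better way to extract Word document comments and the text they comment on. If you need more details, let me know and I will provide everything required.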
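The commented text is marked by special nodes, CommentRangeStart and CommentRangeEnd. Both nodes have an Id that corresponds to the Id of the Comment the range is linked to. So you need to extract the content between the matching start and end nodes.
By the way, the code example in the Aspose.Words API reference shows how to use a document visitor to print the contents of all comments and their comment ranges. It looks like exactly what you are looking for.
EDIT: You can use code like the following to accomplish your task. I did not include the full code for extracting content between nodes; it is available on GitHub.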
Document doc = new Document("C:\\Temp\\in.docx");
// Get the comments and the comment range markers in the document.
Iterable<Comment> comments = doc.getChildNodes(NodeType.COMMENT, true);
Iterable<CommentRangeStart> commentRangeStarts = doc.getChildNodes(NodeType.COMMENT_RANGE_START, true);
Iterable<CommentRangeEnd> commentRangeEnds = doc.getChildNodes(NodeType.COMMENT_RANGE_END, true);
for (Comment c : comments)
{
    System.out.println(String.format("Comment %d : %s", c.getId(), c.toString(SaveFormat.TEXT)));
    CommentRangeStart start = null;
    CommentRangeEnd end = null;
    // Search for the start and end markers whose Id matches this comment.
    for (CommentRangeStart s : commentRangeStarts)
    {
        if (c.getId() == s.getId())
        {
            start = s;
            break;
        }
    }
    for (CommentRangeEnd e : commentRangeEnds)
    {
        if (c.getId() == e.getId())
        {
            end = e;
            break;
        }
    }
    if (start != null && end != null)
    {
        // Extract content between the start and end nodes.
        // A code example showing how to extract content between nodes is here:
        // https://github.com/aspose-words/Aspose.Words-for-Java/blob/master/Examples/src/main/java/com/aspose/words/examples/programming_documents/document/ExtractContentBetweenCommentRange.java
    }
    else
    {
        System.out.println(String.format("Comment %d does not have a comment range", c.getId()));
    }
}
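To illustrate the Id pairing described above without depending on the Aspose.Words library, here is a minimal, self-contained sketch. It models a document as a flat list of nodes. The `Node` class, its `Kind` enum, and `extractCommentedText` are hypothetical stand-ins invented for this example and are not part of the Aspose.Words API. In a real document the same idea would apply to the CommentRangeStart, CommentRangeEnd, and Run nodes met while walking the document. The sketch also shows why two comments can map to the same text, as in the expected output from the question: their ranges simply overlap.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class CommentRangeSketch {
    // Hypothetical node kinds; these only mimic comment range markers and text runs.
    enum Kind { RANGE_START, RANGE_END, TEXT }

    // Hypothetical stand-in for a document node (NOT an Aspose.Words class).
    static final class Node {
        final Kind kind;
        final int id;      // comment id for range markers, unused for text
        final String text; // text content for TEXT nodes, empty otherwise

        Node(Kind kind, int id, String text) {
            this.kind = kind;
            this.id = id;
            this.text = text;
        }

        static Node start(int id) { return new Node(Kind.RANGE_START, id, ""); }
        static Node end(int id) { return new Node(Kind.RANGE_END, id, ""); }
        static Node text(String text) { return new Node(Kind.TEXT, -1, text); }
    }

    // Maps each comment id to the text found between its start and end markers.
    static Map<Integer, String> extractCommentedText(List<Node> nodes) {
        Map<Integer, StringBuilder> open = new LinkedHashMap<>();
        Map<Integer, String> result = new LinkedHashMap<>();
        for (Node node : nodes) {
            switch (node.kind) {
                case RANGE_START:
                    // A range opens: start collecting text for this id.
                    open.put(node.id, new StringBuilder());
                    break;
                case TEXT:
                    // Text belongs to every range that is currently open.
                    for (StringBuilder sb : open.values()) {
                        sb.append(node.text);
                    }
                    break;
                case RANGE_END:
                    // A range closes: store what was collected for this id.
                    StringBuilder sb = open.remove(node.id);
                    if (sb != null) {
                        result.put(node.id, sb.toString());
                    }
                    break;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Node> nodes = List.of(
            Node.text("Some "),
            Node.start(0), Node.text("factors"), Node.end(0),
            Node.text(" matter. "),
            Node.start(1), Node.start(2),
            Node.text("These factors act."),
            Node.end(1), Node.end(2));
        extractCommentedText(nodes)
            .forEach((id, text) -> System.out.println(id + " --> " + text));
    }
}
```

Collecting text for all open ranges in a single pass avoids searching for the start and end marker of each comment separately, and it handles overlapping ranges naturally. A marker whose partner is missing simply produces no entry.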