Java: Apache PDFBox: extract highlighted text
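I am using the Apache PDFBox library to extract highlighted text (i.e. text with a yellow background) from a PDF file. I am new to this library and don't know which of its classes to use for this purpose.
So far I have extracted the text from comments using the following code.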
PDDocument pddDocument = PDDocument.load(new File("test.pdf"));
List allPages = pddDocument.getDocumentCatalog().getAllPages();
for (int i = 0; i < allPages.size(); i++) {
    int pageNum = i + 1;
    PDPage page = (PDPage) allPages.get(i);
    List<PDAnnotation> la = page.getAnnotations();
    if (la.size() < 1) {
        continue;
    }
    System.out.println("Total annotations = " + la.size());
    System.out.println("\nProcess Page " + pageNum + "...");
    // Just get the first annotation for testing
    PDAnnotation pdfAnnot = la.get(0);
    System.out.println("Getting text from comment = " + pdfAnnot.getContents());
}
pddDocument.close();
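Now I need to get the highlighted text. Any code example would be much appreciated.
The code in the question Not able to read the exact text highlighted across the lines already illustrates most of the concepts needed to extract text from a limited content area on a page using PDFBox.
Having studied that code, the OP was still unsure about one point and asked in a comment: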
But one thing I am confused about is QuadPoints instead of Rect. as you mentioned there in comment. What are this, can you explain it with some code lines or in simple words, as I am also facing the same problem of multi lines highlghts?
Normally, the area an annotation refers to is a rectangle:
Rect rectangle (Required) The annotation rectangle, defining the location of the annotation on the page in default user space units.
(from Table 164 – Entries common to all annotation dictionaries - in ISO 32000-1)
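For some annotation types (e.g. text markup) this location value does not suffice because:
- the marked text may be written at some odd angle, whereas the rectangle type mentioned in the specification refers to rectangles whose edges are parallel to the page edges; and
- the marked text may start anywhere in one line and end anywhere in another, so the marked area is not a rectangle at all but the union of multiple rectangular parts.
Therefore, to deal with such annotation types, the PDF specification provides a more general way to define areas: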
QuadPoints array (Required) An array of 8 × n numbers specifying the coordinates of n quadrilaterals in default user space. Each quadrilateral shall encompasses a word or group of contiguous words in the text underlying the annotation. The coordinates for each quadrilateral shall be given in the order
x1 y1 x2 y2 x3 y3 x4 y4
specifying the quadrilateral’s four vertices in counterclockwise order (see Figure 64). The text shall be oriented with respect to the edge connecting points (x1, y1) and (x2, y2).
(from Table 179 – Additional entries specific to text markup annotations - in ISO 32000-1)
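To make the "8 × n numbers" layout concrete, here is a minimal sketch that splits a flat coordinate array into one 8-element array per quadrilateral. It deliberately works on a plain float[] rather than on PDFBox types; the class name QuadPointsSplitter and the method split are made up for this illustration and are not part of PDFBox.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not a PDFBox class.
final class QuadPointsSplitter {

    /**
     * Splits a flat QuadPoints-style array (x1 y1 x2 y2 x3 y3 x4 y4, repeated)
     * into one float[8] per quadrilateral.
     */
    static List<float[]> split(float[] quadPoints) {
        if (quadPoints.length % 8 != 0) {
            throw new IllegalArgumentException(
                    "QuadPoints length must be a multiple of 8, got " + quadPoints.length);
        }
        List<float[]> quads = new ArrayList<>();
        for (int offset = 0; offset < quadPoints.length; offset += 8) {
            float[] quad = new float[8];
            // Copy the 8 coordinates belonging to this quadrilateral.
            System.arraycopy(quadPoints, offset, quad, 0, 8);
            quads.add(quad);
        }
        return quads;
    }
}
```

A highlight spanning two lines of text would typically carry 16 numbers, which this sketch turns into two quadrilaterals, one per line.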
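Thus, instead of the rectangle given by
PDRectangle rect = pdfAnnot.getRectangle();
in the code in the referenced question, you have to consider the quadrilaterals given by
COSArray quadsArray = (COSArray) pdfAnnot.getDictionary().getDictionaryObject(COSName.getPDFName("QuadPoints"));
and define the regions for the PDFTextStripperByArea stripper accordingly. Unfortunately, PDFTextStripperByArea.addRegion expects a rectangle as its parameter, not a general quadrilateral. Since text is usually printed horizontally or vertically, this should not pose much of a problem.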
PS: One warning concerning the QuadPoints specification: the vertex order in real-life PDFs may differ from it, cf. the question PDF Spec vs Acrobat creation (QuadPoints).
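One way to sidestep the vertex-order question, sketched here under stated assumptions, is to take the axis-aligned bounding box of the four vertices with min/max instead of relying on which index holds which corner. The sketch also flips the y axis, assuming the region is wanted in a coordinate system whose origin is the top-left corner of the page and that the page's lower-left corner sits at (0, 0). QuadBounds and toRegion are hypothetical names, not PDFBox API.

```java
import java.awt.geom.Rectangle2D;

// Hypothetical helper, not a PDFBox class.
final class QuadBounds {

    /**
     * Returns the axis-aligned bounding box of one quadrilateral
     * (x1 y1 x2 y2 x3 y3 x4 y4 in PDF user space, any vertex order),
     * expressed with a top-left origin and y growing downwards.
     */
    static Rectangle2D.Float toRegion(float[] quad, float pageHeight) {
        if (quad.length != 8) {
            throw new IllegalArgumentException("expected 8 coordinates, got " + quad.length);
        }
        float minX = quad[0], maxX = quad[0];
        float minY = quad[1], maxY = quad[1];
        for (int i = 2; i < 8; i += 2) {
            minX = Math.min(minX, quad[i]);
            maxX = Math.max(maxX, quad[i]);
            minY = Math.min(minY, quad[i + 1]);
            maxY = Math.max(maxY, quad[i + 1]);
        }
        // The top edge in PDF space (maxY) becomes the distance from the page top.
        return new Rectangle2D.Float(minX, pageHeight - maxY, maxX - minX, maxY - minY);
    }
}
```

The resulting rectangle could then be handed to PDFTextStripperByArea.addRegion. For rotated text the bounding box covers more than the highlight itself, so this is only a reasonable approximation for horizontal or vertical text.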
I hope this answer helps everyone facing the same problem.
// PDF32000-2008
// 12.5.2 Annotation Dictionaries
// 12.5.6 Annotation Types
// 12.5.6.10 Text Markup Annotations
// Written against the PDFBox 1.8.x API (getAllPages, org.apache.pdfbox.util).
import java.awt.geom.Rectangle2D;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.pdfbox.cos.COSArray;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.cos.COSNumber;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.common.PDRectangle;
import org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotation;
import org.apache.pdfbox.util.PDFTextStripperByArea;

@SuppressWarnings("unchecked")
public ArrayList<String> getHighlightedText(String filePath, int pageNumber) throws IOException {
    ArrayList<String> highlightedTexts = new ArrayList<>();
    // In-memory representation of the PDF document, loaded from a file.
    PDDocument document = PDDocument.load(filePath);
    try {
        // All pages of the document; pageNumber is a zero-based index into this list.
        List<PDPage> allPages = document.getDocumentCatalog().getAllPages();
        PDPage page = allPages.get(pageNumber);
        PDRectangle pageSize = page.getMediaBox();
        // Walk the annotation dictionaries of the page.
        for (PDAnnotation annotation : page.getAnnotations()) {
            // Only highlight annotations are of interest.
            if (!"Highlight".equals(annotation.getSubtype())) {
                continue;
            }
            COSArray quadsArray = (COSArray) annotation.getDictionary().getDictionaryObject(COSName.getPDFName("QuadPoints"));
            if (quadsArray == null) {
                continue;
            }
            PDFTextStripperByArea stripperByArea = new PDFTextStripperByArea();
            StringBuilder text = new StringBuilder();
            // Each quadrilateral occupies 8 consecutive numbers.
            for (int k = 0; k + 7 < quadsArray.size(); k += 8) {
                // This code assumes the vertex order upper left, upper right, lower left, lower right.
                // COSNumber covers both COSFloat and COSInteger entries.
                float ulx = ((COSNumber) quadsArray.get(k)).floatValue() - 1; // upper left x
                float uly = ((COSNumber) quadsArray.get(k + 1)).floatValue(); // upper left y
                float urx = ((COSNumber) quadsArray.get(k + 2)).floatValue(); // upper right x
                float ury = ((COSNumber) quadsArray.get(k + 3)).floatValue(); // upper right y
                float llx = ((COSNumber) quadsArray.get(k + 4)).floatValue(); // lower left x
                float lly = ((COSNumber) quadsArray.get(k + 5)).floatValue(); // lower left y
                float width = urx - llx;  // upperRightX - lowerLeftX
                float height = ury - lly; // upperRightY - lowerLeftY
                // PDF y grows upwards; the region rectangle is measured from the top of the page.
                uly = pageSize.getHeight() - uly;
                Rectangle2D.Float region = new Rectangle2D.Float(ulx, uly, width, height);
                stripperByArea.addRegion("highlightedRegion", region);
                stripperByArea.extractRegions(page);
                text.append(stripperByArea.getTextForRegion("highlightedRegion"));
            }
            highlightedTexts.add(text.toString());
        }
    } finally {
        document.close();
    }
    return highlightedTexts;
}