PDFBox: how to determine the bounding box of a vector figure (path shape)
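I have a simple PDF generated by Apache FOP + jEuclid. It contains vector graphics for math formulas, plus text.
Link to the PDF: https://www.dropbox.com/s/w4ksnud78bu9oz5/test.pdf?dl=0
I want to get the bounding box (x, y, width, height) of each vector figure. I tried this example: https://svn.apache.org/repos/asf/pdfbox/tags/2.0.24/examples/src/main/java/org/apache/pdfbox/examples/util/PrintImageLocations.java, but it prints no information except this:
    Processing page: 1
In Acrobat I can select the vector images in the tag tree, and they get highlighted.
My question: how can I determine the bounding boxes of the vector images via the PDFBox API?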
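As long as the figures in question are properly tagged (as they are in your example document), you can determine their bounding boxes based on the PDFBox PDFGraphicsStreamEngine.
In fact, you can build on an existing BoundingBoxFinder class (which is itself based on PDFGraphicsStreamEngine) that determines the bounding box of all content of a page; you merely need to record the bounding box information separately for each marked content sequence.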
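To illustrate the idea behind such a bounding box finder: every point that a path contributes is mapped through the current transformation matrix and merged into a running rectangle. The following is a minimal sketch of that accumulation using only java.awt.geom. The class BoxAccumulator and its methods are made up for illustration; they are not part of PDFBox and not the actual BoundingBoxFinder code.

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.Point2D;
import java.awt.geom.Rectangle2D;

/** Hypothetical helper: grows a rectangle around every point it is given. */
final class BoxAccumulator {
    private Rectangle2D box; // null until the first point arrives

    /** Maps (x, y) through the given matrix and merges the result into the box. */
    void addPoint(AffineTransform ctm, double x, double y) {
        Point2D p = ctm.transform(new Point2D.Double(x, y), null);
        if (box == null) {
            // First point: start with a zero-size rectangle at that point.
            box = new Rectangle2D.Double(p.getX(), p.getY(), 0, 0);
        } else {
            box.add(p);
        }
    }

    Rectangle2D getBox() {
        return box;
    }

    public static void main(String[] args) {
        BoxAccumulator acc = new BoxAccumulator();
        // Assumed matrix: the figure is shifted to (100, 700) on the page.
        AffineTransform ctm = AffineTransform.getTranslateInstance(100, 700);
        acc.addPoint(ctm, 0, 0);
        acc.addPoint(ctm, 40, 10);
        acc.addPoint(ctm, 20, -5);
        System.out.println(acc.getBox());
    }
}
```

A real engine would call something like addPoint from its path callbacks (moveTo, lineTo, curveTo and so on); for curves, merging the control points gives a box that may be slightly larger than the visible shape.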
The following class does this by storing the bounding box information in a hierarchy of MarkedContent objects:
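    import java.awt.geom.Rectangle2D;
    import java.io.IOException;
    import java.util.ArrayDeque;
    import java.util.ArrayList;
    import java.util.Deque;
    import java.util.List;

    import org.apache.pdfbox.cos.COSDictionary;
    import org.apache.pdfbox.cos.COSName;
    import org.apache.pdfbox.pdmodel.PDPage;

    // BoundingBoxFinder (with its "rectangle" field and add(...) method) must be available, too.
    public class MarkedContentBoundingBoxFinder extends BoundingBoxFinder {
        public MarkedContentBoundingBoxFinder(PDPage page) {
            super(page);
            // The artificial root entry collects everything on the page.
            contents.add(content);
        }

        @Override
        public void processPage(PDPage page) throws IOException {
            super.processPage(page);
            // Close the artificial root entry so it receives its bounding box.
            endMarkedContentSequence();
        }

        @Override
        public void beginMarkedContentSequence(COSName tag, COSDictionary properties) {
            MarkedContent current = contents.getLast();
            // Store what the enclosing sequence has accumulated so far ...
            if (rectangle != null) {
                if (current.boundingBox != null)
                    add(current.boundingBox);
                current.boundingBox = rectangle;
            }
            // ... and start the new sequence without a rectangle.
            rectangle = null;
            MarkedContent newContent = new MarkedContent(tag, properties);
            contents.addLast(newContent);
            current.children.add(newContent);

            super.beginMarkedContentSequence(tag, properties);
        }

        @Override
        public void endMarkedContentSequence() {
            MarkedContent current = contents.removeLast();
            if (rectangle != null) {
                // Merge in anything stored earlier and record the result.
                if (current.boundingBox != null)
                    add(current.boundingBox);
                current.boundingBox = (Rectangle2D) rectangle.clone();
            } else if (current.boundingBox != null)
                // Nothing new was drawn; restore the stored box.
                rectangle = (Rectangle2D) current.boundingBox.clone();
            // The rectangle stays set, so the enclosing sequence picks it up.

            super.endMarkedContentSequence();
        }

        public static class MarkedContent {
            public MarkedContent(COSName tag, COSDictionary properties) {
                this.tag = tag;
                this.properties = properties;
            }

            public final COSName tag;
            public final COSDictionary properties;
            public final List<MarkedContent> children = new ArrayList<>();
            public Rectangle2D boundingBox = null;
        }

        // Root of the hierarchy and the stack of currently open sequences.
        public final MarkedContent content = new MarkedContent(COSName.DOCUMENT, null);
        public final Deque<MarkedContent> contents = new ArrayDeque<>();
    }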
(MarkedContentBoundingBoxFinder utility class)
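To see how the begin/end bookkeeping of the class above behaves, here is a PDFBox-independent sketch that mirrors the same stack logic and drives it by hand. Everything in it is a stand-in: MarkedContentStackDemo, Node and draw are hypothetical names, and it assumes that add(...) in BoundingBoxFinder merges its argument into the running rectangle.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Stand-in for the marked content tracking, driven by hand instead of by a PDF. */
final class MarkedContentStackDemo {
    static final class Node {
        final String tag;
        final List<Node> children = new ArrayList<>();
        Rectangle2D box; // null while nothing has been drawn inside

        Node(String tag) {
            this.tag = tag;
        }
    }

    final Node root = new Node("Document");
    private final Deque<Node> open = new ArrayDeque<>();
    private Rectangle2D running; // box of content seen since the last begin

    MarkedContentStackDemo() {
        open.addLast(root);
    }

    /** Simulates drawing something that covers the given rectangle. */
    void draw(double x, double y, double w, double h) {
        Rectangle2D r = new Rectangle2D.Double(x, y, w, h);
        if (running == null) running = r;
        else running.add(r);
    }

    void begin(String tag) {
        Node current = open.getLast();
        if (running != null) { // park what the parent has accumulated so far
            if (current.box != null) running.add(current.box);
            current.box = running;
        }
        running = null;
        Node child = new Node(tag);
        current.children.add(child);
        open.addLast(child);
    }

    void end() {
        Node current = open.removeLast();
        if (running != null) {
            if (current.box != null) running.add(current.box);
            current.box = (Rectangle2D) running.clone();
        } else if (current.box != null) {
            running = (Rectangle2D) current.box.clone();
        }
        // running stays set, so the parent's box will include this one
    }

    public static void main(String[] args) {
        MarkedContentStackDemo demo = new MarkedContentStackDemo();
        demo.begin("Figure"); demo.draw(10, 10, 20, 5); demo.end();
        demo.begin("P");      demo.draw(40, 11, 30, 3); demo.end();
        demo.end(); // closes the artificial root, as processPage does
        System.out.println(demo.root.box);
    }
}
```

The point of the design is that a child's box is handed back to its parent when the child ends, so each parent box is the union of its own content and all of its descendants.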
You can apply it to a PDPage pdPage like this:
    MarkedContentBoundingBoxFinder boxFinder = new MarkedContentBoundingBoxFinder(pdPage);
    boxFinder.processPage(pdPage);
    MarkedContent markedContent = boxFinder.content;
(excerpt from the DetermineBoundingBox helper method drawMarkedContentBoundingBoxes)
You can output the bounding boxes from the markedContent object like this:
    void printMarkedContentBoundingBoxes(MarkedContent markedContent, String prefix) {
        StringBuilder builder = new StringBuilder();
        builder.append(prefix).append(markedContent.tag.getName());
        builder.append(' ').append(markedContent.boundingBox);
        System.out.println(builder.toString());
        for (MarkedContent child : markedContent.children)
            printMarkedContentBoundingBoxes(child, prefix + " ");
    }
(DetermineBoundingBox helper method)
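Since the question asks only about the figures, the same recursive walk can filter by tag instead of printing everything. The sketch below uses a hypothetical stand-in Node type with a plain String tag so that it runs without PDFBox; with the real MarkedContent class you would presumably compare markedContent.tag.getName() against "Figure" instead. The rectangle values are illustrative only.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for the MarkedContent tree, using a plain String tag. */
final class FigureBoxCollector {
    static final class Node {
        final String tag;
        final Rectangle2D box;
        final List<Node> children = new ArrayList<>();

        Node(String tag, Rectangle2D box) {
            this.tag = tag;
            this.box = box;
        }
    }

    /** Depth-first walk that gathers the boxes of all nodes with the wanted tag. */
    static List<Rectangle2D> collect(Node node, String wantedTag) {
        List<Rectangle2D> result = new ArrayList<>();
        collectInto(node, wantedTag, result);
        return result;
    }

    private static void collectInto(Node node, String wantedTag, List<Rectangle2D> result) {
        // Skip nodes without a box, i.e. sequences in which nothing was drawn.
        if (wantedTag.equals(node.tag) && node.box != null) {
            result.add(node.box);
        }
        for (Node child : node.children) {
            collectInto(child, wantedTag, result);
        }
    }

    public static void main(String[] args) {
        Node root = new Node("Document", new Rectangle2D.Double(90, 758, 129, 10));
        root.children.add(new Node("Figure", new Rectangle2D.Double(90, 758, 45, 10)));
        root.children.add(new Node("P", new Rectangle2D.Double(137, 760, 43, 6)));
        root.children.add(new Node("Figure", new Rectangle2D.Double(184, 758, 35, 10)));
        for (Rectangle2D box : collect(root, "Figure")) {
            System.out.println(box);
        }
    }
}
```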
For your example document you get:
    Document java.awt.geom.Rectangle2D$Double[x=90.35800170898438,y=758.10498046875,w=128.63946533203125,h=10.2509765625]
     Figure java.awt.geom.Rectangle2D$Double[x=90.35800170898438,y=758.10498046875,w=44.6771240234375,h=10.2509765625]
     P java.awt.geom.Rectangle2D$Double[x=136.79600524902344,y=760.1184081963065,w=43.137100359018405,h=6.383056943803922]
     Figure java.awt.geom.Rectangle2D$Double[x=184.2926788330078,y=758.10498046875,w=34.70478820800781,h=10.2509765625]
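These rectangles appear to be in PDF coordinates, where y grows upward from the bottom of the page. If you need the (x, y, width, height) form with y measured from the top edge, as most image and UI toolkits expect, a conversion along the following lines may help. It is a sketch under assumptions: the page is not rotated, the lower-left corner of the visible page is at the origin, and the page height is passed in by the caller (with PDFBox it could be taken from pdPage.getCropBox().getHeight()). TopLeftBox and toTopLeftOrigin are hypothetical names.

```java
import java.awt.geom.Rectangle2D;

/** Hypothetical helper for switching between bottom-left and top-left origins. */
final class TopLeftBox {
    /** Flips a box from y-grows-upward to y-grows-downward coordinates. */
    static Rectangle2D toTopLeftOrigin(Rectangle2D pdfBox, double pageHeight) {
        // The upper edge of the box in PDF space becomes its y in top-left space.
        double top = pageHeight - pdfBox.getMaxY();
        return new Rectangle2D.Double(pdfBox.getX(), top, pdfBox.getWidth(), pdfBox.getHeight());
    }

    public static void main(String[] args) {
        // Illustrative values only; the page height of 800 is an assumption.
        Rectangle2D pdfBox = new Rectangle2D.Double(10, 700, 50, 20);
        System.out.println(toTopLeftOrigin(pdfBox, 800));
    }
}
```

Width and height are unchanged by the flip; only y is recomputed from the upper edge of the box.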
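Similarly, you can draw the bounding boxes onto the PDF using the drawMarkedContentBoundingBoxes method of DetermineBoundingBox; applied to your example document, it outlines the boxes listed above.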