PDFBox: how to determine bounding box of vector figure (path shape)

I have a simple PDF generated by Apache FOP + jEuclid. This PDF contains text and math formulas rendered as vector graphics:

Link to the PDF: https://www.dropbox.com/s/w4ksnud78bu9oz5/test.pdf?dl=0

I want to know the bounding box (x, y, width, height) of each vector figure. I tried this example: https://svn.apache.org/repos/asf/pdfbox/tags/2.0.24/examples/src/main/java/org/apache/pdfbox/examples/util/PrintImageLocations.java, but it printed no information except this:

Processing page: 1

In Acrobat, I can select the vector images in the tag tree and they are highlighted:

My question: how can I determine the bounding boxes of the vector images via the PDFBox API?

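As long as the figures in question are properly tagged (as they are in your example document), you can determine their bounding boxes based on the PDFBox PDFGraphicsStreamEngine.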

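Actually, you can make use of the BoundingBoxFinder class (which is based on PDFGraphicsStreamEngine) that determines the bounding box of all content of a page; you merely have to capture the bounding box information marked content sequence by marked content sequence.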

The following class does this by storing the bounding box information in a hierarchy of MarkedContent objects:

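import java.awt.geom.Rectangle2D;
import java.io.IOException;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

import org.apache.pdfbox.cos.COSDictionary;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdmodel.PDPage;

// BoundingBoxFinder is the PDFGraphicsStreamEngine-based helper mentioned above;
// import it from wherever it lives in your project.
public class MarkedContentBoundingBoxFinder extends BoundingBoxFinder {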
    public MarkedContentBoundingBoxFinder(PDPage page) {
        super(page);
        contents.add(content);
    }

    @Override
    public void processPage(PDPage page) throws IOException {
        super.processPage(page);
        endMarkedContentSequence();
    }

    @Override
    public void beginMarkedContentSequence(COSName tag, COSDictionary properties) {
        // Content seen so far belongs to the enclosing sequence: merge it
        // with any box already stored there before starting the new one.
        MarkedContent current = contents.getLast();
        if (rectangle != null) {
            if (current.boundingBox != null)
                add(current.boundingBox);
            current.boundingBox = rectangle;
        }
        // Start the nested sequence with an empty accumulator.
        rectangle = null;
        MarkedContent newContent = new MarkedContent(tag, properties);
        contents.addLast(newContent);
        current.children.add(newContent);

        super.beginMarkedContentSequence(tag, properties);
    }

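    @Override
    public void endMarkedContentSequence() {
        MarkedContent current = contents.removeLast();
        if (rectangle != null) {
            // Content was seen since the last begin: merge it with the box
            // stored earlier for this sequence and keep a copy.
            if (current.boundingBox != null)
                add(current.boundingBox);
            current.boundingBox = (Rectangle2D) rectangle.clone();
        } else if (current.boundingBox != null)
            // Nothing new: hand the stored box back so that the enclosing
            // sequence includes it.
            rectangle = (Rectangle2D) current.boundingBox.clone();

        super.endMarkedContentSequence();
    }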

    public static class MarkedContent {
        public MarkedContent(COSName tag, COSDictionary properties) {
            this.tag = tag;
            this.properties = properties;
        }

        public final COSName tag;
        public final COSDictionary properties;
        public final List<MarkedContent> children = new ArrayList<>();
        public Rectangle2D boundingBox = null;
    }

    public final MarkedContent content = new MarkedContent(COSName.DOCUMENT, null);
    public final Deque<MarkedContent> contents = new ArrayDeque<>();
}

(MarkedContentBoundingBoxFinder utility class)
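
To see how the rectangle bookkeeping in the class above plays out without a PDF at hand, here is a minimal, self-contained sketch that mimics the same begin/end logic using only java.awt.geom types. The names Node, BoxTracker, begin, draw and end are hypothetical stand-ins, not PDFBox or BoundingBoxFinder API; draw merely plays the role of the base class enlarging its rectangle whenever it meets page content.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical stand-in for the MarkedContent hierarchy; not PDFBox API.
class Node {
    final String tag;
    final List<Node> children = new ArrayList<>();
    Rectangle2D box = null;

    Node(String tag) { this.tag = tag; }
}

// Mimics the begin/end bookkeeping of MarkedContentBoundingBoxFinder.
class BoxTracker {
    final Node root = new Node("Document");
    private final Deque<Node> stack = new ArrayDeque<>();
    private Rectangle2D rectangle = null; // content seen since the last begin

    BoxTracker() { stack.add(root); }

    // Assumed stand-in for the base class growing its rectangle per drawn item.
    void draw(double x, double y, double w, double h) {
        add(new Rectangle2D.Double(x, y, w, h));
    }

    private void add(Rectangle2D r) {
        if (rectangle == null) rectangle = (Rectangle2D) r.clone();
        else rectangle.add(r); // union of both rectangles
    }

    void begin(String tag) {
        Node current = stack.getLast();
        if (rectangle != null) {
            if (current.box != null) add(current.box);
            current.box = rectangle;
        }
        rectangle = null;
        Node child = new Node(tag);
        stack.addLast(child);
        current.children.add(child);
    }

    void end() {
        Node current = stack.removeLast();
        if (rectangle != null) {
            if (current.box != null) add(current.box);
            current.box = (Rectangle2D) rectangle.clone();
        } else if (current.box != null) {
            rectangle = (Rectangle2D) current.box.clone();
        }
    }

    public static void main(String[] args) {
        BoxTracker t = new BoxTracker();
        t.begin("Figure"); t.draw(10, 10, 20, 5); t.end();
        t.begin("P");      t.draw(40, 12, 30, 4); t.end();
        t.end(); // close the root, as processPage does in the class above
        System.out.println(t.root.tag + " " + t.root.box);
        for (Node child : t.root.children)
            System.out.println("  " + child.tag + " " + child.box);
    }
}
```

In this sketch each child keeps its own box, while the root ends up with the union of both, because the rectangle left over after a sequence ends keeps accumulating for the enclosing sequence.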

You can apply it to a PDPage pdPage like this:

MarkedContentBoundingBoxFinder boxFinder = new MarkedContentBoundingBoxFinder(pdPage);
boxFinder.processPage(pdPage);
MarkedContent markedContent = boxFinder.content;

(excerpt from the DetermineBoundingBox helper method drawMarkedContentBoundingBoxes)

You can output the bounding boxes from that markedContent object like this:

void printMarkedContentBoundingBoxes(MarkedContent markedContent, String prefix) {
    StringBuilder builder = new StringBuilder();
    builder.append(prefix).append(markedContent.tag.getName());
    builder.append(' ').append(markedContent.boundingBox);
    System.out.println(builder.toString());
    for (MarkedContent child : markedContent.children)
        printMarkedContentBoundingBoxes(child, prefix + "  ");
}

(DetermineBoundingBox helper method)
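
Since the question asks only for the figures, the same recursive walk can collect boxes by tag instead of printing everything. The following is a rough sketch under assumptions: TaggedBox is a hypothetical stand-in for the MarkedContent class above that uses a plain String tag so the example runs without PDFBox, and FigureBoxes.collect is an illustrative name. With the real class you would presumably compare markedContent.tag.getName() against "Figure" instead.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for MarkedContent, with a String tag instead of a COSName.
class TaggedBox {
    final String tag;
    final Rectangle2D boundingBox;
    final List<TaggedBox> children = new ArrayList<>();

    TaggedBox(String tag, Rectangle2D boundingBox) {
        this.tag = tag;
        this.boundingBox = boundingBox;
    }
}

class FigureBoxes {
    // Depth-first walk collecting the boxes of all nodes carrying the given tag.
    static List<Rectangle2D> collect(TaggedBox node, String tag) {
        List<Rectangle2D> result = new ArrayList<>();
        collect(node, tag, result);
        return result;
    }

    private static void collect(TaggedBox node, String tag, List<Rectangle2D> result) {
        // A sequence without any drawn content has no box; skip it.
        if (tag.equals(node.tag) && node.boundingBox != null)
            result.add(node.boundingBox);
        for (TaggedBox child : node.children)
            collect(child, tag, result);
    }

    public static void main(String[] args) {
        // Made-up sample hierarchy, not taken from the example document.
        TaggedBox root = new TaggedBox("Document", new Rectangle2D.Double(0, 0, 100, 20));
        root.children.add(new TaggedBox("Figure", new Rectangle2D.Double(0, 0, 40, 20)));
        root.children.add(new TaggedBox("P", new Rectangle2D.Double(45, 5, 20, 10)));
        root.children.add(new TaggedBox("Figure", new Rectangle2D.Double(70, 0, 30, 20)));
        for (Rectangle2D box : collect(root, "Figure"))
            System.out.printf("x=%.1f y=%.1f width=%.1f height=%.1f%n",
                    box.getX(), box.getY(), box.getWidth(), box.getHeight());
    }
}
```

The null check matters because a marked content sequence that draws nothing never gets a bounding box assigned.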

For your example document you get

Document java.awt.geom.Rectangle2D$Double[x=90.35800170898438,y=758.10498046875,w=128.63946533203125,h=10.2509765625]
  Figure java.awt.geom.Rectangle2D$Double[x=90.35800170898438,y=758.10498046875,w=44.6771240234375,h=10.2509765625]
  P java.awt.geom.Rectangle2D$Double[x=136.79600524902344,y=760.1184081963065,w=43.137100359018405,h=6.383056943803922]
  Figure java.awt.geom.Rectangle2D$Double[x=184.2926788330078,y=758.10498046875,w=34.70478820800781,h=10.2509765625]
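
If you need these boxes in a top-left-origin system (as many image and UI APIs use), a small conversion may help. This is only a sketch, assuming the rectangles above are in PDF default user space with the origin at the bottom left and the y axis pointing up. TopLeftBox and toTopLeft are hypothetical names, and the page height of 841.89 (A4) is an assumption; with PDFBox you would read the real value from the page, for example via pdPage.getMediaBox().getHeight().

```java
import java.awt.geom.Rectangle2D;

class TopLeftBox {
    // Converts a box from a bottom-left-origin system (y axis pointing up)
    // into a top-left-origin system (y axis pointing down).
    static Rectangle2D toTopLeft(Rectangle2D box, double pageHeight) {
        double top = pageHeight - (box.getY() + box.getHeight());
        return new Rectangle2D.Double(box.getX(), top, box.getWidth(), box.getHeight());
    }

    public static void main(String[] args) {
        double pageHeight = 841.89; // assumption: A4 page height in points
        // Rounded values of the first Figure box from the output above.
        Rectangle2D figure = new Rectangle2D.Double(90.358, 758.105, 44.677, 10.251);
        Rectangle2D flipped = toTopLeft(figure, pageHeight);
        System.out.printf("x=%.3f y=%.3f width=%.3f height=%.3f%n",
                flipped.getX(), flipped.getY(), flipped.getWidth(), flipped.getHeight());
    }
}
```

Width, height and x are unchanged by the conversion; only y is measured from the opposite page edge.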

Similarly, you can draw the bounding boxes onto the PDF using the DetermineBoundingBox method drawMarkedContentBoundingBoxes. For your example document you get: