如何使用 PDFBox 确定实际 PDF 内容的位置?

How do determine location of actual PDF content with PDFBox?

我们正在使用 PDFBox 从 Java 桌面应用程序打印一些 PDF,但这些 PDF 包含太多空白(很遗憾,无法修复 PDF 生成器)。

我遇到的问题是确定页面上实际内容的位置,因为 crop/media/trim/art/bleed 框没有用。有没有比将页面渲染成图像并检查哪些像素保持白色更简单有效的方法?

正如您在评论中提到的那样

it can be assumed that there is no background or other elements that would need special handling,

我将展示一个没有任何此类特殊处理的基本解决方案。

一个基本的边界框查找器

要在不实际渲染位图和检查位图像素的情况下找到边界框,必须扫描页面内容流的所有指令以及从那里引用的任何 XObject。确定每条指令绘制的东西的边界框,并最终将它们组合成一个框。

此处介绍的简单框查找器通过简单地返回它们联合的边界框来组合它们。

对于扫描内容流的指令,PDFBox 提供了一些基于 PDFStreamEngine 的 classes。简单的框查找器派生自 PDFGraphicsStreamEngine,它通过一些与矢量图形相关的方法扩展了 PDFStreamEngine

public class BoundingBoxFinder extends PDFGraphicsStreamEngine {
    public BoundingBoxFinder(PDPage page) {
        super(page);
    }

    public Rectangle2D getBoundingBox() {
        return rectangle;
    }

    //
    // Text
    //
    @Override
    protected void showGlyph(Matrix textRenderingMatrix, PDFont font, int code, String unicode, Vector displacement)
            throws IOException {
        super.showGlyph(textRenderingMatrix, font, code, unicode, displacement);
        Shape shape = calculateGlyphBounds(textRenderingMatrix, font, code);
        if (shape != null) {
            Rectangle2D rect = shape.getBounds2D();
            add(rect);
        }
    }

    /**
     * Copy of <code>org.apache.pdfbox.examples.util.DrawPrintTextLocations.calculateGlyphBounds(Matrix, PDFont, int)</code>.
     */
    private Shape calculateGlyphBounds(Matrix textRenderingMatrix, PDFont font, int code) throws IOException
    {
        GeneralPath path = null;
        AffineTransform at = textRenderingMatrix.createAffineTransform();
        at.concatenate(font.getFontMatrix().createAffineTransform());
        if (font instanceof PDType3Font)
        {
            // It is difficult to calculate the real individual glyph bounds for type 3 fonts
            // because these are not vector fonts, the content stream could contain almost anything
            // that is found in page content streams.
            PDType3Font t3Font = (PDType3Font) font;
            PDType3CharProc charProc = t3Font.getCharProc(code);
            if (charProc != null)
            {
                BoundingBox fontBBox = t3Font.getBoundingBox();
                PDRectangle glyphBBox = charProc.getGlyphBBox();
                if (glyphBBox != null)
                {
                    // PDFBOX-3850: glyph bbox could be larger than the font bbox
                    glyphBBox.setLowerLeftX(Math.max(fontBBox.getLowerLeftX(), glyphBBox.getLowerLeftX()));
                    glyphBBox.setLowerLeftY(Math.max(fontBBox.getLowerLeftY(), glyphBBox.getLowerLeftY()));
                    glyphBBox.setUpperRightX(Math.min(fontBBox.getUpperRightX(), glyphBBox.getUpperRightX()));
                    glyphBBox.setUpperRightY(Math.min(fontBBox.getUpperRightY(), glyphBBox.getUpperRightY()));
                    path = glyphBBox.toGeneralPath();
                }
            }
        }
        else if (font instanceof PDVectorFont)
        {
            PDVectorFont vectorFont = (PDVectorFont) font;
            path = vectorFont.getPath(code);

            if (font instanceof PDTrueTypeFont)
            {
                PDTrueTypeFont ttFont = (PDTrueTypeFont) font;
                int unitsPerEm = ttFont.getTrueTypeFont().getHeader().getUnitsPerEm();
                at.scale(1000d / unitsPerEm, 1000d / unitsPerEm);
            }
            if (font instanceof PDType0Font)
            {
                PDType0Font t0font = (PDType0Font) font;
                if (t0font.getDescendantFont() instanceof PDCIDFontType2)
                {
                    int unitsPerEm = ((PDCIDFontType2) t0font.getDescendantFont()).getTrueTypeFont().getHeader().getUnitsPerEm();
                    at.scale(1000d / unitsPerEm, 1000d / unitsPerEm);
                }
            }
        }
        else if (font instanceof PDSimpleFont)
        {
            PDSimpleFont simpleFont = (PDSimpleFont) font;

            // these two lines do not always work, e.g. for the TT fonts in file 032431.pdf
            // which is why PDVectorFont is tried first.
            String name = simpleFont.getEncoding().getName(code);
            path = simpleFont.getPath(name);
        }
        else
        {
            // shouldn't happen, please open issue in JIRA
            System.out.println("Unknown font class: " + font.getClass());
        }
        if (path == null)
        {
            return null;
        }
        return at.createTransformedShape(path.getBounds2D());
    }

    //
    // Bitmaps
    //
    @Override
    public void drawImage(PDImage pdImage) throws IOException {
        Matrix ctm = getGraphicsState().getCurrentTransformationMatrix();
        for (int x = 0; x < 2; x++) {
            for (int y = 0; y < 2; y++) {
                add(ctm.transformPoint(x, y));
            }
        }
    }

    //
    // Paths
    //
    @Override
    public void appendRectangle(Point2D p0, Point2D p1, Point2D p2, Point2D p3) throws IOException {
        addToPath(p0, p1, p2, p3);
    }

    @Override
    public void clip(int windingRule) throws IOException {
    }

    @Override
    public void moveTo(float x, float y) throws IOException {
        addToPath(x, y);
    }

    @Override
    public void lineTo(float x, float y) throws IOException {
        addToPath(x, y);
    }

    @Override
    public void curveTo(float x1, float y1, float x2, float y2, float x3, float y3) throws IOException {
        addToPath(x1, y1);
        addToPath(x2, y2);
        addToPath(x3, y3);
    }

    @Override
    public Point2D getCurrentPoint() throws IOException {
        return null;
    }

    @Override
    public void closePath() throws IOException {
    }

    @Override
    public void endPath() throws IOException {
        rectanglePath = null;
    }

    @Override
    public void strokePath() throws IOException {
        addPath();
    }

    @Override
    public void fillPath(int windingRule) throws IOException {
        addPath();
    }

    @Override
    public void fillAndStrokePath(int windingRule) throws IOException {
        addPath();
    }

    @Override
    public void shadingFill(COSName shadingName) throws IOException {
    }

    void addToPath(Point2D... points) {
        Arrays.asList(points).forEach(p -> addToPath(p.getX(), p.getY()));
    }

    void addToPath(double newx, double newy) {
        if (rectanglePath == null) {
            rectanglePath = new Rectangle2D.Double(newx, newy, 0, 0);
        } else {
            rectanglePath.add(newx, newy);
        }
    }

    void addPath() {
        if (rectanglePath != null) {
            add(rectanglePath);
            rectanglePath = null;
        }
    }

    void add(Rectangle2D rect) {
        if (rectangle == null) {
            rectangle = new Rectangle2D.Double();
            rectangle.setRect(rect);
        } else {
            rectangle.add(rect);
        }
    }

    void add(Point2D... points) {
        for (Point2D point : points) {
            add(point.getX(), point.getY());
        }
    }

    void add(double newx, double newy) {
        if (rectangle == null) {
            rectangle = new Rectangle2D.Double(newx, newy, 0, 0);
        } else {
            rectangle.add(newx, newy);
        }
    }

    Rectangle2D rectanglePath = null;
    Rectangle2D rectangle = null;
}

(github 上的BoundingBoxFinder

如您所见,我从 PDFBox 示例 class.

中借用了 calculateGlyphBounds 辅助方法

用法示例

您可以像这样使用 BoundingBoxFinderPDDocument pdDocument:

的给定 PDPage pdPage 沿边界框边缘绘制边界线
void drawBoundingBox(PDDocument pdDocument, PDPage pdPage) throws IOException {
    BoundingBoxFinder boxFinder = new BoundingBoxFinder(pdPage);
    boxFinder.processPage(pdPage);
    Rectangle2D box = boxFinder.getBoundingBox();
    if (box != null) {
        try (   PDPageContentStream canvas = new PDPageContentStream(pdDocument, pdPage, AppendMode.APPEND, true, true)) {
            canvas.setStrokingColor(Color.magenta);
            canvas.addRect((float)box.getMinX(), (float)box.getMinY(), (float)box.getWidth(), (float)box.getHeight());
            canvas.stroke();
        }
    }
}

(DetermineBoundingBox辅助方法)

结果如下所示:

只是概念验证

注意,BoundingBoxFinder 真的不是很复杂;特别是它不会忽略不可见的内容,如白色背景矩形、在“不可见”渲染模式下绘制的文本、白色填充路径覆盖的任意内容、位图图像的白色部分……此外,它确实忽略了剪辑路径,很奇怪混合模式、注释、...

扩展 class 以正确处理这些情况非常简单,但要添加的代码总和将超出堆栈溢出答案的范围。


对于这个答案中的代码,我使用了当前的 PDFBox 3.0.0-SNAPSHOT 开发分支,但对于当前的 2.x 版本它也应该开箱即用。