How to dense merge PDF files using PDFBox 2 without whitespace near page breaks?

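We have been using the iText-based PdfVeryDenseMergeTool we found in this SO question to merge multiple PDF files into a single PDF file. The tool merges the PDFs without leaving any whitespace in between, and individual PDFs are split across pages where possible.

We want to port PdfVeryDenseMergeTool to PDFBox. We found a PDFBox 2 based PdfDenseMergeTool, which merges PDFs like this:

Individual PDFs:

Dense-merged PDF:

We are looking for something like this (this is what the iText-based PdfVeryDenseMergeTool already produces, but we want to achieve it with PDFBox 2):

While attempting the port, we found that PdfVeryDenseMergeTool uses a PageVerticalAnalyzer, which extends an iText PDF render listener and is notified each time something is drawn in the PDF. All of that render information is then used to split a single PDF across multiple pages. We looked for a similar PDF render listener in PDFBox 2, but found that the available PDFRenderer class only has image rendering methods. So we are not sure how to port PageVerticalAnalyzer to PDFBox.

If anyone can suggest a way forward, we would greatly appreciate the help.

Thanks a lot!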

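Edit, 7 February 2020

Currently we are extending PDFGraphicsStreamEngine from PDFBox to build a custom rendering engine that keeps track of images, text lines, and arcs as they are drawn. This custom engine will be a port of PageVerticalAnalyzer. After that, we hope to be able to port PdfVeryDenseMergeTool to PDFBox.

Edit, 8 February 2020

Here is a very simple port of PageVerticalAnalyzer that handles images and text. I am new to PDFBox, so my logic for handling images may be odd. This is the basic approach:

Text: for each glyph printed, get bottomY, set topY = bottomY + charHeight, and mark those top/bottom points.

Image: for each call to drawImage(), there appear to be two ways to find out where it is drawn. The first is to use the coordinates from the last call to appendRectangle(); the second is to use the last sequence of moveTo(), multiple lineTo(), and closePath() calls. I give priority to the latter. If I cannot find any path (I found one in one PDF; in another, I found only appendRectangle() before drawImage()), I use the former. If neither exists, I do not know what to do. This is how I assume PDFBox marks image coordinates using moveTo()/lineTo()/closePath():

This is my current implementation: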

import java.awt.geom.Point2D;
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.pdfbox.contentstream.PDFGraphicsStreamEngine;
import org.apache.pdfbox.cos.COSArray;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.font.PDFont;
import org.apache.pdfbox.pdmodel.graphics.image.PDImage;
import org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotation;
import org.apache.pdfbox.util.Matrix;
import org.apache.pdfbox.util.Vector;


public class PageVerticalAnalyzer extends PDFGraphicsStreamEngine
{
    /**
     * This is a port of iText based PageVerticalAnalyzer found here
     * https://github.com/mkl-public/testarea-itext5/blob/master/src/main/java/mkl/testarea/itext5/merge/PageVerticalAnalyzer.java
     *
     * @param page PDF Page
     */
    protected PageVerticalAnalyzer(PDPage page)
    {
        super(page);
    }

    public static void main(String[] args) throws IOException
    {
        File file = new File("q2.pdf");

        try (PDDocument doc = PDDocument.load(file))
        {
            PDPage page = doc.getPage(0);
            PageVerticalAnalyzer engine = new PageVerticalAnalyzer(page);
            engine.run();

            System.out.println(engine.verticalFlips);
        }
    }

    /**
     * Runs the engine on the current page.
     *
     * @throws IOException If there is an IO error while drawing the page.
     */
    public void run() throws IOException
    {
        processPage(getPage());

        for (PDAnnotation annotation : getPage().getAnnotations())
        {
            showAnnotation(annotation);
        }
    }

    // All path related stuff

    @Override
    public void clip(int windingRule) throws IOException
    {
        System.out.println("clip");
    }

    @Override
    public void moveTo(float x, float y) throws IOException
    {
        System.out.printf("moveTo %.2f %.2f%n", x, y);
        // bottom is not known yet; closePath() fills it in
        lastPathBottomTop = new float[] {Float.NaN, y};
    }

    @Override
    public void lineTo(float x, float y) throws IOException
    {
        System.out.printf("lineTo %.2f %.2f%n", x, y);
        lastLineTo = new float[] {x, y};
    }

    @Override
    public void curveTo(float x1, float y1, float x2, float y2, float x3, float y3) throws IOException
    {
        System.out.printf("curveTo %.2f %.2f, %.2f %.2f, %.2f %.2f%n", x1, y1, x2, y2, x3, y3);
    }

    @Override
    public Point2D getCurrentPoint() throws IOException
    {
        // if you want to build paths, you'll need to keep track of this like PageDrawer does
        return new Point2D.Float(0, 0);
    }

    @Override
    public void closePath() throws IOException
    {
        System.out.println("closePath");
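        // only record the bottom if this path was started with moveTo()/lineTo()
        if (lastPathBottomTop != null && lastLineTo != null) {
            lastPathBottomTop[0] = lastLineTo[1];
        }
        lastLineTo = null;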
    }

    @Override
    public void endPath() throws IOException
    {
        System.out.println("endPath");
    }

    @Override
    public void strokePath() throws IOException
    {
        System.out.println("strokePath");
    }

    @Override
    public void fillPath(int windingRule) throws IOException
    {
        System.out.println("fillPath");
    }

    @Override
    public void fillAndStrokePath(int windingRule) throws IOException
    {
        System.out.println("fillAndStrokePath");
    }

    @Override
    public void shadingFill(COSName shadingName) throws IOException
    {
        System.out.println("shadingFill " + shadingName.toString());
    }

    // Rectangle related stuff

    @Override
    public void appendRectangle(Point2D p0, Point2D p1, Point2D p2, Point2D p3) throws IOException
    {
        System.out.printf("appendRectangle %.2f %.2f, %.2f %.2f, %.2f %.2f, %.2f %.2f%n",
                p0.getX(), p0.getY(), p1.getX(), p1.getY(),
                p2.getX(), p2.getY(), p3.getX(), p3.getY());

        lastRectBottomTop = new float[] {(float) p0.getY(), (float) p3.getY()};
    }

    // Image drawing

    @Override
    public void drawImage(PDImage pdImage) throws IOException
    {
        System.out.println("drawImage");
        if (lastPathBottomTop != null) {
            addVerticalUseSection(lastPathBottomTop[0], lastPathBottomTop[1]);  
        } else if (lastRectBottomTop != null ){
            addVerticalUseSection(lastRectBottomTop[0], lastRectBottomTop[1]);
        } else {
            throw new Error("Drawing image without last reference!");
        }

        lastPathBottomTop = null;
        lastRectBottomTop = null;

    }

    // All text related stuff

    @Override
    public void showTextString(byte[] string) throws IOException
    {
        System.out.print("showTextString \"");
        super.showTextString(string);
        System.out.println("\"");
    }

    @Override
    public void showTextStrings(COSArray array) throws IOException
    {
        System.out.print("showTextStrings \"");
        super.showTextStrings(array);
        System.out.println("\"");
    }

    @Override
    protected void showGlyph(Matrix textRenderingMatrix, PDFont font, int code, String unicode,
                             Vector displacement) throws IOException
    {
        // print the actual character that is being rendered 
        System.out.print(unicode);

        super.showGlyph(textRenderingMatrix, font, code, unicode, displacement);

        // rendering matrix seems to contain bounding box of dimensions the char
        // and an x/y point where bounding box starts
        //System.out.println(textRenderingMatrix.toString());

        // y of the bottom of the char 
        // not sure why the y value is in the 8th column
        // when I print the matrix, it shows up in the 6th column
        float yBottom = textRenderingMatrix.getValue(0, 7);

        // height of the char
        // using the value in the first column as the char height
        float yTop =  yBottom + textRenderingMatrix.getValue(0, 0);

        addVerticalUseSection(yBottom, yTop);
    }

    // Keeping track of bottom/top point pairs
    void addVerticalUseSection(float from, float to)
    {
        if (to < from)
        {
            float temp = to;
            to = from;
            from = temp;
        }

        int i=0, j=0;
        for (; i<verticalFlips.size(); i++)
        {
            float flip = verticalFlips.get(i);
            if (flip < from)
                continue;

            for (j=i; j<verticalFlips.size(); j++)
            {
                flip = verticalFlips.get(j);
                if (flip < to)
                    continue;
                break;
            }
            break;
        }
        boolean fromOutsideInterval = i%2==0;
        boolean toOutsideInterval = j%2==0;

        while (j-- > i)
            verticalFlips.remove(j);
        if (toOutsideInterval)
            verticalFlips.add(i, to);
        if (fromOutsideInterval)
            verticalFlips.add(i, from);
    }

    final List<Float> verticalFlips = new ArrayList<Float>();
    private float[] lastRectBottomTop;
    private float[] lastPathBottomTop;
    private float[] lastLineTo;

}

I am looking for answers to the following questions:

This answer has the same issues as the original iText version.

Port of PageVerticalAnalyzer

One can port PageVerticalAnalyzer from iText to PDFBox as follows:

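import java.awt.Shape;
import java.awt.geom.AffineTransform;
import java.awt.geom.GeneralPath;
import java.awt.geom.Point2D;
import java.awt.geom.Rectangle2D;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.fontbox.util.BoundingBox;
import org.apache.pdfbox.contentstream.PDFGraphicsStreamEngine;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.common.PDRectangle;
import org.apache.pdfbox.pdmodel.font.PDCIDFontType2;
import org.apache.pdfbox.pdmodel.font.PDFont;
import org.apache.pdfbox.pdmodel.font.PDSimpleFont;
import org.apache.pdfbox.pdmodel.font.PDTrueTypeFont;
import org.apache.pdfbox.pdmodel.font.PDType0Font;
import org.apache.pdfbox.pdmodel.font.PDType3CharProc;
import org.apache.pdfbox.pdmodel.font.PDType3Font;
import org.apache.pdfbox.pdmodel.font.PDVectorFont;
import org.apache.pdfbox.pdmodel.graphics.image.PDImage;
import org.apache.pdfbox.util.Matrix;
import org.apache.pdfbox.util.Vector;

public class PageVerticalAnalyzer extends PDFGraphicsStreamEngine {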
    protected PageVerticalAnalyzer(PDPage page) {
        super(page);
    }

    public List<Float> getVerticalFlips() {
        return verticalFlips;
    }

    //
    // Text
    //
    @Override
    protected void showGlyph(Matrix textRenderingMatrix, PDFont font, int code, String unicode, Vector displacement)
            throws IOException {
        super.showGlyph(textRenderingMatrix, font, code, unicode, displacement);
        Shape shape = calculateGlyphBounds(textRenderingMatrix, font, code);
        if (shape != null) {
            Rectangle2D rect = shape.getBounds2D();
            addVerticalUseSection(rect.getMinY(), rect.getMaxY());
        }
    }

    /**
     * Copy of <code>org.apache.pdfbox.examples.util.DrawPrintTextLocations.calculateGlyphBounds(Matrix, PDFont, int)</code>.
     */
    private Shape calculateGlyphBounds(Matrix textRenderingMatrix, PDFont font, int code) throws IOException
    {
        GeneralPath path = null;
        AffineTransform at = textRenderingMatrix.createAffineTransform();
        at.concatenate(font.getFontMatrix().createAffineTransform());
        if (font instanceof PDType3Font)
        {
            // It is difficult to calculate the real individual glyph bounds for type 3 fonts
            // because these are not vector fonts, the content stream could contain almost anything
            // that is found in page content streams.
            PDType3Font t3Font = (PDType3Font) font;
            PDType3CharProc charProc = t3Font.getCharProc(code);
            if (charProc != null)
            {
                BoundingBox fontBBox = t3Font.getBoundingBox();
                PDRectangle glyphBBox = charProc.getGlyphBBox();
                if (glyphBBox != null)
                {
                    // PDFBOX-3850: glyph bbox could be larger than the font bbox
                    glyphBBox.setLowerLeftX(Math.max(fontBBox.getLowerLeftX(), glyphBBox.getLowerLeftX()));
                    glyphBBox.setLowerLeftY(Math.max(fontBBox.getLowerLeftY(), glyphBBox.getLowerLeftY()));
                    glyphBBox.setUpperRightX(Math.min(fontBBox.getUpperRightX(), glyphBBox.getUpperRightX()));
                    glyphBBox.setUpperRightY(Math.min(fontBBox.getUpperRightY(), glyphBBox.getUpperRightY()));
                    path = glyphBBox.toGeneralPath();
                }
            }
        }
        else if (font instanceof PDVectorFont)
        {
            PDVectorFont vectorFont = (PDVectorFont) font;
            path = vectorFont.getPath(code);

            if (font instanceof PDTrueTypeFont)
            {
                PDTrueTypeFont ttFont = (PDTrueTypeFont) font;
                int unitsPerEm = ttFont.getTrueTypeFont().getHeader().getUnitsPerEm();
                at.scale(1000d / unitsPerEm, 1000d / unitsPerEm);
            }
            if (font instanceof PDType0Font)
            {
                PDType0Font t0font = (PDType0Font) font;
                if (t0font.getDescendantFont() instanceof PDCIDFontType2)
                {
                    int unitsPerEm = ((PDCIDFontType2) t0font.getDescendantFont()).getTrueTypeFont().getHeader().getUnitsPerEm();
                    at.scale(1000d / unitsPerEm, 1000d / unitsPerEm);
                }
            }
        }
        else if (font instanceof PDSimpleFont)
        {
            PDSimpleFont simpleFont = (PDSimpleFont) font;

            // these two lines do not always work, e.g. for the TT fonts in file 032431.pdf
            // which is why PDVectorFont is tried first.
            String name = simpleFont.getEncoding().getName(code);
            path = simpleFont.getPath(name);
        }
        else
        {
            // shouldn't happen, please open issue in JIRA
            System.out.println("Unknown font class: " + font.getClass());
        }
        if (path == null)
        {
            return null;
        }
        return at.createTransformedShape(path.getBounds2D());
    }

    //
    // Bitmaps
    //
    @Override
    public void drawImage(PDImage pdImage) throws IOException {
        Matrix ctm = getGraphicsState().getCurrentTransformationMatrix();
        Section section = null;
        for (int x = 0; x < 2; x++) {
            for (int y = 0; y < 2; y++) {
                Point2D.Float point = ctm.transformPoint(x, y);
                if (section == null)
                    section = new Section(point.y);
                else
                    section.extendTo(point.y);
            }
        }
        addVerticalUseSection(section.from, section.to);
    }

    //
    // Paths
    //
    @Override
    public void appendRectangle(Point2D p0, Point2D p1, Point2D p2, Point2D p3) throws IOException {
        subPath = null;
        Section section = new Section(p0.getY());
        section.extendTo(p1.getY()).extendTo(p2.getY()).extendTo(p3.getY());
        // register the rectangle so that a later stroke or fill counts it
        path.add(section);
        currentPoint = p0;
    }

    @Override
    public void clip(int windingRule) throws IOException {
    }

    @Override
    public void moveTo(float x, float y) throws IOException {
        subPath = new Section(y);
        path.add(subPath);
        currentPoint = new Point2D.Float(x, y);
    }

    @Override
    public void lineTo(float x, float y) throws IOException {
        if (subPath == null) {
            subPath = new Section(y);
            path.add(subPath);
        } else
            subPath.extendTo(y);
        currentPoint = new Point2D.Float(x, y);
    }

    /**
     * Beware! This is incorrect! The control points may be outside
     * the vertically used range 
     */
    @Override
    public void curveTo(float x1, float y1, float x2, float y2, float x3, float y3) throws IOException {
        if (subPath == null) {
            subPath = new Section(y1);
            path.add(subPath);
        } else
            subPath.extendTo(y1);
        subPath.extendTo(y2).extendTo(y3);
        currentPoint = new Point2D.Float(x3, y3);
    }

    @Override
    public Point2D getCurrentPoint() throws IOException {
        return currentPoint;
    }

    @Override
    public void closePath() throws IOException {
    }

    @Override
    public void endPath() throws IOException {
        path.clear();
        subPath = null;
    }

    @Override
    public void strokePath() throws IOException {
        for (Section section : path) {
            addVerticalUseSection(section.from, section.to);
        }
        path.clear();
        subPath = null;
    }

    @Override
    public void fillPath(int windingRule) throws IOException {
        for (Section section : path) {
            addVerticalUseSection(section.from, section.to);
        }
        path.clear();
        subPath = null;
    }

    @Override
    public void fillAndStrokePath(int windingRule) throws IOException {
        for (Section section : path) {
            addVerticalUseSection(section.from, section.to);
        }
        path.clear();
        subPath = null;
    }

    @Override
    public void shadingFill(COSName shadingName) throws IOException {
        // TODO Auto-generated method stub
    }

    Point2D currentPoint = null;

    List<Section> path = new ArrayList<Section>();
    Section subPath = null;

    static class Section {
        Section(double value) {
            this((float)value);
        }

        Section(float value) {
            from = value;
            to = value;
        }

        Section extendTo(double value) {
            return extendTo((float)value);
        }

        Section extendTo(float value) {
            if (value < from)
                from = value;
            else if (value > to)
                to = value;
            return this;
        }

        private float from;
        private float to;
    }

    void addVerticalUseSection(double from, double to) {
        addVerticalUseSection((float)from, (float)to);
    }

    void addVerticalUseSection(float from, float to) {
        if (to < from) {
            float temp = to;
            to = from;
            from = temp;
        }

        int i=0, j=0;
        for (; i<verticalFlips.size(); i++) {
            float flip = verticalFlips.get(i);
            if (flip < from)
                continue;

            for (j=i; j<verticalFlips.size(); j++) {
                flip = verticalFlips.get(j);
                if (flip < to)
                    continue;
                break;
            }
            break;
        }
        boolean fromOutsideInterval = i%2==0;
        boolean toOutsideInterval = j%2==0;

        while (j-- > i)
            verticalFlips.remove(j);
        if (toOutsideInterval)
            verticalFlips.add(i, to);
        if (fromOutsideInterval)
            verticalFlips.add(i, from);
    }

    final List<Float> verticalFlips = new ArrayList<Float>();
}

(PageVerticalAnalyzer.java)
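
To make the role of verticalFlips easier to see, here is a small stand-alone sketch that copies the addVerticalUseSection logic from the class above into a hypothetical VerticalFlipsDemo class without any PDFBox dependency. The list holds alternating start and end values of the vertical ranges in use, sorted ascending, and overlapping ranges are merged as they are added. The class name and the sample values are assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.List;

public class VerticalFlipsDemo {
    // Even indices hold the start of a used range, odd indices hold its end.
    static final List<Float> verticalFlips = new ArrayList<>();

    // Same logic as addVerticalUseSection(float, float) in PageVerticalAnalyzer.
    static void addVerticalUseSection(float from, float to) {
        if (to < from) {
            float temp = to;
            to = from;
            from = temp;
        }

        // i: first flip that is not below 'from'; j: first flip that is not below 'to'
        int i = 0, j = 0;
        for (; i < verticalFlips.size(); i++) {
            float flip = verticalFlips.get(i);
            if (flip < from)
                continue;

            for (j = i; j < verticalFlips.size(); j++) {
                flip = verticalFlips.get(j);
                if (flip < to)
                    continue;
                break;
            }
            break;
        }
        // An even index means the value lies outside every range recorded so far.
        boolean fromOutsideInterval = i % 2 == 0;
        boolean toOutsideInterval = j % 2 == 0;

        // Drop all flips that the new range swallows, then add its own ends if needed.
        while (j-- > i)
            verticalFlips.remove(j);
        if (toOutsideInterval)
            verticalFlips.add(i, to);
        if (fromOutsideInterval)
            verticalFlips.add(i, from);
    }

    public static void main(String[] args) {
        addVerticalUseSection(10, 20);
        addVerticalUseSection(30, 40);
        System.out.println(verticalFlips); // [10.0, 20.0, 30.0, 40.0]

        // This range overlaps both existing ones, so all three collapse into one.
        addVerticalUseSection(15, 35);
        System.out.println(verticalFlips); // [10.0, 40.0]
    }
}
```

Every pair of neighbouring values (even index, odd index) is one strip of the page that contains content; the space between an odd index and the next even index is free and is where the merge tool may cut.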

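The implementation is actually similar to that of the BoundingBoxFinder from another answer. Just like there, I borrowed from the PDFBox example DrawPrintTextLocations to determine the text outlines.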

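Furthermore, the curveTo handling has the same problem as in the original iText 5 PageVerticalAnalyzer: the control points are treated as if they were on the actual curve, but usually they are not and can lie far outside the vertical range the curve really uses. One could use the corresponding AWT classes instead of the path handling implemented here, but that may not be possible on platforms such as Android.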
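
As an illustration of the control point issue, the following stand-alone sketch computes the vertical range a cubic Bézier segment really covers by evaluating the curve at the zeros of its derivative. It is not part of the ported class; the class name CubicBezierExtent and its methods are hypothetical, and it only uses plain arithmetic, so it does not depend on AWT.

```java
public class CubicBezierExtent {
    // y coordinate of the cubic Bezier curve at parameter t (0 <= t <= 1)
    static double at(double y0, double y1, double y2, double y3, double t) {
        double u = 1 - t;
        return u * u * u * y0 + 3 * u * u * t * y1 + 3 * u * t * t * y2 + t * t * t * y3;
    }

    /** Returns {minY, maxY} actually reached by the curve for t in [0, 1]. */
    static double[] verticalExtent(double y0, double y1, double y2, double y3) {
        // The end points are always on the curve.
        double min = Math.min(y0, y3);
        double max = Math.max(y0, y3);

        // y'(t) / 3 = a*t^2 + b*t + c
        double a = y3 - 3 * y2 + 3 * y1 - y0;
        double b = 2 * (y2 - 2 * y1 + y0);
        double c = y1 - y0;

        double[] roots;
        if (Math.abs(a) < 1e-12) {
            // The derivative is linear (or constant).
            roots = Math.abs(b) < 1e-12 ? new double[0] : new double[] { -c / b };
        } else {
            double disc = b * b - 4 * a * c;
            if (disc < 0) {
                roots = new double[0];
            } else {
                double s = Math.sqrt(disc);
                roots = new double[] { (-b + s) / (2 * a), (-b - s) / (2 * a) };
            }
        }

        // Only extrema strictly inside the segment can extend the range.
        for (double t : roots) {
            if (t > 0 && t < 1) {
                double y = at(y0, y1, y2, y3, t);
                min = Math.min(min, y);
                max = Math.max(max, y);
            }
        }
        return new double[] { min, max };
    }

    public static void main(String[] args) {
        // Curve starts and ends at y = 0, both control points are at y = 100.
        double[] extent = verticalExtent(0, 100, 100, 0);
        System.out.println(extent[0] + " .. " + extent[1]); // 0.0 .. 75.0
    }
}
```

In this example the control points suggest a range up to 100, while the curve itself only reaches 75. In curveTo one could, under the assumption that a current point has been set, pass currentPoint.getY() as y0 and extend the sub-path section with the two returned values instead of y1 and y2.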

Just like there, the class ignores annotations, but the iText 5 dense merge ignored annotations as well. This class also ignores clip paths...

Port of PdfVeryDenseMergeTool

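import java.io.IOException;
import java.io.OutputStream;
import java.util.List;

import org.apache.pdfbox.multipdf.LayerUtility;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDPageContentStream;
import org.apache.pdfbox.pdmodel.common.PDRectangle;
import org.apache.pdfbox.pdmodel.graphics.form.PDFormXObject;
import org.apache.pdfbox.util.Matrix;

public class PdfVeryDenseMergeTool {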
    public PdfVeryDenseMergeTool(PDRectangle size, float top, float bottom, float gap)
    {
        this.pageSize = size;
        this.topMargin = top;
        this.bottomMargin = bottom;
        this.gap = gap;
    }

    public void merge(OutputStream outputStream, Iterable<PDDocument> inputs) throws IOException
    {
        try
        {
            openDocument();
            for (PDDocument input: inputs)
            {
                merge(input);
            }
            if (currentContents != null) {
                currentContents.close();
                currentContents = null;
            }
            document.save(outputStream);
        }
        finally
        {
            closeDocument();
        }
        
    }

    void openDocument() throws IOException
    {
        document = new PDDocument();
        newPage();
    }

    void closeDocument() throws IOException
    {
        try
        {
            if (currentContents != null) {
                currentContents.close();
                currentContents = null;
            }
            document.close();
        }
        finally
        {
            this.document = null;
            this.yPosition = 0;
        }
    }
    
    void newPage() throws IOException
    {
        if (currentContents != null) {
            currentContents.close();
            currentContents = null;
        }
        currentPage = new PDPage(pageSize);
        document.addPage(currentPage);
        yPosition = pageSize.getUpperRightY() - topMargin;
        currentContents = new PDPageContentStream(document, currentPage);
    }

    void merge(PDDocument input) throws IOException
    {
        for (PDPage page : input.getPages())
        {
            merge(input, page);
        }
    }

    void merge(PDDocument sourceDoc, PDPage page) throws IOException
    {
        PDRectangle pageSizeToImport = page.getCropBox();

        PageVerticalAnalyzer analyzer = new PageVerticalAnalyzer(page);
        analyzer.processPage(page);
        List<Float> verticalFlips = analyzer.getVerticalFlips();
        if (verticalFlips.size() < 2)
            return;

        LayerUtility layerUtility = new LayerUtility(document);
        PDFormXObject form = layerUtility.importPageAsForm(sourceDoc, page);

        int startFlip = verticalFlips.size() - 1;
        boolean first = true;
        while (startFlip > 0)
        {
            if (!first)
                newPage();

            float freeSpace = yPosition - pageSize.getLowerLeftY() - bottomMargin;
            int endFlip = startFlip + 1;
            while ((endFlip > 1) && (verticalFlips.get(startFlip) - verticalFlips.get(endFlip - 2) < freeSpace))
                endFlip -=2;
            if (endFlip < startFlip)
            {
                float height = verticalFlips.get(startFlip) - verticalFlips.get(endFlip);

                currentContents.saveGraphicsState();
                currentContents.addRect(0, yPosition - height, pageSizeToImport.getWidth(), height);
                currentContents.clip();
                Matrix matrix = Matrix.getTranslateInstance(0, (float)(yPosition - (verticalFlips.get(startFlip) - pageSizeToImport.getLowerLeftY())));
                currentContents.transform(matrix);
                currentContents.drawForm(form);
                currentContents.restoreGraphicsState();

                yPosition -= height + gap;
                startFlip = endFlip - 1;
            }
            else if (!first) 
                throw new IllegalArgumentException(String.format("Page %s content sections too large.", page));
            first = false;
        }
    }

    PDDocument document = null;
    PDPage currentPage = null;
    PDPageContentStream currentContents = null;
    float yPosition = 0; 

    final PDRectangle pageSize;
    final float topMargin;
    final float bottomMargin;
    final float gap;
}

(PdfVeryDenseMergeTool.java)
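
The loop in merge(PDDocument, PDPage) can be hard to follow among the drawing calls, so here is a stand-alone sketch of just the selection logic for a single source page. It walks the flips from the top of the source page downwards, first takes as many content ranges as fit into the space left on the current target page, and then starts a new target page for each further chunk. The class DenseLayoutSketch, the method layout, and its parameters startY, pageTop, and bottomLimit are hypothetical names; the PDFBox clipping and form drawing are replaced by a textual description of each chunk.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class DenseLayoutSketch {
    /**
     * @param flips       ascending flips of one source page, as returned by the analyzer
     * @param startY      y position on the current target page where content may start
     * @param pageTop     y position where content starts on a fresh target page
     * @param bottomLimit lowest y position content may reach on a target page
     * @param gap         vertical gap inserted after each chunk
     */
    static List<String> layout(List<Float> flips, float startY, float pageTop,
                               float bottomLimit, float gap) {
        List<String> chunks = new ArrayList<>();
        int page = 0;
        float yPosition = startY;
        int startFlip = flips.size() - 1; // topmost flip of the source page
        boolean first = true;
        while (startFlip > 0) {
            if (!first) {
                // Every chunk after the first attempt starts on a new target page.
                page++;
                yPosition = pageTop;
            }
            float freeSpace = yPosition - bottomLimit;
            // Extend the chunk downwards, one content range at a time, while it fits.
            int endFlip = startFlip + 1;
            while (endFlip > 1 && flips.get(startFlip) - flips.get(endFlip - 2) < freeSpace)
                endFlip -= 2;
            if (endFlip < startFlip) {
                float height = flips.get(startFlip) - flips.get(endFlip);
                chunks.add("page " + page + ": source y "
                        + flips.get(endFlip) + ".." + flips.get(startFlip));
                yPosition -= height + gap;
                startFlip = endFlip - 1;
            } else if (!first) {
                // Not even a single range fits on an empty page.
                throw new IllegalArgumentException("Content section too large.");
            }
            first = false;
        }
        return chunks;
    }

    public static void main(String[] args) {
        // Three content ranges: 100..200, 300..450, and 500..700.
        List<Float> flips = Arrays.asList(100f, 200f, 300f, 450f, 500f, 700f);
        // 260 units are left on the current page; a fresh page offers 500.
        for (String chunk : layout(flips, 260f, 500f, 0f, 10f))
            System.out.println(chunk);
        // page 0: source y 500.0..700.0
        // page 1: source y 100.0..450.0
    }
}
```

Note that a chunk spans from its topmost to its bottommost flip, so whitespace between content ranges inside one chunk is kept; the free ranges only determine where a source page may be cut.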

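This is essentially a straightforward port of the iText 5 PdfVeryDenseMergeTool, nothing special.

Usage of PdfVeryDenseMergeTool

Simply create a PdfVeryDenseMergeTool instance with the format information, then start merging using PDDocument instances as sources: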

PDDocument document1 = ...;
...
PDDocument documentN = ...;

PdfVeryDenseMergeTool tool = new PdfVeryDenseMergeTool(PDRectangle.A4, 30, 30, 10);
tool.merge(new FileOutputStream(RESULT_FILE), Arrays.asList(document1, ..., documentN));

(DenseMerging test testVeryDenseMerging)