PDFBox: do PDDocument and PDPage have references to one another?

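Does a PDPage object contain a reference to the PDDocument it belongs to?
In other words, does a PDPage know about its PDDocument?
Somewhere in the application I have a list of PDDocuments.
These documents are merged into a single new PDDocument: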

PDFMergerUtility pdfMerger = new PDFMergerUtility();

PDDocument mergedPDDocument = new PDDocument();
for (PDDocument pdfDocument : documentList) {
    pdfMerger.appendDocument(mergedPDDocument, pdfDocument);
}

This PDDocument is then split into bundles of 10 pages:

Splitter splitter = new Splitter();
splitter.setSplitAtPage(bundleSize);
List<PDDocument> bundleList = splitter.split(mergedPDDocument);

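My question now is:
If I loop over the pages of these split PDDocuments in the list, is there a way to know which PDDocument a page originally belonged to?

Also, if you have a PDPage object, can you get information from it, such as its page number? Or can you get that some other way?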

  1. Does a PDPage object contain a reference to the PDDocument it belongs to? In other words, does a PDPage know about its PDDocument?

Unfortunately, a PDPage does not contain a reference to its parent PDDocument. Its dictionary does, however, reference the page tree node it hangs from, and that node lists the page's siblings, so you can navigate between pages without a reference to the PDDocument.

  1. If you have a PDPage object, can you get information from it, such as its page number, or can you get that some other way?

There is a workaround for finding the position of a PDPage in the document without having the PDDocument available. Each PDPage wraps a dictionary that describes the page: its size, resources, fonts, content, and so on. One of the entries is called Parent. It points to a Pages dictionary whose Kids array holds the dictionaries of the pages under that node, and each of those dictionaries is enough to create a shallow clone of the page with the constructor PDPage(COSDictionary). The entries are in page order, so the page number can be obtained from the position of the page in the array. Note that this position is relative to the Parent node, so it only matches the document page number when all pages hang directly from a single Pages node, as in the example below.
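For a nested page tree, the absolute page number can still be derived by walking up through the Parent links and adding the page counts of all earlier siblings at each level. The sketch below models this in plain Java rather than with PDFBox classes: `PageTreeNode`, `add`, `count`, and `pageIndex` are hypothetical names, and the model assumes that every node without kids is a page.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a page tree node; this is not a PDFBox class.
final class PageTreeNode {
    final List<PageTreeNode> kids = new ArrayList<>(); // empty for a leaf page
    PageTreeNode parent;                               // null for the tree root

    // Adds a child and sets its parent link, mirroring the Kids and Parent entries.
    PageTreeNode add(PageTreeNode kid) {
        kid.parent = this;
        kids.add(kid);
        return this;
    }

    // Number of leaf pages below this node (the role of the Count entry); 1 for a page.
    int count() {
        if (kids.isEmpty()) {
            return 1;
        }
        int total = 0;
        for (PageTreeNode kid : kids) {
            total += kid.count();
        }
        return total;
    }

    // Zero-based page number: add the page counts of every earlier sibling at each level.
    static int pageIndex(PageTreeNode page) {
        int index = 0;
        for (PageTreeNode node = page; node.parent != null; node = node.parent) {
            for (PageTreeNode sibling : node.parent.kids) {
                if (sibling == node) {
                    break;
                }
                index += sibling.count();
            }
        }
        return index;
    }

    public static void main(String[] args) {
        PageTreeNode p0 = new PageTreeNode();
        PageTreeNode p1 = new PageTreeNode();
        PageTreeNode p2 = new PageTreeNode();
        PageTreeNode inner = new PageTreeNode().add(p1).add(p2);
        PageTreeNode root = new PageTreeNode().add(p0).add(inner);
        // Position of p2 among its siblings versus its position in the whole tree.
        System.out.println("index in Kids: " + inner.kids.indexOf(p2)
                + ", page index: " + pageIndex(p2));
    }
}
```

In this model the third page sits at position 1 of its parent's Kids array but is page 2 of the whole tree, which is why the position in Kids alone is only reliable for a flat page tree.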

  1. If I loop over the pages of these split PDDocuments in the list, is there a way to know which PDDocument a page originally belonged to?

Once you merge the document list into a single document, all references to the original documents are lost. You can confirm this by inspecting the Parent object inside the PDPage: follow Parent > Kids > COSObject[n] > Parent and check whether the Parent value is the same for every element in the array. In this example it is COSName {Parent} : 1781256139; for all pages.

COSName {Parent} : COSObject {
  COSDictionary {
    COSName {Kids} : COSArray {
      COSObject {
        COSDictionary {
          COSName {TrimBox} : COSArray {0; 0; 612; 792;};
          COSName {MediaBox} : COSArray {0; 0; 612; 792;};
          COSName {CropBox} : COSArray {0; 0; 612; 792;};
          COSName {Resources} : COSDictionary {
            ...
          };
          COSName {Contents} : COSObject {
            ...
          };
          COSName {Parent} : 1781256139;
          COSName {StructParents} : COSInt {68};
          COSName {ArtBox} : COSArray {0; 0; 612; 792; };
          COSName {BleedBox} : COSArray {0; 0; 612; 792; };
          COSName {Type} : COSName {Page};
        }
      }

      ...
    };

    COSName {Count} : COSInt {4};
    COSName {Type} : COSName {Pages};
  }
};
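Since the merged page tree carries no trace of the source documents, one option is to record the page count of each source document before merging and map page positions back afterwards. The sketch below is plain Java with no PDFBox calls. It assumes the pages are appended in list order and that every bundle except possibly the last holds exactly `bundleSize` pages. `PageOrigin`, `mergedIndex`, and `sourceIndexOf` are hypothetical names; in real code the counts could be collected with `PDDocument.getNumberOfPages()` before the merge.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: maps a page of a split bundle back to its source document.
final class PageOrigin {

    // Position of a page in the merged document, given its bundle and its index inside that bundle.
    static int mergedIndex(int bundleIndex, int indexInBundle, int bundleSize) {
        return bundleIndex * bundleSize + indexInBundle;
    }

    // Index of the source document that contributed the page at mergedIndex,
    // or -1 if mergedIndex is out of range.
    static int sourceIndexOf(List<Integer> pageCounts, int mergedIndex) {
        if (mergedIndex < 0) {
            return -1;
        }
        int firstPageOfNext = 0;
        for (int i = 0; i < pageCounts.size(); i++) {
            firstPageOfNext += pageCounts.get(i);
            if (mergedIndex < firstPageOfNext) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        List<Integer> pageCounts = Arrays.asList(4, 7, 2); // assumed page counts of three sources
        int bundleSize = 10;
        int merged = mergedIndex(1, 0, bundleSize); // first page of the second bundle
        System.out.println("merged index " + merged + " came from source document "
                + sourceIndexOf(pageCounts, merged));
    }
}
```

Keeping this bookkeeping outside the PDF avoids depending on anything the merge or split step may rewrite inside the page dictionaries.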

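Source code

I wrote the following code to show how the information in the PDPage dictionary can be used to navigate back and forth between pages and to get the page number from the position in the array.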

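import java.io.File;
import java.io.IOException;
import java.util.LinkedList;
import java.util.List;
import java.util.ListIterator;

import org.apache.pdfbox.cos.COSArray;
import org.apache.pdfbox.cos.COSBase;
import org.apache.pdfbox.cos.COSDictionary;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.cos.COSObject;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.encryption.InvalidPasswordException;

public class PDPageUtils {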
    public static void main(String[] args) throws InvalidPasswordException, IOException {
        System.setProperty("sun.java2d.cmm", "sun.java2d.cmm.kcms.KcmsServiceProvider");

        PDDocument document = null;
        try {
            String filename = "src/main/resources/pdf/us-017.pdf";
            document = PDDocument.load(new File(filename));

            System.out.println("listIterator(PDPage)");
            ListIterator<PDPage> pageIterator = listIterator(document.getPage(0));
            while (pageIterator.hasNext()) {
                PDPage page = pageIterator.next();
                // After next(), previousIndex() is the index of the page just returned.
                System.out.println("page #: " + pageIterator.previousIndex() + ", Structural Parent Key: " + page.getStructParents());
            }
        } finally {
            if (document != null) {
                document.close();
            }
        }
    }

    /**
     * Returns a <code>ListIterator</code> over the pages listed in the Kids array
     * of the parent node of the specified <code>PDPage</code>. The current
     * position of this <code>ListIterator</code> is set to the position of the
     * specified <code>PDPage</code>.
     *
     * @param page the specified <code>PDPage</code>
     * @return a <code>ListIterator</code> positioned at <code>page</code>
     *
     * @see java.util.ListIterator
     * @see org.apache.pdfbox.pdmodel.PDPage
     */
    public static ListIterator<PDPage> listIterator(PDPage page) {
        List<PDPage> pages = new LinkedList<PDPage>();

        COSDictionary pageDictionary = page.getCOSObject();
        COSDictionary parentDictionary = pageDictionary.getCOSDictionary(COSName.PARENT);
        COSArray kidsArray = parentDictionary.getCOSArray(COSName.KIDS);

        List<? extends COSBase> kidList = kidsArray.toList();
        for (COSBase kid : kidList) {
            if (kid instanceof COSObject) {
                COSObject kidObject = (COSObject) kid;
                COSBase type = kidObject.getDictionaryObject(COSName.TYPE);
                if (type == COSName.PAGE) {
                    COSBase kidPageBase = kidObject.getObject();
                    if (kidPageBase instanceof COSDictionary) {
                        COSDictionary kidPageDictionary = (COSDictionary) kidPageBase;
                        pages.add(new PDPage(kidPageDictionary));
                    }
                }
            }
        }
        int index = pages.indexOf(page);
        return pages.listIterator(index);
    }
}

Sample output

In this example the PDF document has 4 pages and the iterator is initialized at the first page. Note that the page number is previousIndex().

System.out.println("listIterator(PDPage)");
ListIterator<PDPage> pageIterator = listIterator(document.getPage(0));
while (pageIterator.hasNext()) {
    PDPage page = pageIterator.next();
    System.out.println("page #: " + pageIterator.previousIndex() + ", Structural Parent Key: " + page.getStructParents());
}

listIterator(PDPage)
page #: 0, Structural Parent Key: 68
page #: 1, Structural Parent Key: 69
page #: 2, Structural Parent Key: 70
page #: 3, Structural Parent Key: 71

You can also start from the last page and navigate backwards. Note that the page number is now nextIndex().

ListIterator<PDPage> pageIterator = listIterator(document.getPage(3));
pageIterator.next();
while (pageIterator.hasPrevious()) {
    PDPage page = pageIterator.previous();
    System.out.println("page #: " + pageIterator.nextIndex() + ", Structural Parent Key: " + page.getStructParents());
}

listIterator(PDPage)
page #: 3, Structural Parent Key: 71
page #: 2, Structural Parent Key: 70
page #: 1, Structural Parent Key: 69
page #: 0, Structural Parent Key: 68
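
The choice between previousIndex() and nextIndex() follows from how a ListIterator cursor sits between elements: after next() the element just returned is behind the cursor, and after previous() it is in front of it. The sketch below shows this with plain strings instead of PDPage objects, so it runs without PDFBox; `CursorDemo`, `forward`, and `backward` are hypothetical names.

```java
import java.util.Arrays;
import java.util.List;
import java.util.ListIterator;

// Hypothetical demo of ListIterator cursor positions, using strings in place of pages.
final class CursorDemo {

    // Walks forward; after next(), previousIndex() is the index of the element just returned.
    static String forward(List<String> items) {
        StringBuilder out = new StringBuilder();
        ListIterator<String> it = items.listIterator();
        while (it.hasNext()) {
            String item = it.next();
            out.append(it.previousIndex()).append('=').append(item).append(' ');
        }
        return out.toString().trim();
    }

    // Walks backward from the end; after previous(), nextIndex() is the index of the element just returned.
    static String backward(List<String> items) {
        StringBuilder out = new StringBuilder();
        ListIterator<String> it = items.listIterator(items.size());
        while (it.hasPrevious()) {
            String item = it.previous();
            out.append(it.nextIndex()).append('=').append(item).append(' ');
        }
        return out.toString().trim();
    }

    public static void main(String[] args) {
        List<String> pages = Arrays.asList("p0", "p1", "p2", "p3");
        System.out.println(forward(pages));  // 0=p0 1=p1 2=p2 3=p3
        System.out.println(backward(pages)); // 3=p3 2=p2 1=p1 0=p0
    }
}
```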