Java PDFBox: list all named destinations of a page

Question

For my Java project, I need to list all named destinations of a PDF page.

The PDF and its named destinations are created with LaTeX (using the \hypertarget command), for example:

\documentclass[12pt]{article}
\usepackage{hyperref} 

\begin{document}

\hypertarget{myImportantString}{}   % the anchor/named destination to be extracted "myImportantString"

Empty example page

\end{document}

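How can I extract all named destinations of a specific page of this PDF document with the PDFBox library, version 2.0.11?

I could not find any working code for this problem on the Internet or in the PDFBox examples. This is my current (minimized) code: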

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotation;

import java.io.File;
import java.util.List;

public class ExtractNamedDests {

    public static void main(String[] args) {

        // try-with-resources closes the document when done
        try (PDDocument document = PDDocument.load(new File("<path to PDF file>"))) {

            int c = 1; // 1-based page counter, used only for the output

            for (PDPage page : document.getPages()) {
                System.out.println("Page " + c + ":");

                // named destinations do not seem to be annotations, since this list is always empty:
                List<PDAnnotation> annotations = page.getAnnotations();
                System.out.println("    Count annotations: " + annotations.size());

                // How to extract named destinations??

                c++;
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

In this example, I want to extract the string "myImportantString" from the page in Java.

Edit: Here is the example PDF file. I am using PDFBox version 2.0.11.

Answer 1

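With the great help of Tilman Hausherr, I found a solution. It uses the code he suggested in his comments.

The method getAllNamedDestinations() returns a map of all named destinations in the document (they are not annotations), with the name as key and the destination as value. Named destinations can be nested deeply in the document's name tree, so the method traverseKids() recursively collects all nested named destinations.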

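import org.apache.pdfbox.pdmodel.PDDestinationNameTreeNode;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDDocumentCatalog;
import org.apache.pdfbox.pdmodel.PDDocumentNameDictionary;
import org.apache.pdfbox.pdmodel.common.PDNameTreeNode;
import org.apache.pdfbox.pdmodel.interactive.documentnavigation.destination.PDPageDestination;

import java.util.HashMap;
import java.util.List;
import java.util.Map;

public static Map<String, PDPageDestination> getAllNamedDestinations(PDDocument document) {

    Map<String, PDPageDestination> namedDestinations = new HashMap<>(10);

    // get catalog
    PDDocumentCatalog documentCatalog = document.getDocumentCatalog();

    PDDocumentNameDictionary names = documentCatalog.getNames();

    // no name dictionary: the document has no named destinations of this kind
    if (names == null)
        return namedDestinations;

    PDDestinationNameTreeNode dests = names.getDests();

    // the name dictionary may exist without a Dests name tree
    if (dests == null)
        return namedDestinations;

    // names stored directly in the root node
    try {
        if (dests.getNames() != null)
            namedDestinations.putAll(dests.getNames());
    } catch (Exception e) { e.printStackTrace(); }

    // names stored in nested child nodes
    List<PDNameTreeNode<PDPageDestination>> kids = dests.getKids();

    traverseKids(kids, namedDestinations);

    return namedDestinations;
}

private static void traverseKids(List<PDNameTreeNode<PDPageDestination>> kids, Map<String, PDPageDestination> namedDestinations) {

    if (kids == null)
        return;

    try {
        for (PDNameTreeNode<PDPageDestination> kid : kids) {
            if (kid.getNames() != null) {
                try {
                    // putAll() silently overwrites duplicate names; the catch handles read errors from getNames()
                    namedDestinations.putAll(kid.getNames());
                } catch (Exception e) { System.out.println("INFO: Could not read the names of a name tree node."); e.printStackTrace(); }
            }

            // descend into the next level of the name tree
            if (kid.getKids() != null)
                traverseKids(kid.getKids(), namedDestinations);
        }

    } catch (Exception e) {
        e.printStackTrace();
    }
}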
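
The map above covers the whole document, while the question asks for the destinations of a specific page. A minimal sketch of one way to group the names by page follows. It assumes each destination can be mapped to a page index by a callback; the class DestinationsByPage, the method groupByPage and the parameter pageIndexOf are hypothetical names for illustration and are not part of PDFBox.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.ToIntFunction;

// Hypothetical helper, not part of PDFBox.
class DestinationsByPage {

    /**
     * Groups destination names by the page index they point to.
     * pageIndexOf is an assumed callback that maps a destination to a page
     * index; null destinations and negative indices are skipped.
     */
    static <D> Map<Integer, List<String>> groupByPage(
            Map<String, D> destinations, ToIntFunction<D> pageIndexOf) {

        // TreeMap keeps the page indices in ascending order
        Map<Integer, List<String>> byPage = new TreeMap<>();

        for (Map.Entry<String, D> entry : destinations.entrySet()) {
            D destination = entry.getValue();
            if (destination == null) {
                continue; // name without a usable destination
            }
            int pageIndex = pageIndexOf.applyAsInt(destination);
            if (pageIndex < 0) {
                continue; // destination does not resolve to a page
            }
            byPage.computeIfAbsent(pageIndex, k -> new ArrayList<>())
                  .add(entry.getKey());
        }
        return byPage;
    }
}
```

With PDFBox 2.0.x this could presumably be called as groupByPage(getAllNamedDestinations(document), PDPageDestination::retrievePageNumber). As far as I know, retrievePageNumber() returns the zero-based page index, or -1 if the page cannot be determined, so the names on the first page would end up under key 0.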
