从投资组合中提取文件夹 pdf java

extract folders from portfolio pdf java

我有一个包含文件夹、子文件夹和文件的作品集 pdf。我需要在 java 中使用 iText 提取与文件夹、子文件夹和文件相同的结构。我只得到带有 EMBEDEDFILES 的文件。什么是获取文件夹的方式。

请找到我正在使用的代码。此代码仅提供文件夹中的文件。

public static void extractAttachments(String src, String dir) throws         IOException
{
    File folder = new File(dir);
    folder.mkdirs();

    PdfReader reader = new PdfReader(src);

    PdfDictionary root = reader.getCatalog();

    PdfDictionary names = root.getAsDict(PdfName.NAMES);
    System.out.println(""+names.getKeys().toString());
    PdfDictionary embedded = names.getAsDict(PdfName.EMBEDDEDFILES);
    System.out.println(""+embedded.toString());

    PdfArray filespecs = embedded.getAsArray(PdfName.NAMES);

    System.out.println(filespecs.getAsString(root1));
    for (int i = 0; i < filespecs.size();)
    {
        extractAttachment(reader, folder, filespecs.getAsString(i++),
                filespecs.getAsDict(i++));
    }
}

protected static void extractAttachment(PdfReader reader, File dir, PdfString name, PdfDictionary filespec)
        throws IOException
{
    PRStream stream;
    FileOutputStream fos;
    String filename;
    PdfArray parent;
    PdfDictionary refs = filespec.getAsDict(PdfName.EF);
    //System.out.println(""+refs.getKeys().toString());

    for (Object key : refs.getKeys())
    {
        stream = (PRStream)         PdfReader.getPdfObject(refs.getAsIndirectObject((PdfName) key));

        filename = filespec.getAsString((PdfName) key).toString();

        // System.out.println("" + filename);
        fos = new FileOutputStream(new File(dir, filename));
        fos.write(PdfReader.getStreamBytes(stream));
        fos.flush();
        fos.close();
    }
}

OP 在提取投资组合文件时尝试复制的文件夹结构在 Adobe® Supplement to the ISO 32000, BaseVersion: 1.7, ExtensionLevel: 3 中指定。因此,它 不是 当前 PDF 标准的一部分,因此不需要 PDF 处理软件来理解此类信息。不过,它看起来像是计划添加到即将发布的 PDF-2 (ISO 32000-2) 标准中。

因此,要将投资组合文件提取到关联的文件夹结构中,我们必须检索 Adob​​e® 补充中指定的文件夹信息:

Beginning with extension level 3, a portable collection can contain a Folders object for the purpose of organizing files into a hierarchical structure. The structure is represented by a tree with a single root folder acting as the common ancestor for all other folders and files in the collection. The single root folder is referenced in the Folders entry of Table 8.6 on page 29.

Table 8.6c describes the entries in a folder dictionary

  • ID integer (Required; ExtensionLevel 3) A non-negative integer value representing the unique folder identification number. Two folders shall not share the same ID value.

    The folder ID value appears as part of the name tree key of any file associated with this folder. A detailed description of the association between folder and files can be found after this table.

  • Name text string (Required; ExtensionLevel 3) A file name representing the name of the folder. Two sibling folders shall not share the same name following case normalization.

  • Child dictionary (Required if the folder has any descendents; ExtensionLevel 3) An indirect reference to the first child folder of this folder.

  • Next dictionary (Required for all but the last item at each level; ExtensionLevel 3) An indirect reference to the next sibling folder at this level.

(section 8.2.4 Collections)

例如像这样:

static Map<Integer, File> retrieveFolders(PdfReader reader, File baseDir) throws DocumentException
{
    Map<Integer, File> result = new HashMap<Integer, File>();

    PdfDictionary root = reader.getCatalog();
    PdfDictionary collection = root.getAsDict(PdfName.COLLECTION);
    if (collection == null)
        throw new DocumentException("Document has no Collection dictionary");
    PdfDictionary folders = collection.getAsDict(FOLDERS);
    if (folders == null)
        throw new DocumentException("Document collection has no folders dictionary");

    collectFolders(result, folders, baseDir);

    return result;
}

static void collectFolders(Map<Integer, File> collection, PdfDictionary folder, File baseDir)
{
    PdfString name = folder.getAsString(PdfName.NAME);
    File folderDir = new File(baseDir, name.toString());
    folderDir.mkdirs();
    PdfNumber id = folder.getAsNumber(PdfName.ID);
    collection.put(id.intValue(), folderDir);

    PdfDictionary next = folder.getAsDict(PdfName.NEXT);
    if (next != null)
        collectFolders(collection, next, baseDir);
    PdfDictionary child = folder.getAsDict(CHILD);
    if (child != null)
        collectFolders(collection, child, folderDir);
}

final static PdfName FOLDERS = new PdfName("Folders");
final static PdfName CHILD = new PdfName("Child");

(摘自PortfolioFileExtraction.java

并在写入文件时使用这些检索到的文件夹信息。

文件和文件夹的关联在 Adob​​e® 补充中指定如下:

As previously mentioned, files in the EmbeddedFiles name tree are associated with folders by a special naming convention applied to the name tree key strings. Strings that conform to the following rules serve to associate the corresponding file with a folder:

  • The name tree keys are PDF text strings.
  • The first character, excluding any byte order marker, is U+003C, the LESS-THAN SIGN (<).
  • The following characters shall one or more digits (0 to 9) followed by the closing U+003E, the GREATER-THAN SIGN (>)
  • The remainder of the string is a file name.

The section of the string enclosed by LESS-THAN SIGN GREATER-THAN SIGN(<>) is interpreted as a numeric value that specifies the ID value of the folder with which the file is associated. The value shall correspond to a folder ID. The section of the string following the folder ID tag represents the file name of the embedded file.

Files in the EmbeddedFiles name tree that do not conform to these rules shall be treated as associated with the root folder.

(section 8.2.4 Collections)

您的方法可以像这样扩展:

public static void extractAttachmentsWithFolders(PdfReader reader, String dir) throws IOException, DocumentException
{
    File folder = new File(dir);
    folder.mkdirs();

    Map<Integer, File> folders = retrieveFolders(reader, folder);

    PdfDictionary root = reader.getCatalog();

    PdfDictionary names = root.getAsDict(PdfName.NAMES);
    System.out.println("" + names.getKeys().toString());
    PdfDictionary embedded = names.getAsDict(PdfName.EMBEDDEDFILES);
    System.out.println("" + embedded.toString());

    PdfArray filespecs = embedded.getAsArray(PdfName.NAMES);

    for (int i = 0; i < filespecs.size();)
    {
        extractAttachment(reader, folders, folder, filespecs.getAsString(i++), filespecs.getAsDict(i++));
    }
}

protected static void extractAttachment(PdfReader reader, Map<Integer, File> dirs, File dir, PdfString name, PdfDictionary filespec) throws IOException
{
    PRStream stream;
    FileOutputStream fos;
    String filename;
    PdfDictionary refs = filespec.getAsDict(PdfName.EF);

    File dirHere = dir;
    String nameString = name.toUnicodeString();
    if (nameString.startsWith("<"))
    {
        int closing = nameString.indexOf('>');
        if (closing > 0)
        {
            int folderId = Integer.parseInt(nameString.substring(1, closing));
            File folderFile = dirs.get(folderId);
            if (folderFile != null)
                dirHere = folderFile;
        }
    }

    for (PdfName key : refs.getKeys())
    {
        stream = (PRStream) PdfReader.getPdfObject(refs.getAsIndirectObject(key));

        filename = filespec.getAsString(key).toString();

        fos = new FileOutputStream(new File(dirHere, filename));
        fos.write(PdfReader.getStreamBytes(stream));
        fos.flush();
        fos.close();
    }
}

(摘自PortfolioFileExtraction.java

将这些方法应用到您的示例 PDF(例如,使用 PortfolioFileExtraction.java 中的测试方法 testSamplePortfolio11Folders)得到

Root
│   ThumbImpression.pdf
│
├───Folder 1
│   │   EStampPdf.pdf
│   │   Presentation.pdf
│   │
│   ├───Folder 11
│   │   │   Test.pdf
│   │   │
│   │   └───Folder 111
│   └───Folder 12
└───Folder 2
        SealDeed.pdf