Extracting an embedded object from a PDF

Question

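I embedded a byte array into a PDF file (Java). Now I am trying to extract that same array. The array was embedded as a "MOVIE" file.

I couldn't find any clue on how to do that...

Any ideas?

Thanks!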

EDIT

I used this code to embed the byte array:

public static void pack(byte[] file) throws IOException, DocumentException{

    Document document = new Document();
    PdfWriter writer = PdfWriter.getInstance(document, new FileOutputStream(RESULT));
    writer.setPdfVersion(PdfWriter.PDF_VERSION_1_7);
    writer.addDeveloperExtension(PdfDeveloperExtension.ADOBE_1_7_EXTENSIONLEVEL3);

    document.open();
    RichMediaAnnotation richMedia = new RichMediaAnnotation(writer, new Rectangle(0,0,0,0));

    PdfFileSpecification fs
        = PdfFileSpecification.fileEmbedded(writer, null, "test.avi", file);
    PdfIndirectReference asset = richMedia.addAsset("test.avi", fs);
    RichMediaConfiguration configuration = new RichMediaConfiguration(PdfName.MOVIE);
    RichMediaInstance instance = new RichMediaInstance(PdfName.MOVIE);
    RichMediaParams flashVars = new RichMediaParams();
    instance.setAsset(asset);
    configuration.addInstance(instance);
    RichMediaActivation activation = new RichMediaActivation();
    richMedia.setActivation(activation);
    PdfAnnotation richMediaAnnotation = richMedia.createAnnotation();
    richMediaAnnotation.setFlags(PdfAnnotation.FLAGS_PRINT);
    writer.addAnnotation(richMediaAnnotation);
    document.close();
}

Answer 1

I have written a brute-force method that extracts all the streams in a PDF and stores them as files without an extension:

public static final String SRC = "resources/pdfs/image.pdf";
public static final String DEST = "results/parse/stream%s";

public static void main(String[] args) throws IOException {
    File file = new File(DEST);
    file.getParentFile().mkdirs();
    new ExtractStreams().parse(SRC, DEST);
}

public void parse(String src, String dest) throws IOException {
    PdfReader reader = new PdfReader(src);
    PdfObject obj;
    // Walk every entry in the cross-reference table
    for (int i = 1; i <= reader.getXrefSize(); i++) {
        obj = reader.getPdfObject(i);
        if (obj != null && obj.isStream()) {
            PRStream stream = (PRStream)obj;
            byte[] b;
            try {
                // Decoded bytes, for filters iText supports (e.g. /FlateDecode)
                b = PdfReader.getStreamBytes(stream);
            }
            catch(UnsupportedPdfException e) {
                // Fall back to the bytes exactly as stored in the file
                b = PdfReader.getStreamBytesRaw(stream);
            }
            // try-with-resources closes the file even if write() fails
            try (FileOutputStream fos = new FileOutputStream(String.format(dest, i))) {
                fos.write(b);
            }
        }
    }
    reader.close();
}

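Note that I get every PDF object that is a stream as a PRStream object. I also use two different methods:

When I use PdfReader.getStreamBytes(stream), iText looks at the stream's filter. For instance, page content streams consist of PDF syntax that is compressed using /FlateDecode. By using PdfReader.getStreamBytes(stream), you get the uncompressed PDF syntax.

Not all filters are supported in iText. Take /DCTDecode, for instance, which is the filter used to store JPEGs inside a PDF. Why and how would you "decode" such a stream? You wouldn't, and that is when we use PdfReader.getStreamBytesRaw(stream), which returns the bytes exactly as they are stored in the file. That is also the method you need to get your AVI bytes from the PDF when the stream holding them has no filter. If that stream itself carries a /FlateDecode filter (check this in RUPS), getStreamBytes is the one that gives back the original bytes; the try/catch above picks the right method either way.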
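To make the difference between "raw" and "decoded" stream bytes concrete, here is a minimal sketch that uses only the JDK, not iText. It relies on the fact that /FlateDecode data is zlib/deflate data, which java.util.zip can read and write. The class name FlateDemo and the helper names deflate and inflate are made up for this illustration; deflate plays the role of what a PDF writer stores, inflate the role of a decoding read.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical demo class; it only mimics what happens to a /FlateDecode stream.
class FlateDemo {

    // Compress bytes into zlib format, the way a /FlateDecode stream stores them.
    public static byte[] deflate(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[1024];
        while (!deflater.finished()) {
            int n = deflater.deflate(buf);
            out.write(buf, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    // Decompress zlib data, the analogue of a decoding read of the stream.
    public static byte[] inflate(byte[] raw) throws DataFormatException {
        Inflater inflater = new Inflater();
        inflater.setInput(raw);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[1024];
        while (!inflater.finished()) {
            int n = inflater.inflate(buf);
            if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                // Avoid looping forever on truncated input
                throw new DataFormatException("truncated or unsupported data");
            }
            out.write(buf, 0, n);
        }
        inflater.end();
        return out.toByteArray();
    }

    public static void main(String[] args) throws DataFormatException {
        byte[] original = "BT /F1 12 Tf (Hello) Tj ET".getBytes(StandardCharsets.US_ASCII);
        byte[] raw = deflate(original);   // what sits in the file
        byte[] decoded = inflate(raw);    // what a decoding read returns
        System.out.println("raw equals original:     " + Arrays.equals(raw, original));
        System.out.println("decoded equals original: " + Arrays.equals(decoded, original));
    }
}
```

The raw bytes differ from the original content, while the decoded bytes match it. For a stream without any filter, raw and decoded bytes are the same thing.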

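This example already gives you the methods you will definitely need to extract PDF streams. Now it is up to you to find the path to the stream you need. That calls for iText RUPS. With iText RUPS you can look at the internal structure of a PDF file. In your case, you need to find the annotations as is done in this question: All links of existing pdf change the action property to inherit zoom - iText library

You loop over the page dictionaries, then loop over the /Annots array of each page dictionary (if it is present). Instead of checking for /Link annotations (as was done in the question I mentioned), you check for /RichMedia annotations and from there examine the assets until you find the stream that contains the AVI file. RUPS will show you how to dive into the annotation dictionary.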
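The traversal described above could be sketched roughly as follows. This is an untested outline that assumes iText 5 (package com.itextpdf.text.pdf) and assumes the usual layout of a rich media annotation: /RichMediaContent holds an /Assets name tree that maps asset names to file specifications, each with an /EF dictionary whose /F entry is the embedded stream. The class name ExtractRichMediaAssets and the method extract are made up for this example, and the key names are built by hand, so verify the actual structure of your file in RUPS first.

```java
import com.itextpdf.text.pdf.PRStream;
import com.itextpdf.text.pdf.PdfArray;
import com.itextpdf.text.pdf.PdfDictionary;
import com.itextpdf.text.pdf.PdfName;
import com.itextpdf.text.pdf.PdfNameTree;
import com.itextpdf.text.pdf.PdfObject;
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.PdfStream;
import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper class, not part of iText.
class ExtractRichMediaAssets {

    // Key names as they are assumed to appear in the PDF; check them in RUPS.
    static final PdfName RICH_MEDIA = new PdfName("RichMedia");
    static final PdfName RICH_MEDIA_CONTENT = new PdfName("RichMediaContent");
    static final PdfName ASSETS = new PdfName("Assets");

    // Returns asset name -> stream bytes for every /RichMedia annotation found.
    public static Map<String, byte[]> extract(PdfReader reader) throws IOException {
        Map<String, byte[]> result = new LinkedHashMap<>();
        for (int p = 1; p <= reader.getNumberOfPages(); p++) {
            PdfArray annots = reader.getPageN(p).getAsArray(PdfName.ANNOTS);
            if (annots == null) continue;               // page without annotations
            for (int i = 0; i < annots.size(); i++) {
                PdfDictionary annot = annots.getAsDict(i);
                if (annot == null
                        || !RICH_MEDIA.equals(annot.getAsName(PdfName.SUBTYPE))) continue;
                PdfDictionary content = annot.getAsDict(RICH_MEDIA_CONTENT);
                if (content == null) continue;
                PdfDictionary assets = content.getAsDict(ASSETS);
                if (assets == null) continue;
                // /Assets is a name tree: asset name -> file specification
                for (Map.Entry<String, PdfObject> e
                        : PdfNameTree.readTree(assets).entrySet()) {
                    PdfObject spec = PdfReader.getPdfObject(e.getValue());
                    if (spec == null || !spec.isDictionary()) continue;
                    PdfDictionary ef = ((PdfDictionary) spec).getAsDict(PdfName.EF);
                    if (ef == null) continue;
                    PdfStream stream = ef.getAsStream(PdfName.F);
                    if (stream instanceof PRStream) {
                        // Decoding read; assumes a filter iText supports (or none)
                        result.put(e.getKey(),
                                PdfReader.getStreamBytes((PRStream) stream));
                    }
                }
            }
        }
        return result;
    }
}
```

A call such as ExtractRichMediaAssets.extract(new PdfReader(src)).get("test.avi") would then be expected to return the embedded bytes, provided the asset was registered under that name.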

java

pdf

itext

pdfbox