从 PDF 中提取不可选择的内容

Question

我正在使用 Apache PDFBox 从 PDF 文件中提取页面，但我找不到提取不可选择的内容（文本或图像）的方法。对于可从 PDF 文件中选择的内容，没有问题。

请注意，有问题的 PDF 对复制内容没有任何限制，至少从我在文件 "Document Restrictions Summary" 上看到的情况来看是这样：它们都有 "Content Copying" 和 "Content Copying for Accessbility" allowed! 在同一个 PDF 文件中，有些内容是可选的，而其他部分则不是。发生的情况是，提取的页面带有 "holes"，即它们只有 PDF 的可选部分。但是在 MS Word 上，如果我将 PDF 添加为对象，PDF 页面的全部内容就会出现！所以我希望对 PDFBox 库或任何其他 Java 库做同样的事情！

这是我用来将 PDF 页面转换为图像的代码：

private void convertPdfToImage(File pdfFile, int pdfId) throws IOException {
   PDDocument document = PDDocument.loadNonSeq(pdfFile, null);
   List<PDPage> pdPages = document.getDocumentCatalog().getAllPages();
   for (PDPage pdPage : pdPages) { 
       BufferedImage bim = pdPage.convertToImage(BufferedImage.TYPE_INT_RGB, 300);
       ImageIOUtil.writeImage(bim, TEMP_FILEPATH + pdfId + ".png", 300);
   }
   document.close();
}

有没有办法使用此 Apache PDFBox 库（或任何其他类似库）从 PDF 中提取不可选择的内容？或者这根本不可能？如果确实不是，为什么？

非常感谢您的帮助！

编辑：我使用 Adobe Reader 作为 PDF 查看器和 PDFBox v1.8。这是一个示例 PDF：https://dl.dropboxusercontent.com/u/2815529/test.pdf

Answer 1

有问题的两个图像，右上角的 fischer 徽标和稍微向下的小草图，每个图像都是通过用平铺模式填充页面上的一个区域绘制的，而平铺模式又在其内容流中绘制各自的图片。

Adobe Reader 不允许 select 模式内容，并且自动图像提取器通常也不会遍历模式资源树。

PDFBox 1.8.10

您可以使用 PDFBox 相当轻松地构建图案图像提取器，例如对于 PDFBox 1.8.10:

public void extractPatternImages(PDDocument document, String fileNameFormat) throws IOException
{
    List<PDPage> pages = document.getDocumentCatalog().getAllPages();
    if (pages == null)
        return;

    for (int i = 0; i < pages.size(); i++)
    {
        String pageFormat = String.format(fileNameFormat, "-" + i + "%s", "%s");
        extractPatternImages(pages.get(i), pageFormat);
    }
}

public void extractPatternImages(PDPage page, String pageFormat) throws IOException
{
    PDResources resources = page.getResources();
    if (resources == null)
        return;
    Map<String, PDPatternResources> patterns = resources.getPatterns();

    for (Map.Entry<String, PDPatternResources> patternEntry : patterns.entrySet())
    {
        String patternFormat = String.format(pageFormat, "-" + patternEntry.getKey() + "%s", "%s");
        extractPatternImages(patternEntry.getValue(), patternFormat);
    }
}

public void extractPatternImages(PDPatternResources pattern, String patternFormat) throws IOException
{
    COSDictionary resourcesDict = (COSDictionary) pattern.getCOSDictionary().getDictionaryObject(COSName.RESOURCES);
    if (resourcesDict == null)
        return;
    PDResources resources = new PDResources(resourcesDict);
    Map<String, PDXObject> xObjects = resources.getXObjects();
    if (xObjects == null)
        return;

    for (Map.Entry<String, PDXObject> entry : xObjects.entrySet())
    {
        PDXObject xObject = entry.getValue();
        String xObjectFormat = String.format(patternFormat, "-" + entry.getKey() + "%s", "%s");
        if (xObject instanceof PDXObjectForm)
            extractPatternImages((PDXObjectForm)xObject, xObjectFormat);
        else if (xObject instanceof PDXObjectImage)
            extractPatternImages((PDXObjectImage)xObject, xObjectFormat);
    }
}

public void extractPatternImages(PDXObjectForm form, String imageFormat) throws IOException
{
    PDResources resources = form.getResources();
    if (resources == null)
        return;
    Map<String, PDXObject> xObjects = resources.getXObjects();
    if (xObjects == null)
        return;

    for (Map.Entry<String, PDXObject> entry : xObjects.entrySet())
    {
        PDXObject xObject = entry.getValue();
        String xObjectFormat = String.format(imageFormat, "-" + entry.getKey() + "%s", "%s");
        if (xObject instanceof PDXObjectForm)
            extractPatternImages((PDXObjectForm)xObject, xObjectFormat);
        else if (xObject instanceof PDXObjectImage)
            extractPatternImages((PDXObjectImage)xObject, xObjectFormat);
    }

    Map<String, PDPatternResources> patterns = resources.getPatterns();

    for (Map.Entry<String, PDPatternResources> patternEntry : patterns.entrySet())
    {
        String patternFormat = String.format(imageFormat, "-" + patternEntry.getKey() + "%s", "%s");
        extractPatternImages(patternEntry.getValue(), patternFormat);
    }
}

public void extractPatternImages(PDXObjectImage image, String imageFormat) throws IOException
{
    image.write2OutputStream(new FileOutputStream(String.format(imageFormat, "", image.getSuffix())));
}

(ExtractPatternImages.java)

我像这样将它应用于您的示例 PDF

public void testtestDrJorge() throws IOException
{
    try (InputStream resource = getClass().getResourceAsStream("testDrJorge.pdf"))
    {
        PDDocument document = PDDocument.load(resource);
        extractPatternImages(document, "testDrJorge%s.%s");;
    }
}

(ExtractPatternImages.java)

并得到两张图片：

`testDrJorge-0-R15-R14.png
testDrJorge-0-R38-R37.png

图像失去了红色部分。这很可能是由于 PDFBox 版本 1.x.x 不正确支持 CMYK 图像的提取，参见。 PDFBOX-2128 (CMYK images are not supported correctly)，你的图片是 CMYK。

PDFBox 2.0.0 候选发布

我将代码更新为 PDFBox 2.0.0（目前仅作为候选发布版提供）：

public void extractPatternImages(PDDocument document, String fileNameFormat) throws IOException
{
    PDPageTree pages = document.getDocumentCatalog().getPages();
    if (pages == null)
        return;

    for (int i = 0; i < pages.getCount(); i++)
    {
        String pageFormat = String.format(fileNameFormat, "-" + i + "%s", "%s");
        extractPatternImages(pages.get(i), pageFormat);
    }
}

public void extractPatternImages(PDPage page, String pageFormat) throws IOException
{
    PDResources resources = page.getResources();
    if (resources == null)
        return;
    Iterable<COSName> patternNames = resources.getPatternNames();

    for (COSName patternName : patternNames)
    {
        String patternFormat = String.format(pageFormat, "-" + patternName + "%s", "%s");
        extractPatternImages(resources.getPattern(patternName), patternFormat);
    }
}

public void extractPatternImages(PDAbstractPattern pattern, String patternFormat) throws IOException
{
    COSDictionary resourcesDict = (COSDictionary) pattern.getCOSObject().getDictionaryObject(COSName.RESOURCES);
    if (resourcesDict == null)
        return;
    PDResources resources = new PDResources(resourcesDict);
    Iterable<COSName> xObjectNames = resources.getXObjectNames();
    if (xObjectNames == null)
        return;

    for (COSName xObjectName : xObjectNames)
    {
        PDXObject xObject = resources.getXObject(xObjectName);
        String xObjectFormat = String.format(patternFormat, "-" + xObjectName + "%s", "%s");
        if (xObject instanceof PDFormXObject)
            extractPatternImages((PDFormXObject)xObject, xObjectFormat);
        else if (xObject instanceof PDImageXObject)
            extractPatternImages((PDImageXObject)xObject, xObjectFormat);
    }
}

public void extractPatternImages(PDFormXObject form, String imageFormat) throws IOException
{
    PDResources resources = form.getResources();
    if (resources == null)
        return;
    Iterable<COSName> xObjectNames = resources.getXObjectNames();
    if (xObjectNames == null)
        return;

    for (COSName xObjectName : xObjectNames)
    {
        PDXObject xObject = resources.getXObject(xObjectName);
        String xObjectFormat = String.format(imageFormat, "-" + xObjectName + "%s", "%s");
        if (xObject instanceof PDFormXObject)
            extractPatternImages((PDFormXObject)xObject, xObjectFormat);
        else if (xObject instanceof PDImageXObject)
            extractPatternImages((PDImageXObject)xObject, xObjectFormat);
    }

    Iterable<COSName> patternNames = resources.getPatternNames();

    for (COSName patternName : patternNames)
    {
        String patternFormat = String.format(imageFormat, "-" + patternName + "%s", "%s");
        extractPatternImages(resources.getPattern(patternName), patternFormat);
    }
}

public void extractPatternImages(PDImageXObject image, String imageFormat) throws IOException
{
    String filename = String.format(imageFormat, "", image.getSuffix());
    ImageIOUtil.writeImage(image.getOpaqueImage(), "png", new FileOutputStream(filename));
}

并获得

testDrJorge-0-COSName{R15}-COSName{R14}.png
testDrJorge-0-COSName{R38}-COSName{R37}.png

看起来有进步... ;)

从 PDF 中提取不可选择的内容

Extract unselectable content from PDF

java

pdf

pdfbox

PDFBox 1.8.10

PDFBox 2.0.0 候选发布