Extract unselectable content from PDF
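I am using Apache PDFBox to extract pages from PDF files, but I cannot find a way to extract content (text or images) that is not selectable. Content that can be selected in the PDF file is extracted without problems.
Note that the PDFs in question have no restrictions on copying content, at least as far as I can see in the file's "Document Restrictions Summary": both "Content Copying" and "Content Copying for Accessibility" are allowed. Within the same PDF file, some content is selectable and other parts are not. As a result, the extracted pages come out with "holes", i.e. they contain only the selectable parts of the PDF. In MS Word, however, if I insert the PDF as an object, the full content of the PDF page appears. So I would like to do the same with PDFBox or any other Java library.
Here is the code I use to convert PDF pages to images: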
private void convertPdfToImage(File pdfFile, int pdfId) throws IOException {
    PDDocument document = PDDocument.loadNonSeq(pdfFile, null);
    List<PDPage> pdPages = document.getDocumentCatalog().getAllPages();
    for (PDPage pdPage : pdPages) {
        BufferedImage bim = pdPage.convertToImage(BufferedImage.TYPE_INT_RGB, 300);
        ImageIOUtil.writeImage(bim, TEMP_FILEPATH + pdfId + ".png", 300);
    }
    document.close();
}
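Is there any way to extract unselectable content from a PDF with Apache PDFBox (or any similar library)? Or is this simply not possible? And if it is not, why?
Thanks a lot for your help!
EDIT: I am using Adobe Reader as the PDF viewer and PDFBox v1.8. Here is a sample PDF: https://dl.dropboxusercontent.com/u/2815529/test.pdf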
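The two images in question, the fischer logo in the upper right and the small sketch a bit further down, are each drawn by filling a region of the page with a tiling pattern, which in turn draws the respective image in its own content stream.
Adobe Reader does not allow selecting pattern content, and automatic image extractors often do not walk the Pattern resource tree either.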
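To make the point about the resource tree concrete, here is a minimal sketch in plain Java. It does not use PDFBox: the `Resources` class, the helper methods, and the image labels `photo`, `logo` and `sketch` are hypothetical stand-ins chosen for illustration. It only shows why an extractor that looks at the page's direct image XObjects misses images that live inside pattern resources, while a recursive walk finds them.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class PatternWalkSketch {

    /** Hypothetical stand-in for a PDF resource dictionary (not a PDFBox class). */
    static final class Resources {
        final Map<String, String> images = new LinkedHashMap<>();      // image XObjects
        final Map<String, Resources> forms = new LinkedHashMap<>();    // form XObjects with their own resources
        final Map<String, Resources> patterns = new LinkedHashMap<>(); // tiling patterns with their own resources
    }

    /** What a flat extractor sees: only images referenced directly by the page. */
    static List<String> directImages(Resources page) {
        return new ArrayList<>(page.images.values());
    }

    /** Recursive walk that also descends into form XObjects and patterns. */
    static List<String> allImages(Resources resources) {
        List<String> found = new ArrayList<>(resources.images.values());
        for (Resources form : resources.forms.values()) {
            found.addAll(allImages(form));
        }
        for (Resources pattern : resources.patterns.values()) {
            found.addAll(allImages(pattern));
        }
        return found;
    }

    /** Builds a made-up page: one direct image plus two images hidden inside patterns. */
    static Resources samplePage() {
        Resources page = new Resources();
        page.images.put("Im0", "photo");
        Resources logoPattern = new Resources();
        logoPattern.images.put("R14", "logo");
        Resources sketchPattern = new Resources();
        sketchPattern.images.put("R37", "sketch");
        page.patterns.put("R15", logoPattern);
        page.patterns.put("R38", sketchPattern);
        return page;
    }

    public static void main(String[] args) {
        Resources page = samplePage();
        System.out.println(directImages(page)); // [photo]
        System.out.println(allImages(page));    // [photo, logo, sketch]
    }
}
```

The flat lookup returns only the directly referenced image, whereas the recursive walk also reaches the two images that are only referenced from pattern resources.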
PDFBox 1.8.10
You can build a pattern image extractor with PDFBox fairly easily, e.g. for PDFBox 1.8.10:
public void extractPatternImages(PDDocument document, String fileNameFormat) throws IOException
{
    List<PDPage> pages = document.getDocumentCatalog().getAllPages();
    if (pages == null)
        return;
    for (int i = 0; i < pages.size(); i++)
    {
        String pageFormat = String.format(fileNameFormat, "-" + i + "%s", "%s");
        extractPatternImages(pages.get(i), pageFormat);
    }
}
public void extractPatternImages(PDPage page, String pageFormat) throws IOException
{
    PDResources resources = page.getResources();
    if (resources == null)
        return;
    Map<String, PDPatternResources> patterns = resources.getPatterns();
    for (Map.Entry<String, PDPatternResources> patternEntry : patterns.entrySet())
    {
        String patternFormat = String.format(pageFormat, "-" + patternEntry.getKey() + "%s", "%s");
        extractPatternImages(patternEntry.getValue(), patternFormat);
    }
}
public void extractPatternImages(PDPatternResources pattern, String patternFormat) throws IOException
{
    COSDictionary resourcesDict = (COSDictionary) pattern.getCOSDictionary().getDictionaryObject(COSName.RESOURCES);
    if (resourcesDict == null)
        return;
    PDResources resources = new PDResources(resourcesDict);
    Map<String, PDXObject> xObjects = resources.getXObjects();
    if (xObjects == null)
        return;
    for (Map.Entry<String, PDXObject> entry : xObjects.entrySet())
    {
        PDXObject xObject = entry.getValue();
        String xObjectFormat = String.format(patternFormat, "-" + entry.getKey() + "%s", "%s");
        if (xObject instanceof PDXObjectForm)
            extractPatternImages((PDXObjectForm)xObject, xObjectFormat);
        else if (xObject instanceof PDXObjectImage)
            extractPatternImages((PDXObjectImage)xObject, xObjectFormat);
    }
}
public void extractPatternImages(PDXObjectForm form, String imageFormat) throws IOException
{
    PDResources resources = form.getResources();
    if (resources == null)
        return;
    Map<String, PDXObject> xObjects = resources.getXObjects();
    if (xObjects == null)
        return;
    for (Map.Entry<String, PDXObject> entry : xObjects.entrySet())
    {
        PDXObject xObject = entry.getValue();
        String xObjectFormat = String.format(imageFormat, "-" + entry.getKey() + "%s", "%s");
        if (xObject instanceof PDXObjectForm)
            extractPatternImages((PDXObjectForm)xObject, xObjectFormat);
        else if (xObject instanceof PDXObjectImage)
            extractPatternImages((PDXObjectImage)xObject, xObjectFormat);
    }
    Map<String, PDPatternResources> patterns = resources.getPatterns();
    for (Map.Entry<String, PDPatternResources> patternEntry : patterns.entrySet())
    {
        String patternFormat = String.format(imageFormat, "-" + patternEntry.getKey() + "%s", "%s");
        extractPatternImages(patternEntry.getValue(), patternFormat);
    }
}
public void extractPatternImages(PDXObjectImage image, String imageFormat) throws IOException
{
    // Close the file stream once the image has been written.
    try (FileOutputStream out = new FileOutputStream(String.format(imageFormat, "", image.getSuffix())))
    {
        image.write2OutputStream(out);
    }
}
I applied it to your sample PDF like this:
public void testtestDrJorge() throws IOException
{
    try (InputStream resource = getClass().getResourceAsStream("testDrJorge.pdf"))
    {
        PDDocument document = PDDocument.load(resource);
        extractPatternImages(document, "testDrJorge%s.%s");
        // Release the document once extraction is done.
        document.close();
    }
}
and got two images:
testDrJorge-0-R15-R14.png
testDrJorge-0-R38-R37.png
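These file names follow from the way each level of the extractor extends the format string: it fills the first placeholder with "-key" followed by a fresh `%s`, and passes a literal `%s` through the second placeholder so the file suffix can still be filled in at the end. The sketch below isolates that trick in plain Java; the class and the method names `extend` and `finish` are made up for illustration and do not appear in the code above.

```java
final class NameFormatSketch {

    /** Appends "-key" at the first placeholder and re-emits both placeholders for the next level. */
    static String extend(String format, String key) {
        return String.format(format, "-" + key + "%s", "%s");
    }

    /** Closes the format: no further key, and the actual file suffix. */
    static String finish(String format, String suffix) {
        return String.format(format, "", suffix);
    }

    public static void main(String[] args) {
        String pageFormat = extend("testDrJorge%s.%s", "0"); // testDrJorge-0%s.%s
        String patternFormat = extend(pageFormat, "R15");    // testDrJorge-0-R15%s.%s
        String imageFormat = extend(patternFormat, "R14");   // testDrJorge-0-R15-R14%s.%s
        System.out.println(finish(imageFormat, "png"));      // testDrJorge-0-R15-R14.png
    }
}
```

One caveat of this approach: a resource name containing a `%` character would be interpreted as a format specifier at the next level, so such names would need escaping first.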
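The images lost their red parts. This is most likely because PDFBox 1.x.x versions do not properly support the extraction of CMYK images, cf. PDFBOX-2128 (CMYK images are not supported correctly), and your images are CMYK.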
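For context, a CMYK image stores four ink components per pixel rather than three RGB channels, so an extractor has to convert the samples before writing a PNG. The sketch below uses the common naive conversion formula and ignores colour profiles entirely. It is an assumption-laden illustration, not what PDFBox does internally, and the class and method names are made up. It merely shows that red comes from the magenta and yellow components, so a decoder that mishandles the four channels can plausibly lose it.

```java
import java.util.Arrays;

final class CmykSketch {

    /** Naive device CMYK (components in 0..1) to 8-bit RGB; ignores ICC colour profiles. */
    static int[] toRgb(double c, double m, double y, double k) {
        int r = (int) Math.round(255 * (1 - c) * (1 - k));
        int g = (int) Math.round(255 * (1 - m) * (1 - k));
        int b = (int) Math.round(255 * (1 - y) * (1 - k));
        return new int[] { r, g, b };
    }

    public static void main(String[] args) {
        // Full magenta plus full yellow, no cyan, no black: pure red under this formula.
        System.out.println(Arrays.toString(toRgb(0, 1, 1, 0))); // [255, 0, 0]
    }
}
```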
PDFBox 2.0.0 release candidate
I updated the code to PDFBox 2.0.0 (currently available only as a release candidate):
public void extractPatternImages(PDDocument document, String fileNameFormat) throws IOException
{
    PDPageTree pages = document.getDocumentCatalog().getPages();
    if (pages == null)
        return;
    for (int i = 0; i < pages.getCount(); i++)
    {
        String pageFormat = String.format(fileNameFormat, "-" + i + "%s", "%s");
        extractPatternImages(pages.get(i), pageFormat);
    }
}
public void extractPatternImages(PDPage page, String pageFormat) throws IOException
{
    PDResources resources = page.getResources();
    if (resources == null)
        return;
    Iterable<COSName> patternNames = resources.getPatternNames();
    for (COSName patternName : patternNames)
    {
        String patternFormat = String.format(pageFormat, "-" + patternName + "%s", "%s");
        extractPatternImages(resources.getPattern(patternName), patternFormat);
    }
}
public void extractPatternImages(PDAbstractPattern pattern, String patternFormat) throws IOException
{
    COSDictionary resourcesDict = (COSDictionary) pattern.getCOSObject().getDictionaryObject(COSName.RESOURCES);
    if (resourcesDict == null)
        return;
    PDResources resources = new PDResources(resourcesDict);
    Iterable<COSName> xObjectNames = resources.getXObjectNames();
    if (xObjectNames == null)
        return;
    for (COSName xObjectName : xObjectNames)
    {
        PDXObject xObject = resources.getXObject(xObjectName);
        String xObjectFormat = String.format(patternFormat, "-" + xObjectName + "%s", "%s");
        if (xObject instanceof PDFormXObject)
            extractPatternImages((PDFormXObject)xObject, xObjectFormat);
        else if (xObject instanceof PDImageXObject)
            extractPatternImages((PDImageXObject)xObject, xObjectFormat);
    }
}
public void extractPatternImages(PDFormXObject form, String imageFormat) throws IOException
{
    PDResources resources = form.getResources();
    if (resources == null)
        return;
    Iterable<COSName> xObjectNames = resources.getXObjectNames();
    if (xObjectNames == null)
        return;
    for (COSName xObjectName : xObjectNames)
    {
        PDXObject xObject = resources.getXObject(xObjectName);
        String xObjectFormat = String.format(imageFormat, "-" + xObjectName + "%s", "%s");
        if (xObject instanceof PDFormXObject)
            extractPatternImages((PDFormXObject)xObject, xObjectFormat);
        else if (xObject instanceof PDImageXObject)
            extractPatternImages((PDImageXObject)xObject, xObjectFormat);
    }
    Iterable<COSName> patternNames = resources.getPatternNames();
    for (COSName patternName : patternNames)
    {
        String patternFormat = String.format(imageFormat, "-" + patternName + "%s", "%s");
        extractPatternImages(resources.getPattern(patternName), patternFormat);
    }
}
public void extractPatternImages(PDImageXObject image, String imageFormat) throws IOException
{
    String filename = String.format(imageFormat, "", image.getSuffix());
    // Close the file stream once the image has been written.
    try (FileOutputStream out = new FileOutputStream(filename))
    {
        ImageIOUtil.writeImage(image.getOpaqueImage(), "png", out);
    }
}
and got
testDrJorge-0-COSName{R15}-COSName{R14}.png
testDrJorge-0-COSName{R38}-COSName{R37}.png
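The `COSName{...}` parts of these file names come from concatenating the name object itself into the format string, which uses its `toString()` representation. If plain names are preferred, concatenating `COSName.getName()` instead should give names like those of the 1.8.10 version. The sketch below demonstrates the difference with a hypothetical stand-in class `Name` that mimics this rendering; it is not the PDFBox class, and the helper method names are made up.

```java
final class NameToStringSketch {

    /** Hypothetical stand-in that mimics how the name object renders itself; not the PDFBox class. */
    static final class Name {
        private final String name;

        Name(String name) {
            this.name = name;
        }

        String getName() {
            return name;
        }

        @Override
        public String toString() {
            return "COSName{" + name + "}";
        }
    }

    /** Concatenates the object itself, so toString() ends up in the file name. */
    static String viaToString(String format, Name key) {
        return String.format(format, "-" + key + "%s", "%s");
    }

    /** Concatenates only the plain name. */
    static String viaGetName(String format, Name key) {
        return String.format(format, "-" + key.getName() + "%s", "%s");
    }

    public static void main(String[] args) {
        Name r15 = new Name("R15");
        System.out.println(viaToString("testDrJorge-0%s.%s", r15)); // testDrJorge-0-COSName{R15}%s.%s
        System.out.println(viaGetName("testDrJorge-0%s.%s", r15));  // testDrJorge-0-R15%s.%s
    }
}
```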
Looks like progress... ;)