Fastest way to count PDF images using PDFBox 2.x
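We occasionally run into very large PDFs filled with full-page, high-resolution images (the result of document scanning). For example, I have a 1.7 GB PDF containing more than 3,500 images. Loading the document takes about 50 seconds, but counting the images takes about 15 minutes.
I'm sure this is because the image bytes are being read as part of the API calls. Is there a way to get the image count without actually reading the image bytes?
PDFBox version: 2.0.2
Sample code: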
    @Test
    public void imageCountIsCorrect() throws Exception {
        PDDocument pdf = readPdf();
        try {
            assertEquals(3558, countImages(pdf));
            // assertEquals(3558, countImagesWithExtractor(pdf));
        } finally {
            if (pdf != null) {
                pdf.close();
            }
        }
    }

    protected PDDocument readPdf() throws IOException {
        StopWatch stopWatch = new StopWatch();
        stopWatch.start();
        FileInputStream stream = new FileInputStream("large.pdf");
        PDDocument pdf;
        try {
            pdf = PDDocument.load(stream, MemoryUsageSetting.setupMixed(1024 * 1024 * 250));
        } finally {
            stream.close();
        }
        stopWatch.stop();
        log.info("PDF loaded: time={}s", stopWatch.getTime() / 1000);
        return pdf;
    }

    protected int countImages(PDDocument pdf) throws IOException {
        StopWatch stopWatch = new StopWatch();
        stopWatch.start();
        int imageCount = 0;
        for (PDPage pdPage : pdf.getPages()) {
            PDResources pdResources = pdPage.getResources();
            for (COSName cosName : pdResources.getXObjectNames()) {
                PDXObject xobject = pdResources.getXObject(cosName);
                if (xobject instanceof PDImageXObject) {
                    imageCount++;
                    if (imageCount % 100 == 0) {
                        log.info("Found image: #" + imageCount);
                    }
                }
            }
        }
        stopWatch.stop();
        log.info("Images counted: time={}s,imageCount={}", stopWatch.getTime() / 1000, imageCount);
        return imageCount;
    }
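If I change the countImages method to rely on the COSName instead, the count completes in under 1 second, but I'm uneasy about depending on the name prefix. It appears to be a byproduct of the PDF encoder rather than of PDFBox (I couldn't find any reference to it in their code):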
    if (cosName.getName().startsWith("QuickPDFIm")) {
        imageCount++;
    }
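To make the concern about the prefix concrete: in a PDF, resource names are arbitrary labels chosen by whatever tool wrote the file, while the XObject's /Subtype entry is what marks it as an image. The following is a minimal, library-free sketch of that difference. The `Map` stands in for a page's XObject resource dictionary, and the class and method names are hypothetical, not PDFBox API. In PDFBox 2.x itself, `PDResources.isImageXObject(COSName)` appears to perform the subtype check for you.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy model only: maps a resource name to the value of that XObject's /Subtype entry.
final class SubtypeCountSketch {

    // Counts entries whose /Subtype is Image, whatever the resource is called.
    static int countBySubtype(Map<String, String> xObjects) {
        int count = 0;
        for (String subtype : xObjects.values()) {
            if ("Image".equals(subtype)) {
                count++;
            }
        }
        return count;
    }

    // Counts entries whose resource name starts with a producer-specific prefix.
    static int countByPrefix(Map<String, String> xObjects, String prefix) {
        int count = 0;
        for (String name : xObjects.keySet()) {
            if (name.startsWith(prefix)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        Map<String, String> xObjects = new LinkedHashMap<>();
        xObjects.put("QuickPDFIm1", "Image");
        xObjects.put("Im0", "Image"); // same kind of object, named by a different producer
        xObjects.put("Fm0", "Form");  // a form XObject, not an image

        System.out.println(countByPrefix(xObjects, "QuickPDFIm")); // prints 1
        System.out.println(countBySubtype(xObjects));              // prints 2
    }
}
```

The prefix check is fast because it never touches the image data, but so is the subtype check, and only the latter keeps working when the file comes from a different producer.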
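So the previous approach had some additional flaws (it could miss inline images, for example). Thanks to mkl and Tilman Hausherr for the feedback!
TIL: PDF content streams contain useful operators!
My new approach extends PDFStreamEngine and increments an image count for every 'Do' (draw object) operator found in the PDF content stream whose operand names an image XObject. With this approach, counting the images takes only a few hundred milliseconds: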
    public class PdfImageCounter extends PDFStreamEngine {
        protected int documentImageCount = 0;

        public int getDocumentImageCount() {
            return documentImageCount;
        }

        public PdfImageCounter() {
            // Register a handler for the "Do" operator only.
            addOperator(new OperatorProcessor() {
                @Override
                public void process(Operator operator, List<COSBase> arguments) throws IOException {
                    if (arguments.size() < 1) {
                        throw new MissingOperandException(operator, arguments);
                    }
                    if (isImage(arguments.get(0))) {
                        documentImageCount++;
                    }
                }

                // True if the operand is a name that refers to an image XObject in the current resources.
                protected boolean isImage(COSBase base) {
                    return (base instanceof COSName) &&
                            context.getResources().isImageXObject((COSName) base);
                }

                @Override
                public String getName() {
                    return "Do";
                }
            });
        }
    }
Call it for each page:
    protected int countImagesWithProcessor(PDDocument pdf) throws IOException {
        StopWatch stopWatch = new StopWatch();
        stopWatch.start();
        PdfImageCounter counter = new PdfImageCounter();
        for (PDPage pdPage : pdf.getPages()) {
            counter.processPage(pdPage);
        }
        stopWatch.stop();
        int imageCount = counter.getDocumentImageCount();
        log.info("Images counted: time={}s,imageCount={}", stopWatch.getTime() / 1000, imageCount);
        return imageCount;
    }
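For readers who have not looked inside a content stream, here is a rough, library-free illustration of what counting 'Do' operators means. It is only a sketch under simplifying assumptions. The content stream is a hand-written string, the resources are a hypothetical name-to-subtype `Map`, and splitting on whitespace is not a real PDF parser (real streams contain strings, comments, and inline image data, which is why the approach above leaves parsing to PDFStreamEngine).

```java
import java.util.HashMap;
import java.util.Map;

// Toy model only: scans a simplified content stream for "/Name Do" pairs.
final class DoOperatorSketch {

    // Counts "Do" operators whose operand names an image in the given resources.
    static int countImageDraws(String contentStream, Map<String, String> subtypes) {
        String[] tokens = contentStream.trim().split("\\s+");
        int count = 0;
        for (int i = 1; i < tokens.length; i++) {
            // In PDF's postfix syntax the operand (a name such as /Im0) precedes the operator.
            if (tokens[i].equals("Do") && tokens[i - 1].startsWith("/")) {
                String name = tokens[i - 1].substring(1);
                if ("Image".equals(subtypes.get(name))) {
                    count++;
                }
            }
        }
        return count;
    }

    public static void main(String[] args) {
        Map<String, String> subtypes = new HashMap<>();
        subtypes.put("Im0", "Image");
        subtypes.put("Fm0", "Form");

        // Draws image Im0, then form Fm0, then image Im0 again.
        String stream = "q 612 0 0 792 0 0 cm /Im0 Do Q q /Fm0 Do Q q /Im0 Do Q";
        System.out.println(countImageDraws(stream, subtypes)); // prints 2
    }
}
```

Two things are visible in this sketch: the result is a count of draw operations, so an image drawn twice is counted twice, and an image referenced only from inside a form XObject is not reached unless that form's own content stream is scanned as well.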