PDFBox 2.0.3/Java 7 - OOM Error when importing page from one PDF to another

Question

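I have some code that looks at each page in a large PDF (20,000+ pages) and, if the page contains a particular string, imports that page into another PDF.

Because of the number of occurrences, the PDF being imported into ends up almost as big as the source PDF. When it gets too big, it blows up with the following exception: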

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf (Unknown Source)
at java.io.ByteArrayOutputStream.toByteArray (Unknown Source)
at org.apache.pdfbox.cos.COSOutputStream.close(COSOutputStream.java:87)
at java.io.FilterOutputStream.close(Unknown Source)
at org.apache.pdfbox.cos.COSStream.close(COSStream.java:223)
at org.apache.pdfbox.pdmodel.common.PDStream.<init>(PDStream.java:138)
at org.apache.pdfbox.pdmodel.common.PDStream.<init>(PDStream.java:104)
at org.apache.pdfbox.pdmodel.PDDocument.importPage(PDDocument.java:562)
at ExtractPage.extractString(ExtractPage.java:57)
at RunApp.run(RunApp.java:15)

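I have researched this, and streaming through temp files looks like it should solve my problem. However, I just can't work out how to implement it in my code.

I do have a workaround where I batch the pages into separate files and then merge them using the method mentioned here. However, avoiding that would surely be more efficient and cleaner.

Please see a summary of my code below: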

File sourceFile = new File("C:\\Temp\\extractFROM.pdf");
// Both documents are set up to buffer in temp files rather than in main memory
PDDocument sourceDocument = PDDocument.load(sourceFile, MemoryUsageSetting.setupTempFileOnly());
PDPageTree sourcePageTree = sourceDocument.getDocumentCatalog().getPages();
PDDocument tempDocument = new PDDocument(MemoryUsageSetting.setupTempFileOnly());

for (PDPage page : sourcePageTree) {
    // Pseudocode: extract the page text and check whether it contains the string;
    // pageContainsString stands in for the result of that check
    if (pageContainsString) {
        tempDocument.importPage(page);
    }
}

tempDocument.save(sourceFile);

After exporting around 7,000 pages, it blows up on the tempDocument.importPage(page) line. It works fine for PDFs below that number.

Can anyone help?

Answer 1

A program that runs into an OutOfMemoryError may have a memory leak, or it may simply need more memory to work properly.

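Thus, one change to try in such a situation is simply to increase the memory allotted to the program. If the program then runs without issue, you can consider that a fix, as long as the allotted memory does not become completely unreasonable, that is.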
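To judge how much memory is reasonable, it can help to see how heap usage grows as pages are processed. The following is only a rough sketch using the standard java.lang.Runtime API; HeapSampler, usedHeapBytes and shouldSample are hypothetical names, not part of PDFBox, and the loop is a stand-in for the real page-import loop. The numbers alone will not tell a leak apart from a legitimate need, but they show how usage scales with the page count.

```java
// Hypothetical helper, not part of PDFBox: reports heap usage via java.lang.Runtime.
class HeapSampler {

    /** Bytes currently in use on the heap (allocated heap minus free space in it). */
    static long usedHeapBytes() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    /** True when a sample should be logged, i.e. on every 'interval'-th page. */
    static boolean shouldSample(int pageIndex, int interval) {
        return interval > 0 && pageIndex % interval == 0;
    }

    public static void main(String[] args) {
        // Stand-in loop; in the real program this would be the loop over the page tree.
        for (int pageIndex = 1; pageIndex <= 3000; pageIndex++) {
            if (shouldSample(pageIndex, 1000)) {
                long usedMb = usedHeapBytes() / (1024L * 1024L);
                System.out.println("page " + pageIndex + ": used heap = " + usedMb + " MB");
            }
        }
    }
}
```

In the actual program, the same two calls could be placed inside the loop next to importPage to log a sample every thousand pages or so.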

Increasing the heap appears to have been enough here, as the OP confirmed:

I have increased the heap in the run configuration to 670 MB (the maximum I can secure with my client equipment) and this has successfully resolved the issue. In fact, I tried it on a PDF twice the size of the original failing PDF, and it easily managed this as well.
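For anyone wanting to confirm that a heap setting like this has actually taken effect, here is a minimal sketch that prints the maximum heap the JVM reports. HeapCheck and its methods are hypothetical names; the only library call used is the standard Runtime.maxMemory().

```java
// Hypothetical helper: prints the maximum heap size the JVM will attempt to use.
class HeapCheck {

    /** Converts a byte count to whole megabytes (1 MB = 1024 * 1024 bytes). */
    static long bytesToMegabytes(long bytes) {
        return bytes / (1024L * 1024L);
    }

    /** Maximum heap, in MB, as reported by the running JVM. */
    static long maxHeapMegabytes() {
        return bytesToMegabytes(Runtime.getRuntime().maxMemory());
    }

    public static void main(String[] args) {
        System.out.println("Max heap: " + maxHeapMegabytes() + " MB");
    }
}
```

Launched with the standard heap option, for example java -Xmx670m HeapCheck, it should report a value close to 670 MB; the exact figure can be a little lower depending on the JVM and garbage collector in use.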
