Apache PDFBox - getting java.lang.OutOfMemoryError when using split(PDDocument document)

Question

I am trying to split a 300-page document using the Apache PDFBox API v2.0.2. When I try to split the PDF file into single pages using the following code:

        PDDocument document = PDDocument.load(inputFile);
        Splitter splitter = new Splitter();
        List<PDDocument> splittedDocuments = splitter.split(document); //Exception happens here

I get the following exception:

Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded

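This indicates that the GC is spending a lot of time cleaning up the heap, which is not justified by the amount of memory it reclaims.

There are many JVM tuning options that can work around this, but they all treat the symptom rather than the underlying cause.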

One final note: I am using JDK 6, so I cannot use the new Java 8 Consumer in my case. Thanks.

Edit:

This is not a duplicate of http://stackoverflow.com/questions/37771252/splitting-a-pdf-results-in-very-large-pdf-documents-with-pdfbox-2-0-2, because:

 1. I do not have the size problem mentioned in the aforementioned
    topic. I am slicing a 270-page, 13.8MB PDF file, and after slicing
    each slice is 80KB on average, with a total size of
    30.7MB.
 2. The split throws the exception before it even returns the split parts.

I found that the split succeeds as long as I do not pass the whole document at once. If I instead pass it in "batches" of 20-30 pages each, it works.

Answer 1

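PDFBox keeps the parts produced by the split on the heap as objects of type PDDocument, which fills the heap very quickly. Even if you call close() after every round of the loop, the GC cannot reclaim heap space as fast as it is being filled.

One option is to break the split operation into batches, where each batch is a relatively manageable chunk (10 to 40 pages):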

import java.io.File;
import java.io.IOException;
import java.util.List;

import org.apache.pdfbox.multipdf.Splitter;
import org.apache.pdfbox.pdmodel.PDDocument;

public void execute() {
    File inputFile = new File("path/to/the/file.pdf");
    PDDocument document = null;
    try {
        document = PDDocument.load(inputFile);

        int batchSize = 50;
        int totalPages = document.getNumberOfPages();
        int batch = 1;
        // Splitter page numbers are 1-based and inclusive on both ends,
        // so each batch covers start..end and the next one begins at end + 1.
        for (int start = 1; start <= totalPages; start += batchSize) {
            int end = Math.min(start + batchSize - 1, totalPages);
            System.out.println("Batch: " + batch + " start: " + start + " end: " + end);
            split(document, start, end);
            batch++;
        }
    } catch (IOException e) {
        e.printStackTrace();
    } finally {
        // close the source document
        if (document != null) {
            try {
                document.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
}

private void split(PDDocument document, int start, int end) throws IOException {
    Splitter splitter = new Splitter();
    splitter.setStartPage(start);
    splitter.setEndPage(end);
    List<PDDocument> splittedDocuments = splitter.split(document);
    // Config is the application's own configuration holder, not part of PDFBox
    String outputPath = Config.INSTANCE.getProperty("outputPath");
    // getTitle() returns null when the PDF has no title metadata
    String title = document.getDocumentInformation().getTitle();

    for (int index = 0; index < splittedDocuments.size(); index++) {
        PDDocument splittedDocument = splittedDocuments.get(index);
        try {
            // name each file after its page number in the source document
            int pageNumber = start + index;
            splittedDocument.save(new File(outputPath, title + "-" + pageNumber + ".pdf"));
        } finally {
            // release the single-page document before moving on to the next one
            splittedDocument.close();
        }
    }
}
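
The batch boundaries are easy to get wrong by one page. As far as I can tell, Splitter treats the values passed to setStartPage and setEndPage as 1-based and inclusive on both ends, so consecutive batches must not share a boundary page. Below is a minimal sketch of just the range arithmetic, with no PDFBox dependency so it can be checked on its own. The class name BatchRanges and the method compute are hypothetical helpers made up for illustration. The code avoids Java 7+ syntax so it also compiles on JDK 6.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of PDFBox: computes 1-based, inclusive
// page ranges for splitting a document in batches.
class BatchRanges {

    /**
     * Returns {start, end} pairs that cover pages 1..totalPages,
     * each spanning at most batchSize pages.
     */
    static List<int[]> compute(int totalPages, int batchSize) {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be at least 1");
        }
        List<int[]> ranges = new ArrayList<int[]>();
        for (int start = 1; start <= totalPages; start += batchSize) {
            // The last batch may be shorter than batchSize.
            int end = Math.min(start + batchSize - 1, totalPages);
            ranges.add(new int[] { start, end });
        }
        return ranges;
    }

    public static void main(String[] args) {
        // Example: a 270-page document in batches of 50 pages.
        for (int[] range : compute(270, 50)) {
            System.out.println("start: " + range[0] + " end: " + range[1]);
        }
    }
}
```

Each pair can then be passed to setStartPage and setEndPage. When the page count is an exact multiple of the batch size, the loop simply ends and no empty trailing batch is produced.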

