How to use PDDocument.loadNonSeq: text stripping/parsing technique for large PDFs

Question

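I have some questions about parsing PDFs and how to do it:

What is the purpose of using the PDDocument.loadNonSeq method variant that takes a scratch/temporary file?

I have a large PDF that I need to parse to get its text content. I use PDDocument.load() and then PDFTextStripper to extract the data page by page (PDFTextStripper has setStartPage(n) and setEndPage(n), where n is incremented by one on each loop iteration). Is using loadNonSeq instead of load more memory-efficient?

For example: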

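import java.io.File;
import java.io.StringWriter;
import java.io.Writer;
import org.apache.pdfbox.io.RandomAccessFile;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.util.PDFTextStripper;

File pdfFile = new File("mypdf.pdf");
File tmp_file = new File("result.tmp");
// PDFBox 1.8 API: the second argument is the scratch file PDFBox uses for temporary data
PDDocument doc = PDDocument.loadNonSeq(pdfFile, new RandomAccessFile(tmp_file, "rw"));
int numpages = doc.getNumberOfPages();
// Page numbers passed to the stripper are 1-based
for (int index = 1; index <= numpages; index++) {
    PDFTextStripper stripper = new PDFTextStripper();
    Writer destination = new StringWriter();
    String xml = "";
    // Restrict the stripper to a single page
    stripper.setStartPage(index);
    stripper.setEndPage(index);
    stripper.writeText(doc, destination);
    .... // filtering text and then convert it in xml
}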

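Does the code above use loadNonSeq correctly? Is it good practice to read a PDF page by page so that it does not take up a lot of memory? I read page by page because I need to write the text to XML using an in-memory DOM (with the stripping technique, I decided to generate one XML per page).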

Answer 1

What is the purpose of using the PDDocument.loadNonSeq method variant that takes a scratch/temporary file?

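PDFBox implements two ways of reading a PDF file.

loadNonSeq is the way documents should be loaded.
load is the way documents should not be loaded, although one might try to repair files with broken cross references this way.

In the 2.0.0 development branch, the algorithm formerly used by loadNonSeq is now used by load, and the algorithm formerly used by load is no longer in use.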
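For readers on the 2.0.x line, the scratch-file idea from the question can be sketched as follows. This is only a minimal sketch, assuming PDFBox 2.0.x is on the classpath, where load accepts a MemoryUsageSetting and PDFTextStripper lives in org.apache.pdfbox.text. The class name PageTextExtractor and the method forEachPage are hypothetical names chosen for illustration.

```java
import java.io.File;
import java.io.IOException;
import java.util.function.BiConsumer;
import org.apache.pdfbox.io.MemoryUsageSetting;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

// Hypothetical helper, not part of PDFBox.
public class PageTextExtractor {

    // Calls handler once per page with (1-based page number, extracted text).
    public static void forEachPage(File pdfFile, BiConsumer<Integer, String> handler)
            throws IOException {
        // Assumption: PDFBox 2.0.x. setupTempFileOnly() asks PDFBox to buffer
        // stream data in a temporary file instead of main memory.
        try (PDDocument doc = PDDocument.load(pdfFile, MemoryUsageSetting.setupTempFileOnly())) {
            PDFTextStripper stripper = new PDFTextStripper();
            int pageCount = doc.getNumberOfPages();
            for (int page = 1; page <= pageCount; page++) {
                // Restrict extraction to a single page per call
                stripper.setStartPage(page);
                stripper.setEndPage(page);
                handler.accept(page, stripper.getText(doc));
            }
        }
    }
}
```

A caller might use it as PageTextExtractor.forEachPage(new File("mypdf.pdf"), (n, text) -> ...), handling each page's text before the next page is extracted so that only one page of output is held at a time.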

I have a large PDF that I need to parse to get its text content. I use PDDocument.load() and then PDFTextStripper to extract the data page by page (PDFTextStripper has setStartPage(n) and setEndPage(n), where n is incremented by one on each loop iteration). Is using loadNonSeq instead of load more memory-efficient?

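Using loadNonSeq instead of load may improve memory usage for multi-revision PDFs, because it only reads objects that are still referenced from the cross-reference tables, while load can keep more in memory.

I do not know, however, whether using a scratch file makes a big difference.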

Is it good practice to read a PDF page by page so that it does not take up a lot of memory?

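Internally, PDFBox also parses the given range page by page. So if you process the stripper output page by page, it is certainly fine to parse it page by page.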
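The per-page XML step mentioned in the question could look roughly like the sketch below. It uses only the JDK's DOM and transformer classes, so each page gets its own small in-memory DOM that can be discarded once serialized. The class name PageXmlWriter and the element names page and text are assumptions made for illustration. In real use the page text would come from PDFTextStripper rather than the sample strings.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical helper: one small DOM per page, serialized to a string.
public class PageXmlWriter {

    public static String pageToXml(int pageNumber, String pageText) throws Exception {
        Document dom = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();

        // <page number="N"> is the root element (element names are assumptions)
        Element page = dom.createElement("page");
        page.setAttribute("number", Integer.toString(pageNumber));
        dom.appendChild(page);

        // The serializer escapes characters such as < and & in the text node
        Element text = dom.createElement("text");
        text.setTextContent(pageText);
        page.appendChild(text);

        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(dom), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // Sample strings standing in for text extracted from two pages
        String[] pages = { "First page text", "Second page: a < b & c" };
        for (int i = 0; i < pages.length; i++) {
            System.out.println(pageToXml(i + 1, pages[i]));
        }
    }
}
```

Building a fresh DOM for every page keeps the XML side of the memory footprint proportional to a single page rather than to the whole document.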

java

pdf

pdftotext

pdfbox