Apache Tika maxStringLength reached

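I have several thousand PDF documents, each 11-15 MB in size. When I run my program, it fails with an error saying that my document contains more than 100k characters.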

Error output:

Exception in thread "main" org.apache.tika.sax.WriteOutContentHandler$WriteLimitReachedException: Your document contained more than 100000 characters, and so your requested limit has been reached. To receive the full text of the document, increase your limit.

How can I increase the limit to 10-15 MB?

I found a solution that uses the new Tika facade class, but I can't figure out how to integrate it with my code.

  Tika tika = new Tika(); 
  tika.setMaxStringLength(10*1024*1024);

Here is my code:

  BodyContentHandler handler = new BodyContentHandler();
  Metadata metadata = new Metadata();
  String location = "C:\\Users\\Laptop\\Dropbox\\MainTextbookTrappe2ndEd.pdf";
  FileInputStream inputstream = new FileInputStream(location);
  ParseContext pcontext = new ParseContext();
  PDFParser pdfparser = new PDFParser(); 
  pdfparser.parse(inputstream, handler, metadata, pcontext);

Output:

System.out.println("Content of the PDF :" + handler.toString());

Use

BodyContentHandler handler = new BodyContentHandler(-1);

to disable the limit. From the Javadoc:

The internal string buffer is bounded at the given number of characters. If this write limit is reached, then a SAXException is thrown.
Parameters: writeLimit - maximum number of characters to include in the string, or -1 to disable the write limit
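
To make the Javadoc's description concrete, here is a minimal sketch of a handler with the same write-limit semantics: it keeps text up to the limit, throws a SAXException once the limit is hit, and treats -1 as "no limit". It is a simplified stand-in written against the JDK's SAX classes, not Tika's actual implementation; the names WriteLimitSketch, LimitedTextHandler and feed are hypothetical and exist only for this illustration.

```java
import java.util.Arrays;

import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class WriteLimitSketch {

    /** Hypothetical handler: collects character data up to writeLimit chars; -1 disables the limit. */
    static class LimitedTextHandler extends DefaultHandler {
        private final StringBuilder buffer = new StringBuilder();
        private final int writeLimit;

        LimitedTextHandler(int writeLimit) {
            this.writeLimit = writeLimit;
        }

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            if (writeLimit == -1 || buffer.length() + length <= writeLimit) {
                buffer.append(ch, start, length);
                return;
            }
            // Keep whatever still fits, then signal that the limit was reached.
            buffer.append(ch, start, writeLimit - buffer.length());
            throw new SAXException("write limit of " + writeLimit + " characters reached");
        }

        @Override
        public String toString() {
            return buffer.toString();
        }
    }

    /** Feeds totalChars characters to a handler in 1000-char chunks; returns how many it kept. */
    static int feed(int writeLimit, int totalChars) {
        LimitedTextHandler handler = new LimitedTextHandler(writeLimit);
        char[] chunk = new char[1000];
        Arrays.fill(chunk, 'x');
        try {
            for (int sent = 0; sent < totalChars; sent += chunk.length) {
                handler.characters(chunk, 0, Math.min(chunk.length, totalChars - sent));
            }
        } catch (SAXException e) {
            // With a finite limit, parsing stops here and only the truncated text remains.
            System.out.println("Stopped early: " + e.getMessage());
        }
        return handler.toString().length();
    }

    public static void main(String[] args) {
        System.out.println(feed(100_000, 250_000)); // limited handler
        System.out.println(feed(-1, 250_000));      // limit disabled
    }
}
```

In the question's code, the corresponding change is to construct the handler as new BodyContentHandler(-1) and to read the extracted text from handler.toString() after parse() returns. With the limit disabled, the whole text of each document is held in memory, which is worth keeping in mind when processing thousands of large PDFs.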