Apache Tika maxStringLength reached
I have thousands of PDF documents, each 11-15 MB. My program fails and reports that a document contains more than 100,000 characters.
Error output:
Exception in thread "main"
org.apache.tika.sax.WriteOutContentHandler$WriteLimitReachedException:
Your document contained more than 100000 characters, and so your
requested limit has been reached. To receive the full text of the
document, increase your limit.
How can I raise the limit to 10-15 MB?
I found a solution that uses the new Tika facade class, but I can't work out how to integrate it with my code:
Tika tika = new Tika();
tika.setMaxStringLength(10*1024*1024);
Here is my code:
BodyContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();
String location = "C:\\Users\\Laptop\\Dropbox\\MainTextbookTrappe2ndEd.pdf";
FileInputStream inputstream = new FileInputStream(location);
ParseContext pcontext = new ParseContext();
PDFParser pdfparser = new PDFParser();
pdfparser.parse(inputstream, handler, metadata, pcontext);
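Output:
// Print the extracted text held by the handler; printing pcontext would only show the ParseContext object
System.out.println("Content of the PDF :" + handler.toString());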
Use
BodyContentHandler handler = new BodyContentHandler(-1);
to disable the write limit.
From the Javadoc:
The internal string buffer is bounded at the given number of
characters. If this write limit is reached, then a SAXException is
thrown.
Parameters: writeLimit - maximum number of characters to include in the string, or -1 to disable the write limit
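To make the write-limit behaviour described in the Javadoc concrete, here is a minimal sketch that uses only the JDK's SAX classes. It is not Tika's implementation: `LimitedTextHandler` and `WriteLimitDemo` are hypothetical names invented for this illustration, and the only thing taken from the Javadoc is the contract (a bounded character buffer, a SAXException when the limit is reached, and -1 meaning no limit).

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler (not part of Tika): collects character data up to writeLimit.
class LimitedTextHandler extends DefaultHandler {
    private final StringBuilder buffer = new StringBuilder();
    private final int writeLimit; // -1 disables the limit

    LimitedTextHandler(int writeLimit) {
        this.writeLimit = writeLimit;
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        if (writeLimit == -1 || buffer.length() + length <= writeLimit) {
            buffer.append(ch, start, length);
        } else {
            // Keep the part that still fits, then abort the parse.
            buffer.append(ch, start, writeLimit - buffer.length());
            throw new SAXException("Write limit of " + writeLimit + " characters reached");
        }
    }

    @Override
    public String toString() {
        return buffer.toString();
    }
}

class WriteLimitDemo {
    // Returns true if the whole document was read, false if a SAXException
    // (here: the write limit, or malformed XML) stopped the parse.
    static boolean parse(String xml, LimitedTextHandler handler) {
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return true;
        } catch (SAXException e) {
            return false;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<doc>abcdefghijklmnopqrst</doc>";

        LimitedTextHandler limited = new LimitedTextHandler(5);
        System.out.println(parse(xml, limited) + " " + limited);

        LimitedTextHandler unlimited = new LimitedTextHandler(-1);
        System.out.println(parse(xml, unlimited) + " " + unlimited);
    }
}
```

The same trade-off applies to the real handler: with -1 the whole text of each document is held in memory, so for thousands of large PDFs an explicit limit such as 10*1024*1024 may be the safer choice.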