What is the significance of writeLimit in Apache Tika's BodyContentHandler?
In our application we need to check whether a file (of any format) is password protected, and we use the Apache Tika API for this.
The code snippet is shown below.
public static boolean isPasswordProtectedFile(File filePart) {
    Parser parser = new AutoDetectParser();
    BodyContentHandler handler = new BodyContentHandler();
    Metadata metadata = new Metadata();
    ParseContext context = new ParseContext();
    // try-with-resources ensures the input stream is always closed
    try (InputStream stream = FileUtils.openInputStream(filePart)) {
        // Parse the file; an encrypted document makes the parser throw
        parser.parse(stream, handler, metadata, context);
        LOGGER.debug("File is not password protected");
    } catch (EncryptedDocumentException e) {
        LOGGER.error("File is encrypted with a password", e);
        return true;
    } catch (Exception e) {
        // Any other failure is logged and treated as "not password protected"
        LOGGER.error("File parsing failed", e);
    }
    return false;
}
However, for a few of the files we tested, this consumed too much CPU. If we instead create the BodyContentHandler as shown below, parsing finishes faster and uses less CPU.
BodyContentHandler handler = new BodyContentHandler(-1);
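I read through the documentation but could not make sense of this behaviour. I would appreciate a possible explanation. Thanks in advance.
The documentation says: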
https://tika.apache.org/1.4/api/org/apache/tika/sax/BodyContentHandler.html#BodyContentHandler(int)
Creates a content handler that writes XHTML body character events to an internal string buffer. The contents of the buffer can be retrieved using the ContentHandlerDecorator.toString() method.
The internal string buffer is bounded at the given number of characters. If this write limit is reached, then a SAXException is thrown.
writeLimit - maximum number of characters to include in the string, or -1 to disable the write limit
Here the buffer is never initialized.
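
To make the quoted write-limit semantics concrete, here is a minimal sketch that uses only the JDK's SAX classes. The names `WriteLimitSketch`, `BoundedTextHandler`, `LimitReached` and `Result` are hypothetical and made up for illustration. This is not Tika's implementation, only an imitation of the documented behaviour: a bounded string buffer, a SAXException once the limit is reached, and -1 to disable the limit.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class WriteLimitSketch {

    /** Hypothetical marker exception, so a limit hit can be told apart from malformed XML. */
    static class LimitReached extends SAXException {
        LimitReached(int limit) {
            super("Write limit of " + limit + " characters reached");
        }
    }

    /** Hypothetical bounded handler: collects character events into a string buffer. */
    static class BoundedTextHandler extends DefaultHandler {
        private final StringBuilder buffer = new StringBuilder();
        private final int writeLimit; // negative (conventionally -1) disables the limit

        BoundedTextHandler(int writeLimit) {
            this.writeLimit = writeLimit;
        }

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            if (writeLimit < 0 || buffer.length() + length <= writeLimit) {
                buffer.append(ch, start, length);
                return;
            }
            // Keep only what still fits, then abort the parse.
            buffer.append(ch, start, writeLimit - buffer.length());
            throw new LimitReached(writeLimit);
        }

        @Override
        public String toString() {
            return buffer.toString();
        }
    }

    /** Collected text plus a flag saying whether the limit cut the parse short. */
    static class Result {
        final String text;
        final boolean limitReached;

        Result(String text, boolean limitReached) {
            this.text = text;
            this.limitReached = limitReached;
        }
    }

    static Result parse(String xml, int writeLimit) {
        BoundedTextHandler handler = new BoundedTextHandler(writeLimit);
        boolean limitReached = false;
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (LimitReached e) {
            limitReached = true;
        } catch (Exception e) {
            throw new IllegalStateException("Unexpected parse failure", e);
        }
        return new Result(handler.toString(), limitReached);
    }

    public static void main(String[] args) {
        Result unbounded = parse("<body>hello world</body>", -1);
        Result bounded = parse("<body>hello world</body>", 5);
        System.out.println(unbounded.text + " / limit reached: " + unbounded.limitReached);
        System.out.println(bounded.text + " / limit reached: " + bounded.limitReached);
    }
}
```

In this sketch the limit surfaces as an exception that aborts the parse, so a caller that catches a generic `Exception`, as the method in the question does, would see it as an ordinary parsing failure unless it checks for it separately.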