Bare bones Tika type detector and Apache POI

I am using Apache Tika 1.7 and Apache POI to extract text from .doc and .docx documents in a Maven-built project. For some reason, I get a

java.lang.NoSuchMethodError: org.apache.poi.util.IOUtils.calculateChecksum

error. According to the Apache POI FAQ, this is caused by a version conflict, so the obvious solution would be to upgrade POI. The catch is that I am using the version of POI that is bundled with Tika in the tika-parsers package, because the Tika type detector is the only part of Tika I use (apart from POI). If I depend only on tika-core and declare the POI dependencies separately in the Maven pom.xml, the Tika detector stops detecting container types such as .docx files, because the tika-parsers package is required for the detector, as stated here. So how can I solve this? I want accurate type detection from Tika, but I also want to use Apache POI alongside it.

Thanks

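I don't know what your POM looks like, but in most cases this kind of problem can be solved by excluding the offending transitive dependency.

It would look something like this: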

<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-parsers</artifactId>
    <version>1.7</version>
    <exclusions>
        <exclusion>
            <groupId>org.apache.poi</groupId>
            <artifactId>poi</artifactId>
        </exclusion>
    </exclusions> 
</dependency>
<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>poi</artifactId>
    <version>3.11</version>
</dependency>

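However, looking at the POM for Tika 1.7, it already depends on POI 3.11, which is currently the latest version and contains the required method. So most likely another dependency somewhere in your project is pulling in an older version of POI.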

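You can use the Maven dependency plugin to find the offending library (for example, by running mvn dependency:tree), and then apply the exclusion technique shown above to that library to resolve the conflict.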
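As a complementary check, it may help to ask the JVM at runtime which jar the conflicting class was actually loaded from. The following is only a minimal sketch: the class name WhichJar and the helper locationOf are made up for illustration and are not part of Tika or POI, and it assumes you run it with the same classpath as your application.

```java
import java.net.URL;
import java.security.CodeSource;

// Hypothetical helper (not part of Tika or POI): reports where a class was loaded from.
class WhichJar {

    // Returns the jar or directory a class came from, or a placeholder when
    // the JVM exposes no code source (typically for JDK classes).
    public static String locationOf(Class<?> type) {
        CodeSource source = type.getProtectionDomain().getCodeSource();
        URL location = (source == null) ? null : source.getLocation();
        return (location == null) ? "(no code source, e.g. a JDK class)" : location.toString();
    }

    public static void main(String[] args) {
        // Defaults to the class named in the NoSuchMethodError above.
        String name = args.length > 0 ? args[0] : "org.apache.poi.util.IOUtils";
        try {
            // Load without initializing, so static initializers do not run.
            Class<?> type = Class.forName(name, false, WhichJar.class.getClassLoader());
            System.out.println(name + " loaded from " + locationOf(type));
        } catch (ClassNotFoundException e) {
            System.out.println(name + " is not on the classpath");
        }
    }
}
```

If the printed location is a POI jar older than 3.11, that jar is the one to track down in the dependency tree and exclude.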