Using Tika 1.10 Parser to obtain file content

Question

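I am running into a strange problem when trying to obtain file content with the Tika Parser. The code below works fine for several types of input file (e.g. doc, docx, txt, pdf) when run in a JUnit test, i.e. I am able to get the text content of each file. When I run the same code inside my application, no text is returned. No exception is thrown; handler.toString() simply returns an empty string.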

public static String parseFile(final String path, final int charCountLimit) {

    if(path == null){
        throw new InvalidParameterException("parameter is null");
    } 

    if(charCountLimit < -1 || charCountLimit == 0){
        throw new InvalidParameterException("char count limit is out of range");
    }

    final File file = new File(path);

    if(! file.exists()){
        throw new InvalidParameterException(String.format("file does not exist %s", path));
    }

    try (InputStream stream = new FileInputStream(file.getAbsolutePath());){
        final AutoDetectParser parser = new AutoDetectParser();
        final BodyContentHandler handler = new BodyContentHandler(charCountLimit);

        Metadata metadata = new Metadata();
        /* the following setting is required for Office 2007 and later files, 
         * despite not being specified in the Tika Parser documentation
         */
        metadata.set(Metadata.RESOURCE_NAME_KEY, file.getName());

        parser.parse(stream, handler, metadata);
        return handler.toString();

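    } catch (EncryptedDocumentException e){
        // handle exception (e.g. password-protected document)
    } catch (IOException | SAXException | TikaException e) {
        // handle exception
    }
    // reached only after an exception was handled above;
    // without a return here the method does not compile
    return null;
}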

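My first thought was that my application was doing something to the file I was using, but I have ruled that out by pointing the code at a hard-coded reference to one of the test case files on my file system.

My next thought was that I had some kind of version conflict. In my project's POM I reference v1.10 of tika-core, but the parent POM specified v1.8. I have changed the parent POM's reference to 1.10, but the problem persists.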

    <dependency>
        <groupId>org.apache.tika</groupId>
        <artifactId>tika-parsers</artifactId>
        <version>1.10</version>
    </dependency>
    <dependency>
        <groupId>org.apache.tika</groupId>
        <artifactId>tika-core</artifactId>
        <version>1.10</version>
    </dependency>

Any suggestions for resolving this issue would be greatly appreciated.

Update

Following http://wiki.apache.org/tika/Troubleshooting%20Tika#No_Content_Extracted I found that all of the parsers are missing. In JUnit,

org.apache.tika.parser.DefaultParser

contains 58 parsers. When running inside the application on my JBoss 8 server, DefaultParser contains no parsers at all. Adding the JVM argument

-Dorg.apache.tika.service.error.warn=true

does not produce any java.lang.NoClassDefFoundError that would indicate a parser failing to load.

Answer 1

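I solved my problem. It was related to the dependencies of the EAR file that contains my "parse file" jar.

The EAR's POM already had a dependency on tika-core. At runtime, the EAR's copy of tika-core was used to instantiate AutoDetectParser. Because the EAR's POM had no dependency on tika-parsers, the parser classes could not be loaded.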
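The fix described above might look like the following fragment in the EAR's POM. This is only a sketch: it assumes the EAR should use the same 1.10 version as the rest of the project, and the surrounding elements of the real POM are not shown in the question.

```xml
<!-- EAR pom.xml (sketch): declare tika-parsers next to the existing tika-core dependency -->
<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-core</artifactId>
    <version>1.10</version>
</dependency>
<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-parsers</artifactId>
    <version>1.10</version>
</dependency>
```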

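So the problem appears to have been caused by an incorrect Maven POM dependency configuration, combined with the fact that, by default, DefaultParser (which AutoDetectParser uses) produces no output and throws no exception when it cannot find any parsers.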
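Because the failure is silent, it can help to check from inside the deployed application what the class loader can actually see. Below is a minimal sketch that uses only the JDK. It relies on the fact that Tika discovers parsers through service files named META-INF/services/org.apache.tika.parser.Parser. The class name TikaClasspathCheck and its helper methods are hypothetical, and using the thread context class loader is an assumption about how the application is set up.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URL;
import java.util.Collections;
import java.util.List;

// Hypothetical diagnostic helper, not part of Tika.
final class TikaClasspathCheck {

    // Service file that parser jars such as tika-parsers ship for parser discovery.
    static final String PARSER_SERVICE_FILE =
            "META-INF/services/org.apache.tika.parser.Parser";

    // Returns every copy of the given resource visible to the class loader.
    static List<URL> findServiceFiles(ClassLoader loader, String resource) {
        try {
            return Collections.list(loader.getResources(resource));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Returns true if the class can be located without initializing it.
    static boolean isClassVisible(ClassLoader loader, String className) {
        try {
            Class.forName(className, false, loader);
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Assumption: the context class loader is the one the application uses.
        ClassLoader loader = Thread.currentThread().getContextClassLoader();

        System.out.println("tika-core visible:    "
                + isClassVisible(loader, "org.apache.tika.parser.AutoDetectParser"));
        System.out.println("tika-parsers visible: "
                + isClassVisible(loader, "org.apache.tika.parser.pdf.PDFParser"));

        List<URL> serviceFiles = findServiceFiles(loader, PARSER_SERVICE_FILE);
        System.out.println("parser service files: " + serviceFiles.size());
        for (URL url : serviceFiles) {
            System.out.println("  " + url);
        }
    }
}
```

In the situation described in this answer, one would expect the tika-core check to succeed while the tika-parsers check fails and no service files are listed. Running the same checks from a servlet or startup bean inside the EAR, rather than from main, would reflect the server's class loading more closely.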

java

jboss

maven

apache-tika