Issue in parsing iWorksDocument with Apache Tika

Question

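I am trying to parse an iWorks document with Apache Tika, but I am not getting the parsed content. Instead, the content handler returns some other output. The code snippet I used and the output I got are below.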

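    import java.io.File;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import org.apache.tika.exception.TikaException;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.parser.Parser;
    import org.apache.tika.sax.BodyContentHandler;
    import org.xml.sax.SAXException;

    // Called with e.g. new File("/home/user/tika/samples/budget.numbers")
    private void parseFile(File file) {
        // try-with-resources closes the stream even if parsing fails
        try (InputStream inputStream = new FileInputStream(file)) {
            ParseContext context = new ParseContext();
            // -1 disables the write limit on the extracted text
            BodyContentHandler bodyHandler = new BodyContentHandler(-1);
            Parser parser = new AutoDetectParser();
            parser.parse(inputStream, bodyHandler, new Metadata(), context);
            System.out.println("Contents of the file :" + bodyHandler.toString());
        } catch (IOException | SAXException | TikaException e) {
            e.printStackTrace();
        }
    }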

Output:

Contents of the file :
Index/Document.iwa
Index/ViewState.iwa
Index/CalculationEngine.iwa
Index/Tables/HeaderStorageBucket-2.iwa
Index/Tables/Tile.iwa
Index/Metadata.iwa
Metadata/Properties.plist

I am able to detect the file type correctly using the detector API, but I am not getting any useful content from the document. Please help!

Answer 1

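Tika should be able to parse Numbers documents. If you are able to share the document, please post it to our Jira. Looking at the parser, we could handle namespaces more robustly, and that may be the problem, but I can't tell without the document.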
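
The output in the question looks like the entry names of a ZIP package rather than document text. One way to check what the file contains before posting it is to list its entries with `java.util.zip` alone. This is only a minimal sketch and is not part of Tika: the class name `ZipEntryLister` and the method `listEntries` are made up for illustration, and it assumes the `.numbers` file is a single ZIP file rather than a directory bundle.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper, not a Tika class.
public class ZipEntryLister {

    // Returns the names of all entries in a ZIP stream, in archive order.
    static List<String> listEntries(InputStream in) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                names.add(entry.getName());
            }
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        // The path to inspect is passed as the only command-line argument.
        if (args.length != 1) {
            System.err.println("usage: java ZipEntryLister <file>");
            return;
        }
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            for (String name : listEntries(in)) {
                System.out.println(name);
            }
        }
    }
}
```

If the listing matches the names Tika printed, that suggests the text you got is the package's entry list and not the spreadsheet contents, which is a useful detail to include in the Jira report.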

java

lucene

text-extraction

apache-tika