(How) can I use Apache Tika to search a .DOC, .PDF, or .JAVA (etc.) file for a phrase?

Windows 7 search rarely works for me, even when the drive I'm searching is indexed.

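I've been frustrated ever since I discovered that Windows 7 has no XP "search dog" and that its search is nearly impossible to use and almost completely unreliable (i.e., since 2010), so I wrote my own search program in Java, named Searchy.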

It allows sophisticated filename pattern matching (.DOC*, .PDF, .XL*, .TXT, .XML is legal input), but Searchy can't search the CONTENTS of files for words and phrases, such as private protected.
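For context, extension patterns like the ones listed above could be matched roughly as follows. This is only a sketch: `ExtensionPatterns`, `toRegex`, and `matchesAny` are hypothetical names, and treating `*` as "the rest of the extension, ignoring case" is an assumption rather than a description of how Searchy actually works.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not Searchy's actual code.
class ExtensionPatterns {

    // Turn a pattern such as ".DOC*" into a case-insensitive regex that must
    // match the end of a filename. '*' stands for any run of non-dot characters.
    static Pattern toRegex(String extPattern) {
        StringBuilder regex = new StringBuilder(".*");
        for (char c : extPattern.toCharArray()) {
            if (c == '*') {
                regex.append("[^.]*");
            } else {
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(regex.toString(), Pattern.CASE_INSENSITIVE);
    }

    // True if the filename matches at least one of the extension patterns.
    static boolean matchesAny(String filename, List<String> extPatterns) {
        for (String p : extPatterns) {
            if (toRegex(p).matcher(filename).matches()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> patterns = Arrays.asList(".DOC*", ".PDF", ".XL*", ".TXT", ".XML");
        for (String name : args) {
            System.out.println(name + " -> " + matchesAny(name, patterns));
        }
    }
}
```

Compiling each pattern on every call is wasteful for a large directory walk; a real tool would presumably compile the patterns once up front.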

I found Apache Tika, downloaded a .jar file, and imported it into Netbeans 8.0.2 so that the supplied sample program, tika-example (below), compiles, somewhat to my surprise.

This blurb at the link makes me think Apache Tika is what I should be using in Searchy:

The Apache Tika™ toolkit detects and extracts metadata and text from over a thousand different file types (such as PPT, XLS, and PDF). All of these file types can be parsed through a single interface, making Tika useful for search engine indexing, content analysis, translation, and much more.

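I have no idea how to use it cleverly, but if I can figure out how to process one file to see whether it contains a particular String, I think I'll be able to make that process work in Searchy as a set of methods in a class I'd create.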

tika-example

package org.apache.tika.example;

import java.io.File;

import org.apache.commons.io.FileUtils;
import org.apache.tika.config.TikaConfig;
import org.apache.tika.detect.Detector;
import org.apache.tika.language.LanguageIdentifier;
import org.apache.tika.language.LanguageProfile;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.mime.MediaType;
import org.apache.tika.mime.MimeTypes;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;

/**
 * Demonstrates how to call the different components within Tika: its
 * {@link Detector} framework (aka MIME identification and repository), its
 * {@link Parser} interface, its {@link LanguageIdentifier} and other goodies.
 */

public class MyFirstTika {

    public static void main(String[] args) throws Exception {
        String filename = "Test.Docx";//args[0];
        MimeTypes mimeRegistry = TikaConfig.getDefaultConfig()
                .getMimeRepository();

        System.out.println("Examining: [" + filename + "]");

        System.out.println("The MIME type (based on filename) is: ["
                + mimeRegistry.getMimeType(filename) + "]");

        System.out.println("The MIME type (based on MAGIC) is: ["
                + mimeRegistry.getMimeType(new File(filename)) + "]");

        Detector mimeDetector = (Detector) mimeRegistry;
        System.out
                .println("The MIME type (based on the Detector interface) is: ["
                        + mimeDetector.detect(new File(filename).toURI().toURL()
                                .openStream(), new Metadata()) + "]");

        LanguageIdentifier lang = new LanguageIdentifier(new LanguageProfile(
                FileUtils.readFileToString(new File(filename))));

        System.out.println("The language of this content is: ["
                + lang.getLanguage() + "]");

        Parser parser = TikaConfig.getDefaultConfig().getParser(
                MediaType.parse(mimeRegistry.getMimeType(filename).getName()));

        Metadata parsedMet = new Metadata();
        ContentHandler handler = new BodyContentHandler();
        parser.parse(new File(filename).toURI().toURL().openStream(), handler,
                parsedMet, new ParseContext());

        System.out.println("Parsed Metadata: ");
        System.out.println(parsedMet);
        System.out.println("Parsed Text: ");
        System.out.println(handler.toString());

    }
}

Although it does compile, I'm not surprised to get a runtime error:

run:
Examining: [Test.Docx]
The MIME type (based on filename) is: [application/vnd.openxmlformats-officedocument.wordprocessingml.document]
The MIME type (based on MAGIC) is: [application/vnd.openxmlformats-officedocument.wordprocessingml.document]
The MIME type (based on the Detector interface) is: [application/octet-stream]
The language of this content is: [lt]
Exception in thread "main" org.apache.tika.exception.TikaException: Error creating OOXML extractor
    at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:123)
    at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:82)
    at org.apache.tika.example.MyFirstTika.main(MyFirstTika.java:56)
Caused by: org.apache.poi.openxml4j.exceptions.InvalidFormatException: Package should contain a content type part [M1.13]
    at org.apache.poi.openxml4j.opc.ZipPackage.getPartsImpl(ZipPackage.java:203)
    at org.apache.poi.openxml4j.opc.OPCPackage.getParts(OPCPackage.java:684)
    at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:275)
    at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:73)
    ... 2 more
Java Result: 1

Because of the error below, I supplied the file it tries to open: Test.doc, which contains 3 lines reading 'Testing'.

Exception in thread "main" java.io.FileNotFoundException: C:\Users\Dov\Google Drive\NetBeansProjects\tika-example\tikaExample\Test.Doc (The system cannot find the file specified)

I found spring.xml and pom.xml in the folder C:\Users\Dov\Downloads\tika-1.9-src\tika-1.9\tika-example, but I don't know what to do with them.

spring.xml:

<?xml version="1.0" encoding="UTF-8"?>
<beans xmlns="http://www.springframework.org/schema/beans"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.springframework.org/schema/beans
                           http://www.springframework.org/schema/beans/spring-beans-3.0.xsd">

<!--<start id="spring"/>-->
  <bean id="tika" class="org.apache.tika.parser.AutoDetectParser">
    <constructor-arg>
        <list>
           <ref bean="txt"/>
           <ref bean="pdf"/>
        </list>
    </constructor-arg>
  </bean>

  <bean id="txt" class="org.apache.tika.parser.txt.TXTParser"/>
  <bean id="pdf" class="org.apache.tika.parser.pdf.PDFParser"/>
<!--<end id="spring"/>-->

</beans>

pom.xml:

    <?xml version="1.0" encoding="UTF-8"?>
    <project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
      <parent>
        <artifactId>tika-parent</artifactId>
        <groupId>org.apache.tika</groupId>
        <version>1.9</version>
        <relativePath>../tika-parent/pom.xml</relativePath>
      </parent>
      <modelVersion>4.0.0</modelVersion>

      <artifactId>tika-example</artifactId>

      <name>Apache Tika examples</name>
      <url>http://tika.apache.org/</url>

      <description>This module contains examples of how to use Apache Tika.</description>
      <organization>
        <name>The Apache Software Foundation</name>
        <url>http://www.apache.org</url>
      </organization>

      <scm>
        <url>http://svn.apache.org/viewvc/tika/tags/1.9-rc2/tika-example</url>
        <connection>scm:svn:http://svn.apache.org/repos/asf/tika/tags/1.9-rc2/tika-example</connection>
        <developerConnection>scm:svn:https://svn.apache.org/repos/asf/tika/tags/1.9-rc2/tika-example</developerConnection>
      </scm>

      <issueManagement>
        <system>JIRA</system>
        <url>https://issues.apache.org/jira/browse/TIKA</url>
      </issueManagement>

      <ciManagement>
        <system>Jenkins</system>
        <url>https://builds.apache.org/job/Tika-trunk/</url>
      </ciManagement>

      <!-- List of dependencies that we depend on for the examples. See the full list of Tika
           modules and how to use them at http://mvnrepository.com/artifact/org.apache.tika.-->
      <dependencies>
        <dependency>
            <groupId>org.apache.tika</groupId>
            <artifactId>tika-app</artifactId>
            <version>${project.version}</version>
            <exclusions>
              <exclusion>
                <artifactId>tika-parsers</artifactId>
                <groupId>org.apache.tika</groupId>
              </exclusion>
            </exclusions>
        </dependency>  
        <dependency>
          <groupId>org.apache.tika</groupId>
          <artifactId>tika-parsers</artifactId>
          <version>${project.version}</version>
        </dependency>
        <dependency>
          <groupId>org.apache.tika</groupId>
          <artifactId>tika-serialization</artifactId>
          <version>${project.version}</version>
        </dependency>
        <dependency>
          <groupId>org.apache.tika</groupId>
          <artifactId>tika-translate</artifactId>
          <version>${project.version}</version>
        </dependency>
        <dependency>
          <groupId>org.apache.tika</groupId>
          <artifactId>tika-parsers</artifactId>
          <version>${project.version}</version>
          <type>test-jar</type>
          <scope>test</scope>
        </dependency>
        <dependency>
            <groupId>javax.jcr</groupId>
            <artifactId>jcr</artifactId>
            <version>2.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.jackrabbit</groupId>
            <artifactId>jackrabbit-jcr-server</artifactId>
            <version>2.3.6</version>
        </dependency>
        <dependency>
            <groupId>org.apache.jackrabbit</groupId>
            <artifactId>jackrabbit-core</artifactId>
            <version>2.3.6</version>
        </dependency>       
        <dependency>
            <groupId>org.apache.lucene</groupId>
            <artifactId>lucene-core</artifactId>
            <version>3.5.0</version>
        </dependency>   
        <dependency>
            <groupId>commons-io</groupId>
            <artifactId>commons-io</artifactId>
            <version>2.4</version>
        </dependency>
        <dependency>
            <groupId>org.springframework</groupId>
            <artifactId>spring-context</artifactId>
            <version>3.0.2.RELEASE</version>
        </dependency>
        <dependency>
          <groupId>junit</groupId>
          <artifactId>junit</artifactId>
          <scope>test</scope>
        </dependency>
      </dependencies>
    </project>

Any help with the error, or with how to handle the xml files in Netbeans so that the tika-example program works, would be appreciated.
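One thing that may be worth checking for the `Package should contain a content type part` error: POI reports it when it cannot find `[Content_Types].xml` in the package, which is typically what happens when the input is not really a ZIP-based OOXML file. Whether that is the cause here is an assumption, not a confirmed diagnosis. A quick way to check is to compare the file's leading bytes with its extension: .docx and .xlsx files are ZIP archives that start with `PK`, legacy .doc and .xls files are OLE2 compound documents that start with `D0 CF 11 E0`, and PDFs start with `%PDF`. The sketch below uses only the JDK; `MagicSniffer` and `sniff` are hypothetical names.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical diagnostic helper; class and method names are illustrative.
class MagicSniffer {

    // Classify a stream by its first bytes: "zip", "ole2", "pdf", or "unknown".
    static String sniff(InputStream in) throws IOException {
        byte[] head = new byte[8];
        int n = 0;
        int r;
        // Keep reading until 8 bytes are buffered or the stream ends.
        while (n < head.length && (r = in.read(head, n, head.length - n)) > 0) {
            n += r;
        }
        if (startsWith(head, n, 0x50, 0x4B, 0x03, 0x04)) {
            return "zip";   // container used by .docx, .xlsx, .pptx
        }
        if (startsWith(head, n, 0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1)) {
            return "ole2";  // container used by legacy .doc, .xls, .ppt
        }
        if (startsWith(head, n, 0x25, 0x50, 0x44, 0x46)) {
            return "pdf";   // "%PDF"
        }
        return "unknown";
    }

    // True if the first sig.length bytes of data equal the given signature.
    private static boolean startsWith(byte[] data, int len, int... sig) {
        if (len < sig.length) {
            return false;
        }
        for (int i = 0; i < sig.length; i++) {
            if ((data[i] & 0xFF) != sig[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            System.out.println(args[0] + " looks like: " + sniff(in));
        }
    }
}
```

If a file named Test.Docx comes back as `ole2` or `unknown`, a parser chosen from the file name alone would be the wrong one for it.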

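I figured out how to use it. I got it to give correct output for whether .DOC, .XLSX, and .PDF files contain a given string, so apparently the two xml files aren't needed. (It builds on the imports from the original question.)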

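    import java.io.File;
    import java.io.IOException;
    import java.io.InputStream;
    import java.net.MalformedURLException;

    import org.apache.tika.config.TikaConfig;
    import org.apache.tika.exception.TikaException;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.mime.MediaType;
    import org.apache.tika.mime.MimeTypeException;
    import org.apache.tika.mime.MimeTypes;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.parser.Parser;
    import org.apache.tika.sax.BodyContentHandler;
    import org.xml.sax.ContentHandler;
    import org.xml.sax.SAXException;

    public class MyFirstTika {

      // Returns true if the text Tika extracts from the file contains s, ignoring case.
      public static boolean contains(File file, String s) throws MalformedURLException,
         IOException, MimeTypeException, SAXException, TikaException {

        ContentHandler handler = new BodyContentHandler();

        MimeTypes mimeRegistry = TikaConfig.getDefaultConfig().getMimeRepository();

        // Choose the parser from the MIME type detected for the file.
        Parser parser = TikaConfig.getDefaultConfig().getParser(
            MediaType.parse(mimeRegistry.getMimeType(file).getName()));

        Metadata parsedMet = new Metadata();

        // try-with-resources closes the stream even if parsing fails.
        try (InputStream in = file.toURI().toURL().openStream()) {
          parser.parse(in, handler, parsedMet, new ParseContext());
        }

        System.out.println("Handler:\n\n******" + handler + "\n\n*****");
        return handler.toString().toLowerCase().contains(s.toLowerCase());
      }

      public static void main(String[] args) throws Exception {
        String searchString = "champion";
        String filename = "schedule.pdf"; // also tried: test.docx, meds.xlsx, Test2.Doc
        File file = new File(filename);

        System.out.println(file + " contains " + searchString + ": "
                + contains(file, searchString));
      }
    }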

Sample output:

    Handler:
    ******
    DUBLIN YOUTH ATHLETICS
    Game Schedule  2014-2015
    Girls 6th-8th Grade League

    Dream

    Game Day Date Gym Time Home (White) Visitor (Green)
    1 Sunday 12/7/2014 Sells 4:00 PM Dream Sparks

    7 Sunday 12/14/2014 Sells 2:00 PM Fever Dream

    13 Sunday 1/4/2015 Sells 6:00 PM Stars Dream

    Championship 3/8/2015

    *****

    schedule.pdf contains champion: true
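One caveat when moving from single words to phrases: the `contains` check compares against the raw extracted text, and text pulled out of PDFs and Office files often has line breaks or runs of spaces in the middle of a sentence. A two-word phrase such as `private protected` can then be missed even though both words are adjacent in the document. A minimal sketch of a whitespace-tolerant comparison follows; `PhraseMatcher`, `normalize`, and `containsPhrase` are hypothetical names, and the string passed in is assumed to be whatever `handler.toString()` returned.

```java
import java.util.Locale;

// Hypothetical helper for illustration; works on text that was already extracted.
class PhraseMatcher {

    // Collapse every run of whitespace to one space and lower-case the result.
    static String normalize(String text) {
        return text.replaceAll("\\s+", " ").trim().toLowerCase(Locale.ROOT);
    }

    // Case-insensitive phrase search that ignores differences in whitespace.
    static boolean containsPhrase(String extractedText, String phrase) {
        return normalize(extractedText).contains(normalize(phrase));
    }

    public static void main(String[] args) {
        String extracted = "public class A {\n    private\n    protected int x;\n}";
        System.out.println(containsPhrase(extracted, "private protected"));
    }
}
```

Using `Locale.ROOT` keeps the lower-casing independent of the machine's default locale, which matters for some alphabets (Turkish dotless i is the usual example).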