(How) Can I use Apache Tika to search a .DOC or .PDF or .JAVA (etc.) file for a phrase?
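Windows 7 Search rarely works for me when the drive I'm searching is indexed.
Frustrated ever since 2010, when I found that Windows 7 has no XP "search dog" and that its search is next to impossible to use and almost totally unreliable, I wrote my own search program in Java, named Searchy.
But while it allows complicated filename pattern matching (.DOC*, .PDF, .XL*, .TXT, .XML is legitimate input), Searchy cannot search the CONTENTS of a file for words and phrases, such as private protected.
I found Apache Tika, downloaded a .jar file of routines, and imported it into Netbeans 8.0.2, so that the supplied sample program tika-example below (somewhat surprisingly) compiles.
This blurb from the linked page made me think Apache Tika is what I should use in Searchy: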
The Apache Tika™ toolkit detects and extracts metadata and text from over a thousand different file types (such as PPT, XLS, and PDF). All of these file types can be parsed through a single interface, making Tika useful for search engine indexing, content analysis, translation, and much more.
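I have no idea how to use it artfully, but if I can figure out how to process one file to see whether it contains a particular String, I think I will be able to make the process work in Searchy as a set of methods in a class I'd create.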
tika-example
package org.apache.tika.example;
import java.io.File;
import org.apache.commons.io.FileUtils;
import org.apache.tika.config.TikaConfig;
import org.apache.tika.detect.Detector;
import org.apache.tika.language.LanguageIdentifier;
import org.apache.tika.language.LanguageProfile;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.mime.MediaType;
import org.apache.tika.mime.MimeTypes;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;
/**
* Demonstrates how to call the different components within Tika: its
* {@link Detector} framework (aka MIME identification and repository), its
* {@link Parser} interface, its {@link LanguageIdentifier} and other goodies.
*/
public class MyFirstTika {
    public static void main(String[] args) throws Exception {
        String filename = "Test.Docx";//args[0];
        MimeTypes mimeRegistry = TikaConfig.getDefaultConfig()
                .getMimeRepository();
        System.out.println("Examining: [" + filename + "]");
        System.out.println("The MIME type (based on filename) is: ["
                + mimeRegistry.getMimeType(filename) + "]");
        System.out.println("The MIME type (based on MAGIC) is: ["
                + mimeRegistry.getMimeType(new File(filename)) + "]");
        Detector mimeDetector = (Detector) mimeRegistry;
        System.out
                .println("The MIME type (based on the Detector interface) is: ["
                        + mimeDetector.detect(new File(filename).toURI().toURL()
                                .openStream(), new Metadata()) + "]");
        LanguageIdentifier lang = new LanguageIdentifier(new LanguageProfile(
                FileUtils.readFileToString(new File(filename))));
        System.out.println("The language of this content is: ["
                + lang.getLanguage() + "]");
        Parser parser = TikaConfig.getDefaultConfig().getParser(
                MediaType.parse(mimeRegistry.getMimeType(filename).getName()));
        Metadata parsedMet = new Metadata();
        ContentHandler handler = new BodyContentHandler();
        parser.parse(new File(filename).toURI().toURL().openStream(), handler,
                parsedMet, new ParseContext());
        System.out.println("Parsed Metadata: ");
        System.out.println(parsedMet);
        System.out.println("Parsed Text: ");
        System.out.println(handler.toString());
    }
}
Although it does compile, I'm not surprised that there's a runtime error:
run:
Examining: [Test.Docx]
The MIME type (based on filename) is: [application/vnd.openxmlformats-officedocument.wordprocessingml.document]
The MIME type (based on MAGIC) is: [application/vnd.openxmlformats-officedocument.wordprocessingml.document]
The MIME type (based on the Detector interface) is: [application/octet-stream]
The language of this content is: [lt]
Exception in thread "main" org.apache.tika.exception.TikaException: Error creating OOXML extractor
at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:123)
at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:82)
at org.apache.tika.example.MyFirstTika.main(MyFirstTika.java:56)
Caused by: org.apache.poi.openxml4j.exceptions.InvalidFormatException: Package should contain a content type part [M1.13]
at org.apache.poi.openxml4j.opc.ZipPackage.getPartsImpl(ZipPackage.java:203)
at org.apache.poi.openxml4j.opc.OPCPackage.getParts(OPCPackage.java:684)
at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:275)
at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:73)
... 2 more
Java Result: 1
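A note on the trace above: the POI message "Package should contain a content type part" usually indicates that the file could not be read as an OOXML package, that is, as a ZIP archive containing an entry named [Content_Types].xml. The application/octet-stream result from the stream-only Detector call points the same way, since a genuine .docx starts with ZIP magic bytes. A plain-text file saved under a .docx name would plausibly produce both symptoms, though that is an assumption about this particular file. A minimal sketch of a pre-check using only the JDK; OoxmlCheck and looksLikeOoxml are hypothetical names, not part of Tika or POI:

```java
import java.io.File;
import java.io.IOException;
import java.util.zip.ZipException;
import java.util.zip.ZipFile;

/** Hypothetical pre-check, not part of Tika or POI. */
final class OoxmlCheck {

    /** Returns true if the file opens as a ZIP archive and has a [Content_Types].xml entry. */
    static boolean looksLikeOoxml(File file) throws IOException {
        try (ZipFile zip = new ZipFile(file)) {
            return zip.getEntry("[Content_Types].xml") != null;
        } catch (ZipException e) {
            return false; // the file is not a ZIP archive at all
        }
    }

    public static void main(String[] args) throws IOException {
        // "Test.Docx" is the file name used in the question; pass another path as args[0]
        File file = new File(args.length > 0 ? args[0] : "Test.Docx");
        System.out.println(file + " looks like OOXML: " + looksLikeOoxml(file));
    }
}
```

If this prints false for a file named .docx, re-saving the document from Word in the real .docx format (or giving the file an extension that matches its actual content) is the likely fix.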
Because of the following error, I supplied the file that it opens, Test.doc, which has 3 lines that read 'Testing'.
Exception in thread "main" java.io.FileNotFoundException: C:\Users\Dov\Google Drive\NetBeansProjects\tika-example\tikaExample\Test.Doc (The system cannot find the file specified)
In the folder C:\Users\Dov\Downloads\tika-1.9-src\tika-1.9\tika-example I found spring.xml and pom.xml, but I don't know what to do with them.
spring.xml:
<?xml version="1.0" encoding="UTF-8"?>
<beans xmlns="http://www.springframework.org/schema/beans"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.springframework.org/schema/beans
http://www.springframework.org/schema/beans/spring-beans-3.0.xsd">
<!--<start id="spring"/>-->
<bean id="tika" class="org.apache.tika.parser.AutoDetectParser">
<constructor-arg>
<list>
<ref bean="txt"/>
<ref bean="pdf"/>
</list>
</constructor-arg>
</bean>
<bean id="txt" class="org.apache.tika.parser.txt.TXTParser"/>
<bean id="pdf" class="org.apache.tika.parser.pdf.PDFParser"/>
<!--<end id="spring"/>-->
</beans>
pom.xml:
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<parent>
<artifactId>tika-parent</artifactId>
<groupId>org.apache.tika</groupId>
<version>1.9</version>
<relativePath>../tika-parent/pom.xml</relativePath>
</parent>
<modelVersion>4.0.0</modelVersion>
<artifactId>tika-example</artifactId>
<name>Apache Tika examples</name>
<url>http://tika.apache.org/</url>
<description>This module contains examples of how to use Apache Tika.</description>
<organization>
<name>The Apache Software Foundation</name>
<url>http://www.apache.org</url>
</organization>
<scm>
<url>http://svn.apache.org/viewvc/tika/tags/1.9-rc2/tika-example</url>
<connection>scm:svn:http://svn.apache.org/repos/asf/tika/tags/1.9-rc2/tika-example</connection>
<developerConnection>scm:svn:https://svn.apache.org/repos/asf/tika/tags/1.9-rc2/tika-example</developerConnection>
</scm>
<issueManagement>
<system>JIRA</system>
<url>https://issues.apache.org/jira/browse/TIKA</url>
</issueManagement>
<ciManagement>
<system>Jenkins</system>
<url>https://builds.apache.org/job/Tika-trunk/</url>
</ciManagement>
<!-- List of dependencies that we depend on for the examples. See the full list of Tika
modules and how to use them at http://mvnrepository.com/artifact/org.apache.tika.-->
<dependencies>
<dependency>
<groupId>org.apache.tika</groupId>
<artifactId>tika-app</artifactId>
<version>${project.version}</version>
<exclusions>
<exclusion>
<artifactId>tika-parsers</artifactId>
<groupId>org.apache.tika</groupId>
</exclusion>
</exclusions>
</dependency>
<dependency>
<groupId>org.apache.tika</groupId>
<artifactId>tika-parsers</artifactId>
<version>${project.version}</version>
</dependency>
<dependency>
<groupId>org.apache.tika</groupId>
<artifactId>tika-serialization</artifactId>
<version>${project.version}</version>
</dependency>
<dependency>
<groupId>org.apache.tika</groupId>
<artifactId>tika-translate</artifactId>
<version>${project.version}</version>
</dependency>
<dependency>
<groupId>org.apache.tika</groupId>
<artifactId>tika-parsers</artifactId>
<version>${project.version}</version>
<type>test-jar</type>
<scope>test</scope>
</dependency>
<dependency>
<groupId>javax.jcr</groupId>
<artifactId>jcr</artifactId>
<version>2.0</version>
</dependency>
<dependency>
<groupId>org.apache.jackrabbit</groupId>
<artifactId>jackrabbit-jcr-server</artifactId>
<version>2.3.6</version>
</dependency>
<dependency>
<groupId>org.apache.jackrabbit</groupId>
<artifactId>jackrabbit-core</artifactId>
<version>2.3.6</version>
</dependency>
<dependency>
<groupId>org.apache.lucene</groupId>
<artifactId>lucene-core</artifactId>
<version>3.5.0</version>
</dependency>
<dependency>
<groupId>commons-io</groupId>
<artifactId>commons-io</artifactId>
<version>2.4</version>
</dependency>
<dependency>
<groupId>org.springframework</groupId>
<artifactId>spring-context</artifactId>
<version>3.0.2.RELEASE</version>
</dependency>
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<scope>test</scope>
</dependency>
</dependencies>
</project>
Any help with the error, or with what to do with the xml files in Netbeans to get the tika-example program working, would be appreciated.
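I figured out how to use it artfully. I got it to give correct output on whether .DOC, XLSX, and .PDF files contain a given string, so the two xml files evidently aren't needed. (Use the imports from the original question.)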
// Needed in addition to the imports from the original question:
import java.io.IOException;
import java.net.MalformedURLException;
import org.apache.tika.exception.TikaException;
import org.apache.tika.mime.MimeTypeException;
import org.xml.sax.SAXException;

public class MyFirstTika {

    /** Returns true if the text Tika extracts from the file contains s, ignoring case. */
    public static boolean contains(File file, String s) throws MalformedURLException,
            IOException, MimeTypeException, SAXException, TikaException {
        ContentHandler handler = new BodyContentHandler();
        MimeTypes mimeRegistry = TikaConfig.getDefaultConfig().getMimeRepository();
        // Pick the parser from the MIME type detected for the file
        Parser parser = TikaConfig.getDefaultConfig().getParser(
                MediaType.parse(mimeRegistry.getMimeType(file).getName()));
        Metadata parsedMet = new Metadata();
        parser.parse(file.toURI().toURL().openStream(), handler, parsedMet, new ParseContext());
        System.out.println("Handler:\n\n******" + handler + "\n\n*****");
        // Case-insensitive substring test on the extracted body text
        return handler.toString().toLowerCase().contains(s.toLowerCase());
    }

    public static void main(String[] args) throws Exception {
        String searchString = "champion";
        String filename = "schedule.pdf"; //test.docx";//"meds.xlsx";//Test2.Doc";
        File file = new File(filename);
        System.out.println(file + " contains " + searchString + ": "
                + contains(file, searchString));
    }
}
Sample output:
Handler:
******
DUBLIN YOUTH ATHLETICS
Game Schedule 2014-2015
Girls 6th-8th Grade League
Dream
Game Day Date Gym Time Home (White) Visitor (Green)
1 Sunday 12/7/2014 Sells 4:00 PM Dream Sparks
7 Sunday 12/14/2014 Sells 2:00 PM Fever Dream
13 Sunday 1/4/2015 Sells 6:00 PM Stars Dream
Championship 3/8/2015
*****
schedule.pdf contains champion: true
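One caveat with the plain contains check: extracted text can keep the line breaks and spacing of the source layout, so a multi-word phrase such as private protected may straddle a line break and be missed. Below is a minimal sketch of a whitespace-tolerant check. It assumes the text has already been obtained from handler.toString() and uses only the JDK; PhraseMatcher, normalize and containsPhrase are hypothetical names, not part of Tika.

```java
import java.util.Locale;

/** Hypothetical helper: phrase search over text that has already been extracted. */
final class PhraseMatcher {

    /** Collapses every run of whitespace (line breaks included) to one space and lower-cases. */
    static String normalize(String text) {
        return text.trim().replaceAll("\\s+", " ").toLowerCase(Locale.ROOT);
    }

    /** Returns true if the normalized text contains the normalized phrase. */
    static boolean containsPhrase(String extractedText, String phrase) {
        String needle = normalize(phrase);
        if (needle.isEmpty()) {
            return false; // design choice: an empty phrase never matches
        }
        return normalize(extractedText).contains(needle);
    }

    public static void main(String[] args) {
        // Stand-in for handler.toString(); note the line break inside the phrase
        String extracted = "Championship 3/8/2015\nprivate\n   protected int x;";
        System.out.println(containsPhrase(extracted, "champion"));          // true
        System.out.println(containsPhrase(extracted, "private protected")); // true
        System.out.println(containsPhrase(extracted, "public static"));     // false
    }
}
```

In contains(File, String), the last line could then become return PhraseMatcher.containsPhrase(handler.toString(), s); under the same assumptions.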