Need to extract text from any binary file using Java

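How can I parse the contents of a binary file in Java and extract the text from it? I need this so that I can index the contents of binary files with Lucene. The file types I currently support are PDF, HTML, Word, Excel, and PPT.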

You could try Apache Tika:

The Apache Tika™ toolkit detects and extracts metadata and text from over a thousand different file types (such as PPT, XLS, and PDF). All of these file types can be parsed through a single interface, making Tika useful for search engine indexing, content analysis, translation, and much more.
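
As a rough illustration of that single interface, here is a minimal sketch using the `org.apache.tika.Tika` facade. It assumes both `tika-core` and a Tika parsers artifact are on the classpath; without the parsers, little or no text is likely to come back. The class name `TikaTextExtractor` and the helper `extractText` are made up for this example and are not part of Tika.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;

import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;

// Hypothetical helper class for illustration; not part of Tika itself.
class TikaTextExtractor {

    // Extracts plain text from a file, letting Tika detect the format.
    static String extractText(Path path) throws IOException, TikaException {
        Tika tika = new Tika();
        // parseToString truncates long output by default;
        // -1 removes that limit (an assumption that full text is wanted).
        tika.setMaxStringLength(-1);
        return tika.parseToString(path.toFile());
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: TikaTextExtractor <file>");
            return;
        }
        File file = Paths.get(args[0]).toFile();

        // Report the detected media type, e.g. for a PDF or Word file.
        System.out.println("Detected type: " + new Tika().detect(file));

        // Print the extracted text.
        System.out.println(extractText(file.toPath()));
    }
}
```

The string returned by `extractText` is what you would then hand to Lucene as the content of a text field when building the index document.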