Unable to extract content directly from a scanned PDF using Apache Tika, but it works fine when converted to JPG format

Question

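I am unable to extract content from the image-only PDF attached below, but it works fine when I convert it to JPG format. My problem is that I have a large number of scanned PDFs, each containing multiple scanned pages. I would like to know whether there is a direct way to extract the content, without the overhead of converting each PDF to JPG and then extracting the text. I followed the solution provided in link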

The PDF version of the document is pdfversion

My environment: java version "1.8.0_112", tesseract 3.04.01, leptonica-1.74.1, libjpeg 8d : libpng 1.6.28 : libtiff 4.0.7 : zlib 1.2.8

pom.xml has

<dependencies>
    <dependency>
        <groupId>net.sourceforge.tess4j</groupId>
        <artifactId>tess4j</artifactId>
        <version>3.0.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.tika</groupId>
        <artifactId>tika-core</artifactId>
        <version>1.14</version>
    </dependency>
    <dependency>
        <groupId>org.apache.tika</groupId>
        <artifactId>tika-parsers</artifactId>
        <version>1.14</version>
    </dependency>
    <dependency>
        <groupId>commons-io</groupId>
        <artifactId>commons-io</artifactId>
        <version>2.5</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.tika/tika-parsers -->
    <dependency>
        <groupId>com.github.jai-imageio</groupId>
        <artifactId>jai-imageio-core</artifactId>
        <version>1.3.1</version>
    </dependency>
    <dependency>
        <groupId>net.java.dev.jna</groupId>
        <artifactId>jna</artifactId>
        <version>4.2.2</version>
    </dependency>
    <dependency>
        <groupId>log4j</groupId>
        <artifactId>log4j</artifactId>
        <version>1.2.11</version>
    </dependency>
    <dependency>
        <groupId>com.levigo.jbig2</groupId>
        <artifactId>levigo-jbig2-imageio</artifactId>
        <version>1.6.5</version>
    </dependency>

</dependencies>

Java code

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.ocr.TesseractOCRConfig;
import org.apache.tika.parser.pdf.PDFParserConfig;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
public class Sample {
    public static void main(String[] args)
            throws IOException, TikaException, SAXException {
        Parser parser = new AutoDetectParser();
        BodyContentHandler handler = new BodyContentHandler(Integer.MAX_VALUE);
        TesseractOCRConfig config = new TesseractOCRConfig();
        config.setTesseractPath("/usr/local/bin/");
        PDFParserConfig pdfConfig = new PDFParserConfig();
        pdfConfig.setExtractInlineImages(true);
        pdfConfig.setExtractUniqueInlineImagesOnly(false);
        ParseContext parseContext = new ParseContext();
        parseContext.set(TesseractOCRConfig.class, config);
        parseContext.set(PDFParserConfig.class, pdfConfig);
        parseContext.set(Parser.class, parser);
        Metadata metadata = new Metadata();
        // try-with-resources closes the stream even if parsing throws
        try (FileInputStream stream = new FileInputStream(new File("path2pdf.pdf"))) {
            parser.parse(stream, handler, metadata, parseContext);
        }
        System.out.println(metadata);
        String content = handler.toString();
        System.out.println("===============");
        System.out.println(content);
        System.out.println("Done");
    }
}

But it does not work. If I am doing something wrong here, please advise.

Answer 1

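The problem seems to be that, unless the parameter is configured explicitly, Tika invokes tesseract (once it has verified that the binary exists and is executable) without setting the location of the tessdata directory in the environment. The default may work for some installations, but it did not on my Mac. The path can be set explicitly as follows: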

      TesseractOCRConfig config = new TesseractOCRConfig();
      config.setTesseractPath("/usr/local/bin");
      config.setTessdataPath("/usr/local/share");

This then produces the expected result (at least on MacOS X with tesseract installed via Homebrew):

1 An Introduction to Conditional Random Fields for Relational Learning

Charles Sutton

Department of Computer Science University of Massachusetts, USA casutton-@cs.umass.edu http://www.cs.umass.edu/~casutton

Andrew McCallum

Department of Computer Science University of Massachusetts, USA mccallum@cs.umass.edu http://www.cs.umass.edu/~mccallum

1.1 Introduction

Relational data has two characteristics: ﬁrst, statistical dependencies exist between the entities we wish to model, and second, each entity often has a rich set of features that can aid classiﬁcation. For example, when classifying Web documents. the page’s text provides much information about the class label. but hyperlinks deﬁne a relationship between pages that can improve classiﬁcation [Taskar et al.. 2002]. Graphical models are a natural formalism for exploiting the dependence structure among entities. Traditionally, graphical models have been used to represent the joint probability distribution p(y, x), where the variables y represent the attributes of the entities that we wish to predict, and the input variables x represent our observed knowledge about the entities. But modeling the joint distribution can lead to difﬁculties when using the rich local features that can occur in relational data. because it requires modeling the distribution p(x), which can include complex dependencies. Modeling these dependencies among inputs can lead to intractable models, but ignoring them can lead to reduced performance.

A solution to this problem is to directly model the conditional distribution p(y]x), which is sufﬁcient for classiﬁcation. This is the approach taken by conditional ran- dom ﬁelds [Lafferty ct al., 2001]. A conditional random ﬁeld is simply a conditional distribution p(ylx) with an associated graphical structure. Because the model is
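Since the fix depends on pointing Tika at the right tessdata location, it may help to check that location before parsing. The following is a minimal sketch using only the JDK, not part of Tika or Tess4J: the class name `TessdataCheck`, the method `findTessdataPrefix`, and the candidate directories are all hypothetical. It assumes the Tesseract 3.x layout implied by the answer, where the configured path is the parent of a `tessdata` directory holding `<lang>.traineddata` files.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Optional;

/** Hypothetical helper; not part of Tika or Tess4J. */
public class TessdataCheck {

    /**
     * Returns the first candidate prefix that contains
     * tessdata/<lang>.traineddata, or empty if none does.
     * Assumes the Tesseract 3.x layout (prefix is the parent of "tessdata").
     */
    public static Optional<Path> findTessdataPrefix(String lang, Path... candidates) {
        for (Path prefix : candidates) {
            Path trained = prefix.resolve("tessdata").resolve(lang + ".traineddata");
            if (Files.isReadable(trained)) {
                return Optional.of(prefix);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Candidate locations are guesses for common installs; adjust as needed.
        Optional<Path> prefix = findTessdataPrefix("eng",
                Paths.get("/usr/local/share"),
                Paths.get("/usr/share"),
                Paths.get("/usr/share/tesseract-ocr"));

        if (prefix.isPresent()) {
            System.out.println("Use this value for setTessdataPath: " + prefix.get());
        } else {
            System.out.println("No eng.traineddata found in the candidate locations");
        }
    }
}
```

If a prefix is found, its string form can be passed to `config.setTessdataPath(...)` as in the snippet above. If none is found, a missing or misplaced language file is a likely reason for an empty OCR result.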

pdfbox

apache-tika