Unable to extract content directly from a scanned PDF using Apache Tika, but it works fine after converting to JPG
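I am unable to extract content from the image-only PDF attached below, but extraction works fine when I convert it to JPG. My problem is that I have a large number of scanned PDFs, each containing multiple scanned pages. I would like to know whether there is a direct way to extract the content, without the overhead of converting each PDF to JPG and then extracting the text. I followed the solution provided in link.
The PDF version of the document is pdfversion.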
My environment: java version "1.8.0_112", tesseract 3.04.01, leptonica-1.74.1,
libjpeg 8d : libpng 1.6.28 : libtiff 4.0.7 : zlib 1.2.8
My pom.xml has the following dependencies:
<dependencies>
    <dependency>
        <groupId>net.sourceforge.tess4j</groupId>
        <artifactId>tess4j</artifactId>
        <version>3.0.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.tika</groupId>
        <artifactId>tika-core</artifactId>
        <version>1.14</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.tika/tika-parsers -->
    <dependency>
        <groupId>org.apache.tika</groupId>
        <artifactId>tika-parsers</artifactId>
        <version>1.14</version>
    </dependency>
    <dependency>
        <groupId>commons-io</groupId>
        <artifactId>commons-io</artifactId>
        <version>2.5</version>
    </dependency>
    <dependency>
        <groupId>com.github.jai-imageio</groupId>
        <artifactId>jai-imageio-core</artifactId>
        <version>1.3.1</version>
    </dependency>
    <dependency>
        <groupId>net.java.dev.jna</groupId>
        <artifactId>jna</artifactId>
        <version>4.2.2</version>
    </dependency>
    <dependency>
        <groupId>log4j</groupId>
        <artifactId>log4j</artifactId>
        <version>1.2.11</version>
    </dependency>
    <dependency>
        <groupId>com.levigo.jbig2</groupId>
        <artifactId>levigo-jbig2-imageio</artifactId>
        <version>1.6.5</version>
    </dependency>
</dependencies>
Java code:
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.ocr.TesseractOCRConfig;
import org.apache.tika.parser.pdf.PDFParserConfig;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

public class Sample {
    public static void main(String[] args)
            throws IOException, TikaException, SAXException {
        Parser parser = new AutoDetectParser();
        BodyContentHandler handler = new BodyContentHandler(Integer.MAX_VALUE);

        TesseractOCRConfig config = new TesseractOCRConfig();
        config.setTesseractPath("/usr/local/bin/");

        PDFParserConfig pdfConfig = new PDFParserConfig();
        pdfConfig.setExtractInlineImages(true);
        pdfConfig.setExtractUniqueInlineImagesOnly(false);

        ParseContext parseContext = new ParseContext();
        parseContext.set(TesseractOCRConfig.class, config);
        parseContext.set(PDFParserConfig.class, pdfConfig);
        parseContext.set(Parser.class, parser);

        FileInputStream stream = new FileInputStream(new File("path2pdf.pdf"));
        Metadata metadata = new Metadata();
        parser.parse(stream, handler, metadata, parseContext);
        System.out.println(metadata);
        String content = handler.toString();
        System.out.println("===============");
        System.out.println(content);
        System.out.println("Done");
    }
}
But it does not work. If I am doing something wrong here, please advise.
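The problem appears to be that, unless the parameter is configured explicitly, Tika invokes tesseract (once it has verified that the binary exists and is executable) without setting the location of the tessdata directory in the environment. That default may work for some installations, but it did not on my Mac. The paths can be set explicitly as follows: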
TesseractOCRConfig config = new TesseractOCRConfig();
config.setTesseractPath("/usr/local/bin");
config.setTessdataPath("/usr/local/share");
This then produces the expected result (at least on Mac OS X with tesseract installed via Homebrew):
1 An Introduction to Conditional Random Fields for Relational Learning
Charles Sutton
Department of Computer Science University of Massachusetts, USA
casutton-@cs.umass.edu http://www.cs.umass.edu/~casutton
Andrew McCallum
Department of Computer Science University of Massachusetts, USA
mccallum@cs.umass.edu http://www.cs.umass.edu/~mccallum
1.1 Introduction
Relational data has two characteristics: first, statistical
dependencies exist between the entities we wish to model, and second,
each entity often has a rich set of features that can aid
classification. For example, when classifying Web documents. the page’s
text provides much information about the class label. but hyperlinks
define a relationship between pages that can improve classification
[Taskar et al.. 2002]. Graphical models are a natural formalism for
exploiting the dependence structure among entities. Traditionally,
graphical models have been used to represent the joint probability
distribution p(y, x), where the variables y represent the attributes
of the entities that we wish to predict, and the input variables x
represent our observed knowledge about the entities. But modeling the
joint distribution can lead to difficulties when using the rich local
features that can occur in relational data. because it requires
modeling the distribution p(x), which can include complex
dependencies. Modeling these dependencies among inputs can lead to
intractable models, but ignoring them can lead to reduced performance.
A solution to this problem is to directly model the conditional
distribution p(y]x), which is sufficient for classification. This is the
approach taken by conditional ran- dom fields [Lafferty ct al., 2001].
A conditional random field is simply a conditional distribution p(ylx)
with an associated graphical structure. Because the model is
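If you are not sure which directory to pass to setTessdataPath on a given machine, something like the following might help. It is only a rough sketch using the plain JDK, not Tika's own code: the class name TessdataLocator, both helper methods, and every candidate directory other than /usr/local/share are made up for illustration. It assumes the tesseract 3.x convention that TESSDATA_PREFIX names the parent of the tessdata directory, which matches the /usr/local/share value above.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

// Hypothetical helper, not part of Tika or tess4j.
public class TessdataLocator {

    /** Returns the first candidate directory that contains a "tessdata" subdirectory. */
    public static Optional<Path> findTessdataParent(List<Path> candidates) {
        for (Path candidate : candidates) {
            if (Files.isDirectory(candidate.resolve("tessdata"))) {
                return Optional.of(candidate);
            }
        }
        return Optional.empty();
    }

    /**
     * Builds (but does not start) a tesseract command whose environment
     * carries TESSDATA_PREFIX explicitly, so the child process does not
     * depend on a compiled-in default location.
     */
    public static ProcessBuilder tesseractCommand(Path binDir, Path tessdataParent, String... args) {
        List<String> command = new ArrayList<>();
        command.add(binDir.resolve("tesseract").toString());
        command.addAll(Arrays.asList(args));
        ProcessBuilder pb = new ProcessBuilder(command);
        pb.environment().put("TESSDATA_PREFIX", tessdataParent.toString());
        return pb;
    }

    public static void main(String[] args) {
        // Only /usr/local/share comes from the answer above;
        // the other two locations are guesses for other installations.
        List<Path> candidates = Arrays.asList(
                Paths.get("/usr/local/share"),
                Paths.get("/opt/homebrew/share"),
                Paths.get("/usr/share"));

        Optional<Path> parent = findTessdataParent(candidates);
        if (parent.isPresent()) {
            System.out.println("Value for setTessdataPath: " + parent.get());
        } else {
            System.out.println("No tessdata directory found in the candidate locations");
        }
    }
}
```

The directory it prints is what you would hand to config.setTessdataPath(...). The tesseractCommand method is there only to show what "specifying the tessdata location in the environment" means for a child process; with Tika you would set the path on TesseractOCRConfig rather than launch tesseract yourself.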