Tesseract - ERROR net.sourceforge.tess4j.Tesseract - null

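I created a Java application that uses Tesseract to convert a given image or PDF to a string. When I run it on my machine as a JUnit unit test it works great, but when I run the whole system, where a RESTful API running on Tomcat receives the image and runs Tesseract, it gives me the following error: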

23:22:36.511 [http-nio-9999-exec-3] ERROR net.sourceforge.tess4j.Tesseract - null
java.lang.NullPointerException: null
    at net.sourceforge.tess4j.util.PdfUtilities.convertPdf2Png(PdfUtilities.java:107)
    at net.sourceforge.tess4j.util.PdfUtilities.convertPdf2Tiff(PdfUtilities.java:48)
    at net.sourceforge.tess4j.util.ImageIOHelper.getIIOImageList(ImageIOHelper.java:343)
    at net.sourceforge.tess4j.Tesseract.doOCR(Tesseract.java:213)
    at net.sourceforge.tess4j.Tesseract.doOCR(Tesseract.java:197)
    at ocr.OcrUtil.getString(OcrUtil.java:54)
    at com.tapd.server.api.handlers.IRSHandler.uploadIRSImage(IRSHandler.java:65)
    at com.tapd.server.api.WebAPIService.updateParentIrsForm(WebAPIService.java:250)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
    at java.lang.reflect.Method.invoke(Unknown Source)
    at org.glassfish.jersey.server.model.internal.ResourceMethodInvocationHandlerFactory.invoke(ResourceMethodInvocationHandlerFactory.java:81)
    at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.run(AbstractJavaResourceMethodDispatcher.java:144)
    at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.invoke(AbstractJavaResourceMethodDispatcher.java:161)
    at org.glassfish.jersey.server.model.internal.JavaResourceMethodDispatcherProvider$ResponseOutInvoker.doDispatch(JavaResourceMethodDispatcherProvider.java:160)
    at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.dispatch(AbstractJavaResourceMethodDispatcher.java:99)
    at org.glassfish.jersey.server.model.ResourceMethodInvoker.invoke(ResourceMethodInvoker.java:389)
    at org.glassfish.jersey.server.model.ResourceMethodInvoker.apply(ResourceMethodInvoker.java:347)
    at org.glassfish.jersey.server.model.ResourceMethodInvoker.apply(ResourceMethodInvoker.java:102)
    at org.glassfish.jersey.server.ServerRuntime.run(ServerRuntime.java:309)
    at org.glassfish.jersey.internal.Errors.call(Errors.java:271)
    at org.glassfish.jersey.internal.Errors.call(Errors.java:267)
    at org.glassfish.jersey.internal.Errors.process(Errors.java:315)
    at org.glassfish.jersey.internal.Errors.process(Errors.java:297)
    at org.glassfish.jersey.internal.Errors.process(Errors.java:267)
    at org.glassfish.jersey.process.internal.RequestScope.runInScope(RequestScope.java:317)
    at org.glassfish.jersey.server.ServerRuntime.process(ServerRuntime.java:292)
    at org.glassfish.jersey.server.ApplicationHandler.handle(ApplicationHandler.java:1139)
    at org.glassfish.jersey.servlet.WebComponent.service(WebComponent.java:460)
    at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:386)
    at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:334)
    at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:221)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:230)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:165)
    at org.apache.tomcat.websocket.server.WsFilter.doFilter(WsFilter.java:52)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:192)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:165)
    at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:198)
    at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:108)
    at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:522)
    at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:140)
    at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:79)
    at org.apache.catalina.valves.AbstractAccessLogValve.invoke(AbstractAccessLogValve.java:620)
    at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:87)
    at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:349)
    at org.apache.coyote.http11.Http11Processor.service(Http11Processor.java:1110)
    at org.apache.coyote.AbstractProcessorLight.process(AbstractProcessorLight.java:66)
    at org.apache.coyote.AbstractProtocol$ConnectionHandler.process(AbstractProtocol.java:785)
    at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.doRun(NioEndpoint.java:1425)
    at org.apache.tomcat.util.net.SocketProcessorBase.run(SocketProcessorBase.java:49)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
    at java.lang.Thread.run(Unknown Source)
[2016-09-14 23:22:36,512] [ERROR] java.lang.NullPointerException

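My guess is that the tessdata folder is not in the right place: when the application is packaged into a JAR and run on Tomcat, the folder ends up in the wrong location. But I don't know where it should be, and I have double-checked that all the JARs are deployed correctly.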

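Edit: It looks like Tesseract can't handle the path when the file is on a remote server (such as AWS S3). So the question is why, and how can I let it use a path from S3? (Yes, the file is public.)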

My guess is that a GhostscriptException is not being logged correctly, and that this is what causes the NullPointerException:

https://github.com/nguyenq/tess4j/blob/212d72bc2ec8b3a4d4f5a18f1eb01a0622fc5521/src/main/java/net/sourceforge/tess4j/util/PdfUtilities.java#L107

106        } catch (GhostscriptException e) {
107            logger.error(e.getCause().toString(), e);
108        } finally {

Line 107: e.getCause() is (probably) null, and calling toString() on null throws an NPE.

(From the specification, getCause() may return null: https://docs.oracle.com/javase/7/docs/api/java/lang/Throwable.html#getCause(). GhostscriptException also allows the cause to be null: http://grepcode.com/file/repo1.maven.org/maven2/org.ghost4j/ghost4j/1.0.0/org/ghost4j/GhostscriptException.java)

To verify this answer (without recompiling all of tess4j), you can start the program in debug mode and put a breakpoint on line 107. That will give you information about the real exception.
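The failure mode on line 107 can be reproduced with plain Java. The sketch below is not tess4j code: a plain `Exception` stands in for a `GhostscriptException` created without a cause, and `describe` is a hypothetical helper showing one null-safe way to build the log message.

```java
final class SafeCause {
    /**
     * Hypothetical helper: describes the cause when there is one,
     * otherwise falls back to the exception itself.
     */
    static String describe(Throwable e) {
        Throwable cause = e.getCause();
        return (cause != null ? cause : e).toString();
    }

    public static void main(String[] args) {
        // Stand-in for a GhostscriptException constructed without a cause.
        Exception noCause = new Exception("Ghostscript failed");

        // getCause() is null here, so noCause.getCause().toString() would throw an NPE.
        System.out.println(noCause.getCause() == null); // prints "true"

        // The null-safe variant still reports the real failure.
        System.out.println(describe(noCause)); // prints "java.lang.Exception: Ghostscript failed"

        Exception withCause = new Exception("wrapper", new IllegalStateException("bad input"));
        System.out.println(describe(withCause)); // prints "java.lang.IllegalStateException: bad input"
    }
}
```

With a guard like this, the log would show the original Ghostscript error rather than an NPE that hides it.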

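As @Piotr R mentioned, the error is that ghostscriptException.getCause() is null. The reason is that the path configured in the File object passed to Tesseract is not a valid path. Tesseract's definition of "valid" is a bit different from yours: it treats only local paths as valid, so pointing it at a file on AWS S3 throws an error even when the file is public. The solution is to save the file locally and delete it once Tesseract has finished.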
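A minimal sketch of that workaround, using only the JDK. The class `LocalCopyOcr`, the method `ocrRemote`, and the `OcrStep` callback are names made up for illustration; the callback stands in for the actual OCR call so the example does not depend on tess4j. The `suffix` parameter is there on the assumption that the OCR library looks at the file extension (for example `.pdf`) to decide how to read the input.

```java
import java.io.File;
import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

final class LocalCopyOcr {
    /** Hypothetical callback standing in for the OCR call on a local file. */
    interface OcrStep {
        String run(File localFile) throws Exception;
    }

    /**
     * Downloads the remote file to a temporary local file, runs the OCR step
     * on that local copy, and deletes the copy afterwards, even on failure.
     */
    static String ocrRemote(URL source, String suffix, OcrStep ocr) throws Exception {
        Path tmp = Files.createTempFile("ocr-input-", suffix);
        try {
            try (InputStream in = source.openStream()) {
                // createTempFile already created the file, so overwrite it.
                Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
            }
            return ocr.run(tmp.toFile());
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

Assuming `tesseract` is an already configured `net.sourceforge.tess4j.Tesseract` instance and `s3Url` is the public URL of the file, a call could look like `LocalCopyOcr.ocrRemote(new URL(s3Url), ".pdf", file -> tesseract.doOCR(file))`.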

What I used: Windows 10 (also tried on Windows Server 2016), Java, Maven

Status: works fine both locally and on a VM

  1. Download Tess4J-3.4.8 from http://tess4j.sourceforge.net/ and set your ENV variable path under Advanced System Settings.
  2. Get the dependencies from Maven:
    <dependency>
        <groupId>net.sourceforge.tess4j</groupId>
        <artifactId>tess4j</artifactId>
        <version>4.5.1</version>
    </dependency>
    <dependency>
        <groupId>org.ghost4j</groupId>
        <artifactId>ghost4j</artifactId>
        <version>1.0.1</version>
    </dependency>
    <dependency>
        <groupId>net.sourceforge.lept4j</groupId>
        <artifactId>lept4j</artifactId>
        <version>1.7.0</version>
    </dependency>
  3. Get libtesseract302.dll from http://api.256file.com/libtesseract302.dll/en-download-56466.html and copy it to the "C:\Windows\System32" folder. Don't forget to set your ENV variable path under Advanced System Settings.
  4. Download and install the Visual C++ 2015 Redistributable or the VC++ 2017 Redistributable (I installed both) from https://programmer.help/blogs/net.sourceforge.tess4j.tesseractexception-java.lang.nullpointerexception.html

    Then restart your computer.
  5. If you don't have the JAR files locally, it is safer to keep copies of them on the machine (see the picture).

    Don't forget to set the ENV variable path for the JARs under Advanced System Settings.