Tess4j OcrEngineMode CUBE ONLY: Invalid memory access
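I need to scan images (TIFF files) to read numbers from them. With Tess4J set to the default engine, it often confuses 6 with 5, 0 with 9, and so on, so I would like to try the CUBE ONLY engine.
This is my config file: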
tessedit_ocr_engine_mode 2
load_system_dawg F
load_freq_dawg F
load_punc_dawg F
load_number_dawg F
load_unambig_dawg F
load_bigram_dawg F
load_fixed_length_dawgs F
user_words_suffix user-words
user_patterns_suffix user-patterns
This is my Java code:
public class App {
    public static final String NUMBERS = "Oo0123456789";
    public static final String TESSDATA_PATH_FOLDER = "D:/compuwork/ambienti/workspace_mars/ocrmaven/tessdata";
    private static final Logger logger = LoggerFactory.getLogger(new LoggHelper().toString());
    Tesseract1 instance;
    String nomeCartella="005";
    String path = "D:\\Documenti\\OCR\\scansioni\\"+nomeCartella;
    String resultPath = "D:\\Documenti\\OCR\\RISULTATI\\"+nomeCartella;
    String resultCorrettiPath = resultPath+"\\"+"corretti";
    String resultErratiPath = resultPath+"\\"+"errati";
    String tmpPath = resultPath+"\\tmpImmagine";
    String anotherCopy = resultPath+"\\"+"ad";
    String preScanPath = resultPath+"\\prescan";
    int validi = 0;
    int nonValidi = 0;

    public static void main( String[] args ) {
        try {
            //write my dictionary
            File fileDir = new File(TESSDATA_PATH_FOLDER+"\\"+"eng.user-words");
            Writer out = new BufferedWriter(new OutputStreamWriter(
                new FileOutputStream(fileDir), "UTF8"));
            for(int i=120000; i<200000;i++) {
                out.append(""+i).append("\r\n");
            }
            out.flush();
            out.close();
            //write my pattern
            fileDir = new File(TESSDATA_PATH_FOLDER+"\\"+"eng.user-patterns");
            out = new BufferedWriter(new OutputStreamWriter(
                new FileOutputStream(fileDir), "UTF8"));
            out.append("\\d\\d\\d\\d\\d\\d");
            out.flush();
            out.close();
            new App().testTesseractGlobalV();
            System.exit(0);
        } catch(Exception e) {
            e.printStackTrace();
        }
    }

    public App() {
        instance = new Tesseract1();
        instance.setLanguage("eng");
        instance.setDatapath(TESSDATA_PATH_FOLDER);
        instance.setPageSegMode(TessPageSegMode.PSM_AUTO);
        instance.setTessVariable("tessedit_char_blacklist", "èéìà§ùòç$£&%éÎÉÈ");
        instance.setTessVariable("file_type", ".tiff");
        List<String> configs = Arrays.asList("myconfig");
        instance.setConfigs(configs);
    }

    public void testTesseractGlobalV() {
        File samples = new File(path);
        //my Verify Result
        Verificator verSerieV = new Verificator();
        File outputFile = null;
        BufferedImage bi = null;
        int imgCount = 0;
        for (File imageFile : samples.listFiles()) {
            System.out.println("******* IMG "+imgCount+++" ******");
            try {
                bi = ImageIO.read(imageFile);
                verSerieV.setRegionScan(new Rectangle(verSerieV.getRegionScan().x,verSerieV.getRegionScan().y,(int)(bi.getWidth() - verSerieV.getRegionScan().x), verSerieV.getRegionScan().height));
                bi = ImageHelper.getSubImage(bi, verSerieV.getRegionScan().x, verSerieV.getRegionScan().y, verSerieV.getRegionScan().width, verSerieV.getRegionScan().height);
                Binirization binarization = new Binirization(bi);
                binarization.DoBinirization();
                BufferedImage nuovaTest = binarization.getImg();
                String nameFile = imageFile.getName();
                File mFile = new File(preScanPath+"\\"+nameFile);
                ImageIO.write(nuovaTest,"tif", mFile);
                System.out.println("scanning "+nameFile);
                String result = instance.doOCR(nuovaTest); //throw java.lang.Error
                ...
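As a side note on the file-writing part of the code above, here is a minimal sketch of the same step using java.nio and try-with-resources, which avoids hand-built "\\" separators and closes the writer even if an exception is thrown. The class name UserDictWriter and its method names are made up for illustration; only the file names eng.user-words and eng.user-patterns and the number range come from the question.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of Tess4J.
final class UserDictWriter {

    // Writes one number per line (from inclusive, to exclusive) to eng.user-words.
    static Path writeUserWords(Path tessdataDir, int from, int to) throws IOException {
        Path file = tessdataDir.resolve("eng.user-words");
        try (BufferedWriter out = Files.newBufferedWriter(file, StandardCharsets.UTF_8)) {
            for (int i = from; i < to; i++) {
                out.write(Integer.toString(i));
                out.write("\r\n"); // same line ending as the original code
            }
        }
        return file;
    }

    // Writes a single pattern line to eng.user-patterns.
    static Path writeUserPatterns(Path tessdataDir, String pattern) throws IOException {
        Path file = tessdataDir.resolve("eng.user-patterns");
        try (BufferedWriter out = Files.newBufferedWriter(file, StandardCharsets.UTF_8)) {
            out.write(pattern);
        }
        return file;
    }

    public static void main(String[] args) throws IOException {
        // The folder is an assumption: pass your own tessdata folder as the first argument.
        Path tessdata = Paths.get(args.length > 0 ? args[0] : ".");
        writeUserWords(tessdata, 120000, 200000);
        // The Java literal "\\d" ends up as \d in the file.
        writeUserPatterns(tessdata, "\\d\\d\\d\\d\\d\\d");
        System.out.println("Wrote user dictionary files to " + tessdata.toAbsolutePath());
    }
}
```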
This is the full error message:
Exception in thread "main" java.lang.Error: Invalid memory access
at net.sourceforge.tess4j.TessAPI1.TessBaseAPIInit1(Native Method)
at net.sourceforge.tess4j.Tesseract1.init(Tesseract1.java:338)
at net.sourceforge.tess4j.Tesseract1.doOCR(Tesseract1.java:247)
at net.sourceforge.tess4j.Tesseract1.doOCR(Tesseract1.java:231)
at net.sourceforge.tess4j.Tesseract1.doOCR(Tesseract1.java:212)
at net.sourceforge.tess4j.Tesseract1.doOCR(Tesseract1.java:196)
at tess4j.example.App.testTesseractGlobalV(App.java:158)
at tess4j.example.App.main(App.java:97)
init_cube_objects(true, &tessdata_manager):Error:Assert failed:in file ..\..\ccmain\tessedit.cpp, line 209
I am using Eclipse with a Maven project:
<dependency>
<groupId>net.sourceforge.tess4j</groupId>
<artifactId>tess4j</artifactId>
<version>3.0.0</version>
</dependency>
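You need to download the eng.cube.* data files and place them in the tessdata folder.
TESSDATA_PATH_FOLDER should be set to D:/compuwork/ambienti/workspace_mars/ocrmaven/ (the parent of the tessdata folder, not the tessdata folder itself).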
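To check both points before calling doOCR, something like the following sketch may help. It takes the data path (the parent folder), looks for a tessdata subfolder, and lists the files whose names start with eng.cube. in it. CubeDataCheck and listCubeFiles are made-up names, and the sketch does not know which cube files a given Tesseract build actually requires; it only shows what is present.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not part of Tess4J.
final class CubeDataCheck {

    // Returns the sorted names of files in tessdataDir that match "<lang>.cube.*".
    static List<String> listCubeFiles(Path tessdataDir, String lang) throws IOException {
        List<String> names = new ArrayList<>();
        try (DirectoryStream<Path> stream =
                 Files.newDirectoryStream(tessdataDir, lang + ".cube.*")) {
            for (Path p : stream) {
                names.add(p.getFileName().toString());
            }
        }
        Collections.sort(names);
        return names;
    }

    public static void main(String[] args) throws IOException {
        // Assumption: the first argument is the data path, i.e. the PARENT of tessdata.
        Path dataPath = Paths.get(args.length > 0 ? args[0] : ".");
        Path tessdata = dataPath.resolve("tessdata");
        if (!Files.isDirectory(tessdata)) {
            System.out.println("No tessdata folder under " + dataPath.toAbsolutePath());
            return;
        }
        List<String> cubeFiles = listCubeFiles(tessdata, "eng");
        if (cubeFiles.isEmpty()) {
            System.out.println("No eng.cube.* files found in " + tessdata);
        } else {
            System.out.println("Found cube files: " + cubeFiles);
        }
    }
}
```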
Thank you very much, nguyenq!
I have fixed my TESSDATA_PATH_FOLDER and downloaded the missing eng.cube.* files from your link.
...
It works with this myconfig:
tessedit_ocr_engine_mode 1
user_words_suffix user-words
user_patterns_suffix user-patterns
and with instance.setPageSegMode(TessPageSegMode.PSM_AUTO); removed from the App() constructor.
...
By contrast, when I use this config file:
tessedit_ocr_engine_mode 1
load_system_dawg F
load_freq_dawg F
load_punc_dawg F
load_number_dawg F
load_unambig_dawg F
load_bigram_dawg F
load_fixed_length_dawgs F
user_words_suffix user-words
user_patterns_suffix user-patterns
and remove instance.setPageSegMode(TessPageSegMode.PSM_AUTO) from the App() constructor, I get the following exception:
Exception in thread "main" java.lang.Error: Invalid memory access
at net.sourceforge.tess4j.TessAPI1.TessBaseAPIGetUTF8Text(Native Method)
at net.sourceforge.tess4j.Tesseract1.getOCRText(Tesseract1.java:402)
at net.sourceforge.tess4j.Tesseract1.doOCR(Tesseract1.java:258)
at net.sourceforge.tess4j.Tesseract1.doOCR(Tesseract1.java:231)
at net.sourceforge.tess4j.Tesseract1.doOCR(Tesseract1.java:212)
at net.sourceforge.tess4j.Tesseract1.doOCR(Tesseract1.java:196)
at tess4j.example.TestP.verify(TestP.java:96)
at tess4j.example.App.main(App.java:77)
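For readers comparing the config files in this thread: as far as I know, in Tesseract 3.x the tessedit_ocr_engine_mode values are 0 (Tesseract only), 1 (Cube only), 2 (Tesseract and Cube combined) and 3 (default), so the first config (mode 2) actually asks for the combined engine and the later ones (mode 1) for Cube only. The sketch below just encodes that mapping; EngineMode is a made-up enum for illustration, not a Tess4J type, and needsCubeData reflects my assumption that only modes 1 and 2 require the eng.cube.* files.

```java
import java.util.Optional;

// Hypothetical enum mirroring the Tesseract 3.x tessedit_ocr_engine_mode values.
enum EngineMode {
    TESSERACT_ONLY(0),
    CUBE_ONLY(1),
    TESSERACT_CUBE_COMBINED(2),
    DEFAULT(3);

    private final int configValue;

    EngineMode(int configValue) {
        this.configValue = configValue;
    }

    int configValue() {
        return configValue;
    }

    // Assumption: only the two explicit Cube modes need the <lang>.cube.* files.
    boolean needsCubeData() {
        return this == CUBE_ONLY || this == TESSERACT_CUBE_COMBINED;
    }

    // Looks up the mode for a numeric config value; empty if the value is unknown.
    static Optional<EngineMode> fromConfigValue(int value) {
        for (EngineMode mode : values()) {
            if (mode.configValue == value) {
                return Optional.of(mode);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        for (int value = 0; value <= 3; value++) {
            EngineMode mode = fromConfigValue(value).get();
            System.out.println("tessedit_ocr_engine_mode " + value + " = " + mode
                    + (mode.needsCubeData() ? " (needs cube data)" : ""));
        }
    }
}
```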