Tess-Two (Tesseract OCR in Android) shows very inaccurate results
I use the following function to perform offline OCR with Tess-Two, the Android fork of Tesseract OCR:
private String startOCR(Uri imgUri) {
    try {
        ExifInterface exif = new ExifInterface(imgUri.getPath());
        int exifOrientation = exif.getAttributeInt(ExifInterface.TAG_ORIENTATION, ExifInterface.ORIENTATION_NORMAL);
        int rotate = 0;
        switch(exifOrientation) {
            case ExifInterface.ORIENTATION_ROTATE_90:
                rotate = 90;
                break;
            case ExifInterface.ORIENTATION_ROTATE_180:
                rotate = 180;
                break;
            case ExifInterface.ORIENTATION_ROTATE_270:
                rotate = 270;
                break;
        }
        Log.d(TAG, "Rotation: " + rotate);
        BitmapFactory.Options options = new BitmapFactory.Options();
        options.inSampleSize = 4; // 1 - means max size. 4 - means maxsize/4 size. Don't use value <4, because you need more memory in the heap to store your data.
        // set to 300 dpi
        options.inTargetDensity = 300;
        Bitmap bitmap = BitmapFactory.decodeFile(imgUri.getPath(), options);
        // Change Orientation via EXIF
        if (rotate != 0) {
            // Getting width & height of the given image.
            int w = bitmap.getWidth();
            int h = bitmap.getHeight();
            // Setting pre rotate
            Matrix mtx = new Matrix();
            mtx.preRotate(rotate);
            // Rotating Bitmap
            bitmap = Bitmap.createBitmap(bitmap, 0, 0, w, h, mtx, false);
        }
        // To Grayscale
        bitmap = toGrayscale(bitmap);
        final Bitmap b = bitmap;
        final ImageView ivResult = (ImageView)findViewById(R.id.ivResult);
        if(ivResult != null) {
            runOnUiThread(new Runnable() {
                @Override
                public void run() {
                    ivResult.setImageBitmap(b);
                }
            });
        }
        return extractText(bitmap);
    } catch (Exception e) {
        Log.e(TAG, e.getMessage());
        return "";
    }
}
Here is the extractText() method:
private String extractText(Bitmap bitmap) {
    //Log.d(TAG, "extractText");
    try {
        tessBaseApi = new TessBaseAPI();
    } catch (Exception e) {
        Log.e(TAG, e.getMessage());
        if (tessBaseApi == null) {
            Log.e(TAG, "TessBaseAPI is null. TessFactory not returning tess object.");
        }
    }
    tessBaseApi.init(DATA_PATH, lang);
    //EXTRA SETTINGS
    tessBaseApi.setVariable(TessBaseAPI.VAR_CHAR_WHITELIST, "abcdefghijklmnopqrstuvwxyz1234567890',.?;/ ");
    Log.d(TAG, "Training file loaded");
    tessBaseApi.setDebug(true);
    tessBaseApi.setPageSegMode(TessBaseAPI.PageSegMode.PSM_AUTO_OSD);
    tessBaseApi.setImage(bitmap);
    String extractedText = "empty result";
    try {
        extractedText = tessBaseApi.getUTF8Text();
    } catch (Exception e) {
        Log.e(TAG, "Error in recognizing text.");
    }
    tessBaseApi.end();
    return extractedText;
}
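The value returned by extractText() is shown in the image below:
Although I convert the image to grayscale and scale it up to 300 dpi before running OCR, the accuracy is very low. How can I improve the results? Is the trained data not good enough?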
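I ran some tests, and I have a few observations and conclusions that may improve your results.
- Try passing both lowercase and uppercase letters in your VAR_CHAR_WHITELIST parameter:
Compare the results for the same input:
a) Lowercase only:
Parameter:
baseApi.setVariable(TessBaseAPI.VAR_CHAR_WHITELIST, "abcdefghijklmnopqrstuvwxyz1234567890',.?;/ ");
Result: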
05 atenienses nnito, hdeleto e laicao, os principais acusadores de
gocrates, nao defendiam apenas que o filosofo corrompia a juventude;
eles lutavam tama bern pelas virtudes da tradigao poetica vinculada a
liornero. nristofanes, um dos responsaveis, segundo socrates, dos
preconceitos contra o filosofo, era outro grande defensor dessa
virtude.
socrates, de certa forma, estava em guerra com a tradieao poetica
grega. 0 metodo de socrates era o oposto a narrativa epica de
tlornero. sua dialetica nao tinha nada de semideuses corn superpoderes
6
b) Lowercase and uppercase letters:
Parameter:
baseApi.setVariable(TessBaseAPI.VAR_CHAR_WHITELIST, "aAbBcCdDeEfFgGhHiIjJkKlLmMnNoOpPqQrRsStTuUvVwWxXyYzZ1234567890',.?;/ ");
Result:
Os atenienses Anito, Meleto e Licao, os principais acusadores de
Socrates, nao defendiam apenas que o filosofo corrompia a juventude;
eles lutavam tama bern pelas virtudes da tradigao poetica vinculada a
Homero. Aristofanes, um dos responsaveis, segundo socrates, dos
preconceitos contra o filosofo, era outro grande defensor dessa
virtude.
socrates, de certa forma, estava em guerra com a tradieao poetica
grega. O metodo de socrates era o Oposto a narrativa epica de Homero.
Sua dialetica nao tinha nada de semideuses corn superpoderes 6
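PS: I ran this example with Portuguese text to check some words that need other characters, for example 'é ó ç'. Those characters were not recognized, because they were not passed in the whitelist.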
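To make the whitelist point concrete, here is a minimal sketch of a helper that assembles the string from its parts, so uppercase letters and language-specific characters are not forgotten. The class OcrWhitelist and its build method are hypothetical names, not part of Tess-Two; only setVariable and VAR_CHAR_WHITELIST come from the code above.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper (not part of Tess-Two) that assembles a character whitelist. */
final class OcrWhitelist {
    private static final String LOWER = "abcdefghijklmnopqrstuvwxyz";
    private static final String DIGITS = "1234567890";
    private static final String PUNCTUATION = "',.?;/ ";

    private OcrWhitelist() {}

    /**
     * Returns lowercase letters, uppercase letters, digits, punctuation and the
     * caller's extra characters, with duplicates removed and order preserved.
     */
    static String build(String extraChars) {
        String all = LOWER + LOWER.toUpperCase(Locale.ROOT) + DIGITS + PUNCTUATION + extraChars;
        Set<Character> seen = new LinkedHashSet<>();
        StringBuilder sb = new StringBuilder();
        for (char c : all.toCharArray()) {
            // Set.add returns false for a character that was already added.
            if (seen.add(c)) {
                sb.append(c);
            }
        }
        return sb.toString();
    }
}
```

It could then be used as baseApi.setVariable(TessBaseAPI.VAR_CHAR_WHITELIST, OcrWhitelist.build("éóç")). I assume the accented characters will only be recognized if the loaded traineddata file also covers them.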
I also tried running it with your image, and the results improved (though not by much):
Font 20; Which polrlrcran has caplured Ihe curve, summed up a growing
mood. In a Ierocrous speech? 'Your iron industry is dead. dead as
munon. Your coal yum mono greatly on the iron Vbur Ilk Mary is and. o
Your woolen induslry is Why. Your canon Mr Wilding induslry. blmailf
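So I checked how Tesseract binarizes the image:
Your image has too much noise, so when the API binarizes it, a large part of the image becomes illegible. I suggest running it again without the grayscale conversion, and looking into how to reduce the noise in the image.
To help with debugging, you can save the thresholded image: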
WriteFile.writeBitmap(baseApi.getThresholdedImage())
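As one possible starting point for the noise reduction mentioned above, here is a sketch of a 3x3 median filter, a common way to suppress speckle noise before binarization. This is my own illustration, not something Tess-Two provides: the class MedianFilter is hypothetical, and it works on a plain int[][] of 0..255 gray values, which on Android you would have to fill from the Bitmap yourself (for example via Bitmap.getPixels).

```java
import java.util.Arrays;

/** Hypothetical helper: 3x3 median filter over 8-bit grayscale values. */
final class MedianFilter {
    private MedianFilter() {}

    /** gray[y][x] holds values 0..255; border pixels are copied unchanged. */
    static int[][] apply3x3(int[][] gray) {
        int h = gray.length;
        int w = (h == 0) ? 0 : gray[0].length;
        int[][] out = new int[h][w];
        int[] window = new int[9];
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                if (y == 0 || x == 0 || y == h - 1 || x == w - 1) {
                    out[y][x] = gray[y][x]; // no full 3x3 neighbourhood at the border
                    continue;
                }
                int k = 0;
                for (int dy = -1; dy <= 1; dy++) {
                    for (int dx = -1; dx <= 1; dx++) {
                        window[k++] = gray[y + dy][x + dx];
                    }
                }
                Arrays.sort(window);
                out[y][x] = window[4]; // middle of 9 sorted values
            }
        }
        return out;
    }
}
```

Whether this helps depends on the image. Comparing the thresholded image before and after filtering is probably the quickest way to tell.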
Hope this helps! Thanks for sharing your question!
Abraços!
On this line
options.inSampleSize = 4;
change the value from 4 to 1, then try running the OCR again.
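If decoding at full size uses too much memory, a middle ground is to pick the sample size from the image dimensions instead of hard-coding it. The sketch below assumes you first read the width and height with options.inJustDecodeBounds = true. The class SampleSize and the minimum dimensions are hypothetical, and the right minimum for good OCR is something you would need to find by experiment.

```java
/** Hypothetical helper for choosing BitmapFactory.Options.inSampleSize. */
final class SampleSize {
    private SampleSize() {}

    /**
     * Returns the largest power of two such that the decoded image is still
     * at least minWidth x minHeight, or 1 if the image is already smaller.
     */
    static int choose(int width, int height, int minWidth, int minHeight) {
        if (minWidth <= 0 || minHeight <= 0) {
            throw new IllegalArgumentException("minimum dimensions must be positive");
        }
        int sample = 1;
        // Double the sample size only while the next halving keeps both sides large enough.
        while (width / (sample * 2) >= minWidth && height / (sample * 2) >= minHeight) {
            sample *= 2;
        }
        return sample;
    }
}
```

For example, options.inSampleSize = SampleSize.choose(options.outWidth, options.outHeight, 2000, 1500) downsamples only when the photo is at least twice that size in both dimensions.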