Tesseract OCR German Special Characters
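I am using Tesseract OCR from C++ to read German PNG images, and I am running into problems with special characters such as ß, ä, ö, ü and so on.
Do I need to train Tesseract to read them correctly, or what else do I need to do?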
This is the part of the original image read by tesseract
tesseract::TessBaseAPI *api = new tesseract::TessBaseAPI();
Update
SetConsoleOutputCP(1252);//changed to german.
SetConsoleCP(1252);//changed to german
wcout << "ÄÖÜ?ß" << endl;
// Open input image with leptonica library
Pix *image = pixRead("D:\\Images\\Document.png");
api->Init("D:\\TesseractBeispiele\\Tessaractbeispiel\\Tessaractbeispiel\\tessdata", "deu");
api->SetImage(image);
api->SetVariable("save_blob_choices", "T");
api->SetRectangle(1000, 3000, 9000, 9000);
api->Recognize(NULL);
// Get OCR result
wcout << api->GetUTF8Text();
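After adding the code below the Update marker, hard-coded umlauts are displayed correctly, but the text read from the image is still wrong. What do I need to change?
The Tesseract version is 3.0.2.
The Leptonica version is 1.68.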
Tesseract can recognize Unicode characters. Your console is probably not configured to display them.
What encoding/code page is cmd.exe using?
Unicode characters in Windows command line - how?
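To make the point above concrete: `GetUTF8Text()` returns UTF-8 bytes, so "ß" arrives as the two bytes 0xC3 0x9F, while a console set to code page 1252 expects the single byte 0xDF. The following is a minimal, portable sketch of one way to bridge that gap. The helper name `utf8ToCp1252Subset` is hypothetical and not part of Tesseract, and it is deliberately limited: it maps only ASCII and U+00A0..U+00FF (which covers ä, ö, ü, Ä, Ö, Ü and ß) and replaces everything else with `?`. On Windows, switching the console to UTF-8 with `SetConsoleOutputCP(CP_UTF8)` and printing the bytes unchanged may be a simpler alternative.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of Tesseract): converts UTF-8 text into
// single-byte text suitable for a console using code page 1252.
// Assumption: only ASCII and U+00A0..U+00FF are mapped, because in that
// range the code page 1252 byte equals the code point. Anything else,
// including malformed input, becomes '?'.
std::string utf8ToCp1252Subset(const std::string& utf8) {
    std::string out;
    std::size_t i = 0;
    while (i < utf8.size()) {
        const unsigned char lead = static_cast<unsigned char>(utf8[i]);
        std::size_t extra = 0;    // continuation bytes expected after the lead byte
        unsigned long cp = lead;  // code point being assembled
        bool valid = true;

        if (lead < 0x80)                { extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { extra = 1; cp = lead & 0x1F; }
        else if ((lead & 0xF0) == 0xE0) { extra = 2; cp = lead & 0x0F; }
        else if ((lead & 0xF8) == 0xF0) { extra = 3; cp = lead & 0x07; }
        else                            { valid = false; }  // stray or invalid byte

        std::size_t consumed = 1;
        for (std::size_t k = 0; valid && k < extra; ++k) {
            if (i + consumed >= utf8.size()) { valid = false; break; }  // truncated
            const unsigned char c = static_cast<unsigned char>(utf8[i + consumed]);
            if ((c & 0xC0) != 0x80) { valid = false; break; }           // not a continuation
            cp = (cp << 6) | (c & 0x3F);
            ++consumed;
        }
        i += consumed;

        const bool representable =
            valid && (cp < 0x80 || (cp >= 0xA0 && cp <= 0xFF));
        out.push_back(representable
                          ? static_cast<char>(static_cast<unsigned char>(cp))
                          : '?');
    }
    return out;
}
```

With the result of `GetUTF8Text()` held in a `std::string` named `text`, writing `std::cout << utf8ToCp1252Subset(text)` would then send one byte per umlaut to a console that has been set to code page 1252.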
I don't know how to detect German words in an image on Windows, but I know how to do it on Linux. The following code may give you some ideas.
/*
 * word_OCR.cpp
 *
 * Created on: Jun 23, 2016
 * Author: root
 */
#include <tesseract/baseapi.h>
#include <leptonica/allheaders.h>
#include <cstdio>
#include <cstdlib>
#include <iostream>
using namespace std;
int main(int argc, char **argv)
{
    if (argc < 2) {
        cerr << "Usage: word_OCR <image>\n";
        return 1;
    }
    Pix *image = pixRead(argv[1]);
    if (image == 0) {
        cout << "Cannot load input file!\n";
        return 1;  // stop here instead of passing a null image to Tesseract
    }
    tesseract::TessBaseAPI tess;
    // Instead of passing "eng", pass "deu".
    if (tess.Init("/usr/share/tesseract/tessdata", "deu")) {
        fprintf(stderr, "Could not initialize tesseract.\n");
        exit(1);
    }
    tess.SetImage(image);
    tess.Recognize(0);
    tesseract::ResultIterator *ri = tess.GetIterator();
    tesseract::PageIteratorLevel level = tesseract::RIL_WORD;
    if (ri != 0)
    {
        do {
            // GetUTF8Text returns a UTF-8 string allocated with new[]; it can be null.
            const char *word = ri->GetUTF8Text(level);
            if (word != 0) {
                cout << word << endl;
                delete[] word;
            }
        } while (ri->Next(level));
        delete ri;  // the iterator is a single object, not an array
    }
    tess.End();
    pixDestroy(&image);
    return 0;
}
One thing to take care of: pass an image with good resolution. Only then does it work well.