Why does tesseract::ResultIterator break a Chinese word into separate words?
I have an image like this (image: Chinese characters).
I want to find the position of "简体中文", but for some reason, at the tesseract::RIL_WORD level, ResultIterator splits it like this:
word: "简体"
word: "中"
word: "文"
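I don't understand why this happens. I tried many options and different page segmentation modes, but had no luck. However, when I call GetUTF8Text() with specified coordinates, it returns the correct text, "简体中文".
How can I get the correct result using ResultIterator?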
Versions:
tesseract 5.0.0
leptonica-1.78.0
libgif 5.1.9 : libjpeg 8d (libjpeg-turbo 2.0.3) : libpng 1.6.37 : libtiff 4.1.0 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.1
Found AVX512BW
Found AVX512F
Found AVX2
Found AVX
Found FMA
Found SSE4.1
Found OpenMP 201511
Full code:
#include <cstdio>
#include <leptonica/allheaders.h>
#include <tesseract/baseapi.h>

int main() {
  // Text whose position we want to locate (not used by the loop below yet).
  const char *pattern = "简体中文";
  (void)pattern;

  tesseract::TessBaseAPI *api = new tesseract::TessBaseAPI();
  Pix *image = pixRead("chinese_characters.png");
  if (image == nullptr) {
    fprintf(stderr, "Could not read the image.\n");
    delete api;
    return 1;
  }
  // Init() returns 0 on success and non-zero on failure.
  if (api->Init("/usr/local/share/tessdata/", "chi_sim")) {
    fprintf(stderr, "Could not initialize tesseract.\n");
    pixDestroy(&image);
    delete api;
    return 1;
  }
  api->SetImage(image);
  api->Recognize(nullptr);

  tesseract::ResultIterator *ri = api->GetIterator();
  tesseract::PageIteratorLevel level = tesseract::RIL_WORD;
  if (ri != nullptr) {
    do {
      // GetUTF8Text() returns a new[]-allocated string owned by the caller.
      const char *word = ri->GetUTF8Text(level);
      float conf = ri->Confidence(level);
      int x1, y1, x2, y2;
      ri->BoundingBox(level, &x1, &y1, &x2, &y2);
      printf("word: '%s'; conf: %.2f; BoundingBox: %d,%d,%d,%d;\n",
             word ? word : "", conf, x1, y1, x2, y2);
      delete[] word;
    } while (ri->Next(level));
    delete ri;  // The iterator is also owned by the caller.
  }

  // Destroy used objects and release memory.
  api->End();
  delete api;
  pixDestroy(&image);
  return 0;
}
Full output:
word: '单词'; conf: 94.69; BoundingBox: 170,226,270,275;
word: '“单词'; conf: 55.34; BoundingBox: 390,226,490,275;
word: '单词'; conf: 88.91; BoundingBox: 610,226,710,275;
word: '单词'; conf: 92.26; BoundingBox: 830,226,930,275;
word: '简体'; conf: 96.09; BoundingBox: 95,372,199,421;
word: '中'; conf: 93.13; BoundingBox: 228,372,291,421;
word: '文'; conf: 48.71; BoundingBox: 290,368,348,444;
word: '”单词'; conf: 48.71; BoundingBox: 393,375,493,424;
word: '单词'; conf: 91.40; BoundingBox: 613,375,713,424;
word: '单词'; conf: 86.79; BoundingBox: 833,375,933,424;
word: '单词'; conf: 57.25; BoundingBox: 1053,375,1153,424;
word: '单词'; conf: 94.69; BoundingBox: 174,520,274,569;
word: '“单词'; conf: 55.34; BoundingBox: 394,520,494,569;
word: '单词'; conf: 88.91; BoundingBox: 614,520,714,569;
word: '单词'; conf: 92.26; BoundingBox: 834,520,934,569;
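This is actually correct behavior, because in Chinese certain characters can be words on their own. If you want to recognize these symbols without spaces, just use tesseract::RIL_SYMBOL instead of tesseract::RIL_WORD. That way you can iterate over the symbols one at a time.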
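One possible way to recover the position of a multi-character phrase from the iterator output is to concatenate consecutive items and take the union of their bounding boxes. The sketch below is only an illustration under assumptions: `Item`, `Box`, and `FindPattern` are hypothetical helpers, not part of the Tesseract API, and the items are assumed to have been collected beforehand from GetUTF8Text() and BoundingBox() at either the RIL_WORD or the RIL_SYMBOL level. It only matches runs that start and end on item boundaries, and it does not check that the items sit on the same text line.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// One recognized item (word or symbol) plus its bounding box.
// Hypothetical struct for this sketch, not a Tesseract type.
struct Item {
  std::string text;    // UTF-8 text, e.g. copied from GetUTF8Text()
  int x1, y1, x2, y2;  // corners, e.g. copied from BoundingBox()
};

// Axis-aligned rectangle (also hypothetical).
struct Box {
  int x1, y1, x2, y2;
};

// Looks for the first run of consecutive items whose concatenated text
// equals `pattern`. On success, writes the union of their boxes to `out`
// and returns true. On failure, leaves `out` untouched and returns false.
bool FindPattern(const std::vector<Item>& items, const std::string& pattern,
                 Box* out) {
  if (pattern.empty()) return false;
  for (std::size_t start = 0; start < items.size(); ++start) {
    std::string joined;
    Box box = {items[start].x1, items[start].y1,
               items[start].x2, items[start].y2};
    for (std::size_t i = start; i < items.size(); ++i) {
      joined += items[i].text;
      // Give up on this start as soon as the run stops being a prefix.
      if (pattern.compare(0, joined.size(), joined) != 0) break;
      // Grow the box so that it covers the current item as well.
      box.x1 = std::min(box.x1, items[i].x1);
      box.y1 = std::min(box.y1, items[i].y1);
      box.x2 = std::max(box.x2, items[i].x2);
      box.y2 = std::max(box.y2, items[i].y2);
      if (joined.size() == pattern.size()) {
        *out = box;
        return true;
      }
    }
  }
  return false;
}
```

Fed with the three items from the output above ('简体' at 95,372,199,421; '中' at 228,372,291,421; '文' at 290,368,348,444), a search for "简体中文" would yield the box 95,368,348,444. Collecting items at the RIL_SYMBOL level makes every character its own item, so the match no longer depends on where Tesseract decided to place word breaks.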