Why does tesseract::ResultIterator break a Chinese word into separate words?

I have an image like this: Chinese characters

I want to find the position of "简体中文", but for some reason a ResultIterator at the RIL_WORD level breaks it up like this:

word: "简体"
word: "中"
word: "文"

I don't understand why this happens. I have tried many options and different page segmentation modes, with no luck. However, when I use GetUTF8Text() on the region with the known coordinates, it returns the correct Chinese text "简体中文". How can I get the correct result with ResultIterator?
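
For reference, the coordinate-based call that does return "简体中文" looks roughly like this (a minimal sketch; the rectangle values are just an example derived from the bounding boxes in the output below):

#include <cstdio>

#include <leptonica/allheaders.h>
#include <tesseract/baseapi.h>

int main() {
  tesseract::TessBaseAPI api;
  if (api.Init("/usr/local/share/tessdata/", "chi_sim")) {
    fprintf(stderr, "Could not initialize tesseract.\n");
    return 1;
  }
  Pix *image = pixRead("chinese_characters.png");
  api.SetImage(image);
  // Restrict recognition to the region that should contain "简体中文":
  // (left, top, width, height); the values here are only an example.
  api.SetRectangle(95, 368, 253, 76);
  char *text = api.GetUTF8Text();  // text of the rectangle only
  printf("%s\n", text);
  delete[] text;                   // caller owns the returned string
  api.End();
  pixDestroy(&image);
  return 0;
}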

Versions:

tesseract 5.0.0
 leptonica-1.78.0
  libgif 5.1.9 : libjpeg 8d (libjpeg-turbo 2.0.3) : libpng 1.6.37 : libtiff 4.1.0 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.1
 Found AVX512BW
 Found AVX512F
 Found AVX2
 Found AVX
 Found FMA
 Found SSE4.1
 Found OpenMP 201511

Full code:

#include <cstdio>
#include <cstdlib>

#include <leptonica/allheaders.h>
#include <tesseract/baseapi.h>

int main() {
  const char *pattern = "简体中文";
  tesseract::TessBaseAPI *api = new tesseract::TessBaseAPI();

  Pix *image = pixRead("chinese_characters.png");
  if (api->Init("/usr/local/share/tessdata/", "chi_sim")) {
    fprintf(stderr, "Could not initialize tesseract.\n");
    exit(1);
  }
  api->SetImage(image);
  api->Recognize(0);
  tesseract::ResultIterator *ri = api->GetIterator();
  tesseract::PageIteratorLevel level = tesseract::RIL_WORD;

  if (ri != nullptr) {
    do {
      char *word = ri->GetUTF8Text(level);
      float conf = ri->Confidence(level);
      int x1, y1, x2, y2;
      ri->BoundingBox(level, &x1, &y1, &x2, &y2);
      printf("word: '%s';  conf: %.2f; BoundingBox: %d,%d,%d,%d;\n", word, conf,
             x1, y1, x2, y2);
      delete[] word;  // free the string returned by GetUTF8Text()
    } while (ri->Next(level));
    delete ri;  // the iterator returned by GetIterator() must also be deleted
  }
  // Destroy used object and release memory
  api->End();
  delete api;
  pixDestroy(&image);

  return 0;
}

Full output:

word: '单词';  conf: 94.69; BoundingBox: 170,226,270,275;
word: '“单词';  conf: 55.34; BoundingBox: 390,226,490,275;
word: '单词';  conf: 88.91; BoundingBox: 610,226,710,275;
word: '单词';  conf: 92.26; BoundingBox: 830,226,930,275;
word: '简体';  conf: 96.09; BoundingBox: 95,372,199,421;
word: '中';  conf: 93.13; BoundingBox: 228,372,291,421;
word: '文';  conf: 48.71; BoundingBox: 290,368,348,444;
word: '”单词';  conf: 48.71; BoundingBox: 393,375,493,424;
word: '单词';  conf: 91.40; BoundingBox: 613,375,713,424;
word: '单词';  conf: 86.79; BoundingBox: 833,375,933,424;
word: '单词';  conf: 57.25; BoundingBox: 1053,375,1153,424;
word: '单词';  conf: 94.69; BoundingBox: 174,520,274,569;
word: '“单词';  conf: 55.34; BoundingBox: 394,520,494,569;
word: '单词';  conf: 88.91; BoundingBox: 614,520,714,569;
word: '单词';  conf: 92.26; BoundingBox: 834,520,934,569;

This is actually correct behavior, because in Chinese some individual characters can be standalone words. If you want to recognize those characters without this word splitting, just use tesseract::RIL_SYMBOL instead of tesseract::RIL_WORD. That way you can iterate over each symbol (character) one by one.
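
A minimal sketch of the same loop using RIL_SYMBOL (based on the code from the question; only the iteration level and the cleanup differ):

#include <cstdio>

#include <leptonica/allheaders.h>
#include <tesseract/baseapi.h>

int main() {
  tesseract::TessBaseAPI api;
  if (api.Init("/usr/local/share/tessdata/", "chi_sim")) {
    fprintf(stderr, "Could not initialize tesseract.\n");
    return 1;
  }
  Pix *image = pixRead("chinese_characters.png");
  api.SetImage(image);
  api.Recognize(0);

  tesseract::ResultIterator *ri = api.GetIterator();
  tesseract::PageIteratorLevel level = tesseract::RIL_SYMBOL;  // per character

  if (ri != nullptr) {
    do {
      char *symbol = ri->GetUTF8Text(level);
      float conf = ri->Confidence(level);
      int x1, y1, x2, y2;
      ri->BoundingBox(level, &x1, &y1, &x2, &y2);
      printf("symbol: '%s';  conf: %.2f; BoundingBox: %d,%d,%d,%d;\n",
             symbol, conf, x1, y1, x2, y2);
      delete[] symbol;
    } while (ri->Next(level));
    delete ri;
  }

  api.End();
  pixDestroy(&image);
  return 0;
}

Each printed entry is then a single character with its own bounding box, and you can concatenate consecutive symbols yourself to locate "简体中文".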