What is a Blob in Tesseract OCR?

I am learning Tesseract OCR and am reading this article, which is based on this article. From the first article:

First step is Adaptive Thresholding, which converts the image into binary images. Next step is connected component analysis which is used to extract character outlines. This method is very useful because it does the OCR of image with white text and black background. Tesseract was probably first to provide this kind of processing. Then after, the outlines are converted into Blobs. Blobs are organized into text lines, and the lines and regions are analyzed for some fixed area or equivalent text size.

Can someone explain what a Blob is?

From https://tesseract-ocr.repairfaq.org/tess_glossary.html :

Blob

Isolated, small region of the scanned image. It's delineated by the outline. Tesseract 'juggles' the blobs to see if they can be split further into something that improved the confidence of recognition. Sometimes, blobs are 'combined' if that gives a better result. See pithsync.cpp, for example.

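In general, a blob (also called a connected component) is a connected part of a binary image, i.e. a group of foreground pixels that are not separated from one another. In other words, it is a solid element in a binary image. Blob finding is a key step in any system that aims to extract or measure data from digital images.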
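
To make the idea concrete, here is a minimal sketch of blob finding on a tiny binary image, written in plain Python. It is not how Tesseract does it internally (Tesseract works with outlines and its own data structures). The names `find_blobs` and `bounding_box` are made up for this illustration, and 8-connectivity (diagonal neighbours count as connected) is an assumption chosen for the example.

```python
from collections import deque


def find_blobs(image):
    """Return a list of blobs; each blob is a list of (row, col) foreground pixels.

    `image` is a list of rows containing 0 (background) or 1 (foreground).
    Pixels are grouped with 8-connectivity (an assumption for this sketch).
    """
    rows = len(image)
    cols = len(image[0]) if rows else 0
    seen = [[False] * cols for _ in range(rows)]
    blobs = []

    for r in range(rows):
        for c in range(cols):
            if image[r][c] != 1 or seen[r][c]:
                continue
            # Start a new blob and flood-fill outward from this pixel.
            queue = deque([(r, c)])
            seen[r][c] = True
            blob = []
            while queue:
                y, x = queue.popleft()
                blob.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and image[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            blobs.append(blob)
    return blobs


def bounding_box(blob):
    """Return (top, left, bottom, right) of a blob, inclusive."""
    ys = [y for y, _ in blob]
    xs = [x for _, x in blob]
    return min(ys), min(xs), max(ys), max(xs)


image = [
    [1, 1, 0, 0, 0],
    [1, 0, 0, 1, 0],
    [0, 0, 0, 1, 1],
    [0, 1, 0, 0, 0],
]

blobs = find_blobs(image)
for blob in blobs:
    print(len(blob), "pixels, box", bounding_box(blob))
```

In this toy image there are three separate groups of foreground pixels, so three blobs are found. In an OCR setting, each blob would roughly correspond to a character or a piece of one, which is why a later stage may split or merge blobs, as the glossary entry describes.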