Remove background noise from an image to make text clearer for OCR

Question

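I've written an application that segments an image based on the text regions within it and extracts those regions as I see fit. What I'm attempting to do is clean the image so that OCR (Tesseract) gives an accurate result. I have the following image as an example: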

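Running this through Tesseract gives a wildly inaccurate result. However, cleaning up the image (using Photoshop) to get the following image:

gives exactly the result I would expect. The first image has already been cleaned to that point by running it through the following method: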

public Mat cleanImage(Mat srcImage) {
    Core.normalize(srcImage, srcImage, 0, 255, Core.NORM_MINMAX);
    Imgproc.threshold(srcImage, srcImage, 0, 255, Imgproc.THRESH_OTSU);
    Imgproc.erode(srcImage, srcImage, new Mat());
    Imgproc.dilate(srcImage, srcImage, new Mat(), new Point(0, 0), 9);
    return srcImage;
}

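What else can I do to clean the first image so that it resembles the second image?

Edit: This is the original image before it was run through the cleanImage function.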

Answer 1

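Does this image help you?

The algorithm that produced this image is easy to implement. I'm sure that if you tweak some of its parameters, you can get very good results for this kind of image.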

I tested all the images with Tesseract:

Original image: nothing detected
Processed image #1: nothing detected
Processed image #2: 12-14 (exact match)
My processed image: y'1'2-14/j

Answer 2

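My answer is based on the following assumptions. It's possible that none of them holds in your case.

You can impose a threshold on the bounding-box heights in the segmented region. Then you should be able to filter out the other components.
You know the average stroke width of the digits. Use this information to minimize the chance of the digits being connected to other regions. You can use the distance transform and morphological operations for this.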
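
To illustrate the second assumption without any library, here is a minimal sketch in plain Java. It is not the OpenCV code from this answer: the class `StrokeWidthFilter`, its method names, and the `boolean[][]` image representation are made up for illustration, and the distance transform is a brute-force one that is only practical for tiny images. The idea it shows is that a pixel whose distance to the background is at most half the stroke width belongs to something thinner than a digit stroke and can be dropped.

```java
/** Sketch: drop strokes thinner than a given width via a brute-force distance transform. */
final class StrokeWidthFilter {

    /** For each foreground pixel, the Euclidean distance to the nearest background pixel. */
    static double[][] distanceTransform(boolean[][] fg) {
        int rows = fg.length, cols = fg[0].length;
        double[][] dist = new double[rows][cols];
        for (int r = 0; r < rows; r++) {
            for (int c = 0; c < cols; c++) {
                if (!fg[r][c]) continue;              // background pixels stay at 0
                double best = Double.MAX_VALUE;
                for (int rr = 0; rr < rows; rr++) {
                    for (int cc = 0; cc < cols; cc++) {
                        if (!fg[rr][cc]) {
                            best = Math.min(best, Math.hypot(r - rr, c - cc));
                        }
                    }
                }
                dist[r][c] = best;
            }
        }
        return dist;
    }

    /** Keeps only pixels whose distance to the background exceeds strokeWidth / 2. */
    static boolean[][] keepThickStrokes(boolean[][] fg, double strokeWidth) {
        double[][] dist = distanceTransform(fg);
        boolean[][] out = new boolean[fg.length][fg[0].length];
        for (int r = 0; r < fg.length; r++) {
            for (int c = 0; c < fg[0].length; c++) {
                out[r][c] = dist[r][c] > strokeWidth / 2.0;
            }
        }
        return out;
    }
}
```

Note that the surviving shapes are eroded versions of the thick strokes, so thin bridges between a digit and nearby clutter disappear along with the clutter itself.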

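This is my procedure for extracting the digits:

Apply Otsu thresholding to the image
Take the distance transform
Threshold the distance-transformed image using the stroke-width ( = 8) constraint
Apply a morphological operation to disconnect the digits from other regions
Filter by bounding-box height and make a guess at where the digits are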

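Stroke width = 8    Stroke width = 10

EDIT

Prepare a mask using the convex hull of the digit contours found
Copy the digits region to a clean image using the mask

Stroke width = 8

Stroke width = 10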

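My Tesseract knowledge is a bit rusty. As I remember, you can get a confidence level for each character. If you still happen to detect noisy regions as character bounding boxes, you can use this information to filter out the noise.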
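
As a rough sketch of that last idea, the plain-Java snippet below drops recognized symbols whose confidence falls under a cutoff. Everything in it is hypothetical: the `Symbol` holder and the `ConfidenceFilter` class are not part of Tesseract or of any wrapper, and it assumes you have already obtained per-symbol confidences on a 0-100 scale from your OCR binding. How to read them out depends on the binding and is not shown here.

```java
import java.util.List;

/** Sketch: discard low-confidence OCR symbols. All types here are hypothetical. */
final class ConfidenceFilter {

    /** Hypothetical holder for one recognized symbol and its confidence (0-100). */
    static final class Symbol {
        final String text;
        final float confidence;

        Symbol(String text, float confidence) {
            this.text = text;
            this.confidence = confidence;
        }
    }

    /** Concatenates the symbols whose confidence is at least minConfidence. */
    static String keepConfident(List<Symbol> symbols, float minConfidence) {
        StringBuilder sb = new StringBuilder();
        for (Symbol s : symbols) {
            if (s.confidence >= minConfidence) {
                sb.append(s.text);
            }
        }
        return sb.toString();
    }
}
```

The cutoff has to be tuned on your own images; there is no value that is known to work in general.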

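C++ Code

// Assumes OpenCV 2.x, with <opencv2/opencv.hpp> included and
// "using namespace cv; using namespace std;" in effect.
Mat im = imread("aRh8C.png", 0);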
// apply Otsu threshold
Mat bw;
threshold(im, bw, 0, 255, CV_THRESH_BINARY_INV | CV_THRESH_OTSU);
// take the distance transform
Mat dist;
distanceTransform(bw, dist, CV_DIST_L2, CV_DIST_MASK_PRECISE);
Mat dibw;
// threshold the distance transformed image
double SWTHRESH = 8;    // stroke width threshold
threshold(dist, dibw, SWTHRESH/2, 255, CV_THRESH_BINARY);
Mat kernel = getStructuringElement(MORPH_RECT, Size(3, 3));
// perform opening, in case digits are still connected
Mat morph;
morphologyEx(dibw, morph, CV_MOP_OPEN, kernel);
dibw.convertTo(dibw, CV_8U);
// find contours and filter
Mat cont;
morph.convertTo(cont, CV_8U);

Mat binary;
cvtColor(dibw, binary, CV_GRAY2BGR);

const double HTHRESH = im.rows * .5;    // height threshold
vector<vector<Point>> contours;
vector<Vec4i> hierarchy;
vector<Point> digits; // points corresponding to digit contours

findContours(cont, contours, hierarchy, CV_RETR_CCOMP, CV_CHAIN_APPROX_SIMPLE, Point(0, 0));
// walk the top-level contours; skip the loop entirely if none were found
for (int idx = contours.empty() ? -1 : 0; idx >= 0; idx = hierarchy[idx][0])
{
    Rect rect = boundingRect(contours[idx]);
    if (rect.height > HTHRESH)
    {
        // append the points of this contour to digit points
        digits.insert(digits.end(), contours[idx].begin(), contours[idx].end());

        rectangle(binary, 
            Point(rect.x, rect.y), Point(rect.x + rect.width - 1, rect.y + rect.height - 1),
            Scalar(0, 0, 255), 1);
    }
}

// take the convexhull of the digit contours
vector<Point> digitsHull;
convexHull(digits, digitsHull);
// prepare a mask
vector<vector<Point>> digitsRegion;
digitsRegion.push_back(digitsHull);
Mat digitsMask = Mat::zeros(im.rows, im.cols, CV_8U);
drawContours(digitsMask, digitsRegion, 0, Scalar(255, 255, 255), -1);
// expand the mask to include any information we lost in earlier morphological opening
morphologyEx(digitsMask, digitsMask, CV_MOP_DILATE, kernel);
// copy the region to get a cleaned image
Mat cleaned = Mat::zeros(im.rows, im.cols, CV_8U);
dibw.copyTo(cleaned, digitsMask);

EDIT

Java Code

Mat im = Highgui.imread("aRh8C.png", 0);
// apply Otsu threshold
Mat bw = new Mat(im.size(), CvType.CV_8U);
Imgproc.threshold(im, bw, 0, 255, Imgproc.THRESH_BINARY_INV | Imgproc.THRESH_OTSU);
// take the distance transform
Mat dist = new Mat(im.size(), CvType.CV_32F);
Imgproc.distanceTransform(bw, dist, Imgproc.CV_DIST_L2, Imgproc.CV_DIST_MASK_PRECISE);
// threshold the distance transform
Mat dibw32f = new Mat(im.size(), CvType.CV_32F);
final double SWTHRESH = 8.0;    // stroke width threshold
Imgproc.threshold(dist, dibw32f, SWTHRESH/2.0, 255, Imgproc.THRESH_BINARY);
Mat dibw8u = new Mat(im.size(), CvType.CV_8U);
dibw32f.convertTo(dibw8u, CvType.CV_8U);

Mat kernel = Imgproc.getStructuringElement(Imgproc.MORPH_RECT, new Size(3, 3));
// open to remove connections to stray elements
Mat cont = new Mat(im.size(), CvType.CV_8U);
Imgproc.morphologyEx(dibw8u, cont, Imgproc.MORPH_OPEN, kernel);
// find contours and filter based on bounding-box height
final double HTHRESH = im.rows() * 0.5; // bounding-box height threshold
List<MatOfPoint> contours = new ArrayList<MatOfPoint>();
List<Point> digits = new ArrayList<Point>();    // contours of the possible digits
Imgproc.findContours(cont, contours, new Mat(), Imgproc.RETR_CCOMP, Imgproc.CHAIN_APPROX_SIMPLE);
for (int i = 0; i < contours.size(); i++)
{
    if (Imgproc.boundingRect(contours.get(i)).height > HTHRESH)
    {
        // this contour passed the bounding-box height threshold. add it to digits
        digits.addAll(contours.get(i).toList());
    }   
}
// find the convexhull of the digit contours
MatOfInt digitsHullIdx = new MatOfInt();
MatOfPoint hullPoints = new MatOfPoint();
hullPoints.fromList(digits);
Imgproc.convexHull(hullPoints, digitsHullIdx);
// convert hull index to hull points
List<Point> digitsHullPointsList = new ArrayList<Point>();
List<Point> points = hullPoints.toList();
for (Integer i: digitsHullIdx.toList())
{
    digitsHullPointsList.add(points.get(i));
}
MatOfPoint digitsHullPoints = new MatOfPoint();
digitsHullPoints.fromList(digitsHullPointsList);
// create the mask for digits
List<MatOfPoint> digitRegions = new ArrayList<MatOfPoint>();
digitRegions.add(digitsHullPoints);
Mat digitsMask = Mat.zeros(im.size(), CvType.CV_8U);
Imgproc.drawContours(digitsMask, digitRegions, 0, new Scalar(255, 255, 255), -1);
// dilate the mask to capture any info we lost in earlier opening
Imgproc.morphologyEx(digitsMask, digitsMask, Imgproc.MORPH_DILATE, kernel);
// cleaned image ready for OCR
Mat cleaned = Mat.zeros(im.size(), CvType.CV_8U);
dibw8u.copyTo(cleaned, digitsMask);
// feed cleaned to Tesseract

Answer 3

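Just a bit of out-of-the-box thinking:

From your original image I can see that it is a rather strictly pre-formatted document, which looks like a road tax badge or something similar, right?

If the above assumption is correct, then you could implement a less generic solution: the noise you are trying to get rid of is due to features of the specific document template, and it occurs in specific, known regions of your image. In fact, so does the text.

In that case, one way to go about it is to define the boundaries of the regions where you know such "noise" is present, and simply white them out.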
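
A minimal sketch of the "white them out" step, in plain Java so that it runs without OpenCV. The `TemplateEraser` class, the `Region` holder, and the `int[][]` grayscale representation (rows by columns, values 0-255) are all made up for illustration; the region coordinates would come from your knowledge of the document template.

```java
/** Sketch: overwrite known noise regions of a grayscale image with a background value. */
final class TemplateEraser {

    /** Hypothetical axis-aligned region: top-left corner plus size, in pixels. */
    static final class Region {
        final int x, y, width, height;

        Region(int x, int y, int width, int height) {
            this.x = x;
            this.y = y;
            this.width = width;
            this.height = height;
        }
    }

    /** Returns a copy of gray in which every listed region is filled with background. */
    static int[][] eraseRegions(int[][] gray, Region[] regions, int background) {
        int rows = gray.length, cols = gray[0].length;
        int[][] out = new int[rows][];
        for (int r = 0; r < rows; r++) {
            out[r] = gray[r].clone();                 // leave the input untouched
        }
        for (Region reg : regions) {
            // clip the region to the image so out-of-range templates are harmless
            int rEnd = Math.min(rows, reg.y + reg.height);
            int cEnd = Math.min(cols, reg.x + reg.width);
            for (int r = Math.max(0, reg.y); r < rEnd; r++) {
                for (int c = Math.max(0, reg.x); c < cEnd; c++) {
                    out[r][c] = background;
                }
            }
        }
        return out;
    }
}
```

This only works if the document is always framed the same way; if the badge can shift or rotate in the photo, the regions would first have to be aligned to the template.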

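Then follow the rest of the steps you are already performing: apply noise reduction to remove the finest details (i.e., the background pattern that looks like the security watermark or hologram in the badge). The result should be clear enough for Tesseract to process without trouble.

Just a thought anyway. It's not a generic solution, I acknowledge, so it depends on what your actual requirements are.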

Answer 4

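I think you need to work more on the preprocessing part, so that the image is as clean as possible before calling Tesseract.

My ideas are as follows:

1- Extract the contours from the image and find the contours in the image (check this and this)

2- Each contour has a width, a height and an area, so you can filter the contours according to their width, height and area (check this and this). You can also use parts of the contour analysis code here to filter the contours, and you can delete contours that do not resemble a "letter or number" contour using template contour matching.

3- After filtering the contours, you can check where the letters and digits are in the image, so you may need to use a text detection method such as the one here

4- All you need to do now is remove the non-text areas and the contours that are not good from the image

5- Now you can create your own binarization method, or you can use the one in Tesseract to binarize the image, and then call OCR on the image.

These are the most thorough steps for doing this; you can use just some of them, and that may be enough for you.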

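Other ideas:

You can do this in different ways; the best idea is to find a way to detect the locations of the digits and characters, using methods such as template matching or feature-based methods like HOG.
You can first binarize your image to get a binary image, then apply an opening with a line structuring element in the horizontal and vertical directions; this will help you detect the edges afterwards, segment the image, and then do the OCR.
After detecting all the contours in the image, you can also use the Hough transformation to detect any kind of line or defined curve, like this one; this way you can detect the characters that are lined up, so you can segment the image and do the OCR afterwards.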

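A much easier way:

1- Do the binarization

2- Apply some morphological operations to separate the contours:

3- Invert the colors in the image (this may be done before step 2)

4- Find all the contours in the image

5- Delete all contours whose width is greater than their height, as well as the very small contours, the very large ones, and the non-rectangular contours

Note: you can use text detection methods (or HOG or edge detection) instead of steps 4 and 5

6- Find the large rectangle that contains all the remaining contours in the image

7- You can do some extra preprocessing to enhance the input for Tesseract, and then call the OCR. (I advise you to crop the image and pass that as the input to the OCR [I mean crop the yellow rectangle and pass only that, not the whole image, which will also enhance the results])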
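
Steps 5 and 6 above can be sketched on bounding boxes alone. The snippet below is plain Java and only an illustration: `ContourBoxFilter`, `Box`, and the `minArea` / `maxArea` parameters are hypothetical names, the boxes are assumed to come from whatever contour finder you use, and the "non-rectangular" test from step 5 is left out because it needs the contour points rather than just the box.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: filter contour bounding boxes, then find the rectangle enclosing the rest. */
final class ContourBoxFilter {

    /** Hypothetical bounding box of one contour, in pixels. */
    static final class Box {
        final int x, y, width, height;

        Box(int x, int y, int width, int height) {
            this.x = x;
            this.y = y;
            this.width = width;
            this.height = height;
        }

        int area() {
            return width * height;
        }
    }

    /** Step 5: drop boxes that are wider than tall, very small, or very large. */
    static List<Box> keepCharacterLike(List<Box> boxes, int minArea, int maxArea) {
        List<Box> kept = new ArrayList<>();
        for (Box b : boxes) {
            boolean widerThanTall = b.width > b.height;
            boolean badSize = b.area() < minArea || b.area() > maxArea;
            if (!widerThanTall && !badSize) {
                kept.add(b);
            }
        }
        return kept;
    }

    /** Step 6: smallest box containing all given boxes, or null if there are none. */
    static Box enclosing(List<Box> boxes) {
        if (boxes.isEmpty()) {
            return null;
        }
        int left = Integer.MAX_VALUE, top = Integer.MAX_VALUE;
        int right = Integer.MIN_VALUE, bottom = Integer.MIN_VALUE;
        for (Box b : boxes) {
            left = Math.min(left, b.x);
            top = Math.min(top, b.y);
            right = Math.max(right, b.x + b.width);
            bottom = Math.max(bottom, b.y + b.height);
        }
        return new Box(left, top, right - left, bottom - top);
    }
}
```

One caveat with the width-greater-than-height rule: a character such as the hyphen in "12-14" is itself wider than it is tall, so it would be dropped as a contour. It can still end up inside the crop if it lies between characters that were kept, since step 6 crops the enclosing rectangle rather than the individual boxes.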

Answer 5

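The font size should be neither too big nor too small, approximately in the range of 10-12 pt (i.e., a character height roughly above 20 and below 80). You can resample the image and try it with Tesseract. Also, a few fonts are not trained in Tesseract, and problems may arise if the text is not in one of the trained fonts.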
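
A small worked example of the resampling arithmetic, as a sketch in plain Java. It assumes the 20 and 80 above are pixel heights, and it picks the middle of the allowed range as the target, which is an arbitrary choice; `OcrScale` and `scaleFor` are hypothetical names.

```java
/** Sketch: choose a resize factor that brings the character height into a target range. */
final class OcrScale {

    /**
     * Returns 1.0 if the measured character height is already within [minPx, maxPx];
     * otherwise returns the factor that maps it to the middle of that range.
     */
    static double scaleFor(int charHeightPx, int minPx, int maxPx) {
        if (charHeightPx <= 0) {
            throw new IllegalArgumentException("character height must be positive");
        }
        if (charHeightPx >= minPx && charHeightPx <= maxPx) {
            return 1.0;
        }
        double target = (minPx + maxPx) / 2.0;
        return target / charHeightPx;
    }
}
```

For example, characters measured at 200 pixels tall with a 20 to 80 range give a factor of 0.25, so the image would be shrunk to a quarter of its size before being passed to Tesseract.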
