Clean text images with OpenCV for OCR reading

Question

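I received some images that I need to process in order to extract some information from them. Here are the originals:

Original 1

Original 2

Original 3

Original 4

After processing them with this code: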

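import cv2

img = cv2.imread('original_1.jpg', cv2.IMREAD_GRAYSCALE)  # load as grayscale
ret, thresh = cv2.threshold(img, 55, 255, cv2.THRESH_BINARY)
# 2x2 morphological opening to remove small specks
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (2, 2))
opening = cv2.morphologyEx(thresh, cv2.MORPH_OPEN, kernel)
cv2.imwrite('result_1.jpg', opening)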

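I get these results:

Result 1

Result 2

Result 3

Result 4

As you can see, some images come out well enough for OCR reading, while others still have noise in the background.

Any suggestions on how to clean up the background?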

Answer 1

A small median filter gave this result:

Code (OpenCV C++):

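#include <opencv2/opencv.hpp>
using namespace cv;

int main() {
    Mat im = imread("E:/4.jpg", IMREAD_GRAYSCALE);
    medianBlur(im, im, 3); // 3x3 median removes isolated specks
    threshold(im, im, 70, 255, THRESH_BINARY_INV);
    imshow("1", im);
    waitKey(0);
}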

Answer 2

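MH304's answer is very good and straightforward. If you can't get a cleaner image using morphology or blurring, consider using an "Area Filter". That is, filter out every blob that does not reach a minimum area.

Using OpenCV's connectedComponentsWithStats, here is a very basic C++ implementation of an area filter: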

// bwImage: the binarized input image (single channel, 8-bit)
cv::Mat outputLabels, stats, img_color, centroids;
int connectivity = 8; // assumed here; 4-connectivity also works

int numberofComponents = cv::connectedComponentsWithStats(bwImage, outputLabels,
    stats, centroids, connectivity);

//one random color per label:
std::vector<cv::Vec3b> colors(numberofComponents);
for( int i = 0; i < numberofComponents; i++ ) {
    colors[i] = cv::Vec3b(rand()%256, rand()%256, rand()%256);
}

//do not count the original background-> label = 0:
colors[0] = cv::Vec3b(0,0,0);

//Area threshold:
int minArea = 10; //10 px

//labels run from 0 (background) to numberofComponents-1:
for( int i = 1; i < numberofComponents; i++ ) {

    //get the area of the current blob:
    auto blobArea = stats.at<int>(i, cv::CC_STAT_AREA);

    //apply the area filter:
    if ( blobArea < minArea )
    {
        //filter blob below minimum area:
        //small regions are painted with (ridiculous) pink color
        colors[i] = cv::Vec3b(248,48,213);
    }

}

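Using the area filter, I got this result on your noisiest image:

Additional info:

Basically, the algorithm goes like this:

1. Pass the binary image to connectedComponentsWithStats. The function computes the number of connected components, a label matrix, and an additional matrix of statistics, which includes the blob area.
2. Prepare a color vector of size "numberOfcomponents"; this helps visualize the blobs we are actually filtering. The colors are generated randomly with the rand function. Each color has 3 values (BGR), each in the range 0 to 255.
3. Consider the background to be black, so ignore this "connected component" and its color (black).
4. Set an area threshold. All blobs or pixels below this area will be painted a (ridiculous) pink.
5. Loop through all the connected components (blobs) that were found, retrieve the area of the current blob from the stats matrix, and compare it with the area threshold.
6. If the area is below the threshold, paint the blob pink (in this case; usually you would want black).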

Answer 3

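Here is a fully coded Python solution, based on the guidance provided by @eldesgraciado.

This code assumes that you are already working with a properly binarized image with white text on a black background (for example, after grayscale conversion, a black-hat morphological transform, and Otsu's thresholding). The OpenCV documentation recommends using binarized images with a white foreground when applying morphological operations and the like.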

import cv2
import numpy as np

# thresh_image: binarized input, white text on a black background
num_comps, labeled_pixels, comp_stats, comp_centroids = \
    cv2.connectedComponentsWithStats(thresh_image, connectivity=4)
min_comp_area = 10 # pixels
# get the indices/labels of the remaining components based on the area stat
# (skip the background component at index 0)
remaining_comp_labels = [i for i in range(1, num_comps) if comp_stats[i][cv2.CC_STAT_AREA] >= min_comp_area]
# filter the labeled pixels based on the remaining labels,
# assign pixel intensity to 255 (uint8) for the remaining pixels
clean_img = np.where(np.isin(labeled_pixels, remaining_comp_labels), 255, 0).astype('uint8')

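The advantage of the area-filter approach is that it lets you filter out noise without negatively affecting characters that may already be compromised.

The dirty scans I work with have undesirable effects such as merged characters and eroded characters, and I found out the hard way that there is no free lunch: even a seemingly harmless opening operation with a 3x3 kernel and a single iteration degrades some characters (although it is very effective at removing noise around the characters).

So if the character quality allows it, applying cleanup operations directly to the whole image (e.g., blurring, opening, closing) is fine; if it does not, the area filtering should be done first.

P.S. One more thing: when working with text images, you should not use a lossy format like JPEG; use a lossless format like PNG instead.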

Answer 4

Use this; it will remove the noise:

img = cv2.bilateralFilter(img, 9, 75, 75)  # d=9, sigmaColor=75, sigmaSpace=75

Tags: python, ocr, opencv, tesseract