Removing horizontal underlines

Question

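I am trying to extract text from several hundred JPG files containing information on capital punishment records; the JPGs are hosted by the Texas Department of Criminal Justice (TDCJ). Below is an example snippet with personally identifiable information removed.

I have identified the underlines as the obstacle to proper OCR: if I go in, screenshot a sub-snippet, and manually white out the lines, the resulting OCR through pytesseract is very good. With the underlines present, it is extremely poor.

How can I best remove these horizontal lines? What I have tried:

Starting with the OpenCV docs walkthrough: Extract horizontal and vertical lines by using morphological operations. I got stuck quickly because I know zero C++.
Following a related question - ended up with an illegible string.
Following Removing long horizontal/vertical lines from edge image using OpenCV - could not get any intuition for the sizing of the array of zeros there.

I am tagging this question c++ in the hope that someone could help translate Step 5 of the docs walkthrough to Python. I have tried a batch of transformations, such as the Hough Line Transform, but I am feeling around in the dark within a library and an area I have no prior experience with.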

import cv2

# Inverted grayscale
img = cv2.imread('rsnippet.jpg', cv2.IMREAD_GRAYSCALE)
img = cv2.bitwise_not(img)

# Transform inverted grayscale to binary
th = cv2.adaptiveThreshold(img, 255, cv2.ADAPTIVE_THRESH_MEAN_C,
                            cv2.THRESH_BINARY, 15, -2)

# An alternative; Not sure if `th` or `th2` is optimal here
th2 = cv2.threshold(img, 170, 255, cv2.THRESH_BINARY)[1]

# Create corresponding structure element for horizontal lines.
# Start by cloning th/th2.
horiz = th.copy()
r, c = horiz.shape

# Lost after here - not understanding intuition behind sizing/partitioning

Answer 1

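A few suggestions:

Given that you are starting with a JPEG, don't compound the loss. Save your intermediate files as PNGs. Tesseract handles those fine.
Scale the image up by a factor of 2 (using cv2.resize) before handing it to Tesseract.
Try to detect and remove the black underlines. (This question might help.) Doing that while preserving descenders might be tricky.
Explore Tesseract's command-line options, of which there are many (and they are horribly documented; some require digging into the C++ source code to understand). It looks like ligatures are causing some grief. IIRC (it has been a while), there are a setting or two that might help.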

Answer 2

You could try this.

import cv2
import numpy as np
import matplotlib.pyplot as plt

img = cv2.imread('img_provided_by_op.jpg', 0)
img = cv2.bitwise_not(img)  

# (1) clean up noises
kernel_clean = np.ones((2,2),np.uint8)
cleaned = cv2.erode(img, kernel_clean, iterations=1)

# (2) Extract lines
kernel_line = np.ones((1, 5), np.uint8)  
clean_lines = cv2.erode(cleaned, kernel_line, iterations=6)
clean_lines = cv2.dilate(clean_lines, kernel_line, iterations=6)

# (3) Subtract lines
cleaned_img_without_lines = cleaned - clean_lines
cleaned_img_without_lines = cv2.bitwise_not(cleaned_img_without_lines)

plt.imshow(cleaned_img_without_lines, cmap='gray')  # show as grayscale, not false color
plt.show()
cv2.imwrite('img_wanted.jpg', cleaned_img_without_lines)

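Demo

This approach is based on Zaw Lin's answer. He/she identified the lines in the image and simply subtracted them to remove them. However, we cannot just subtract the lines here, because the letters e, t, E, T and the hyphen - also contain horizontal strokes! If we simply subtracted horizontal lines from the image, e would look almost identical to c, and - would be gone...

Q: How do we find the lines?

To find the lines, we can use the erode function. To use erode, we need to define a kernel. (You can think of a kernel as the window/shape that the function operates on.)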

The kernel slides through the image (as in 2D convolution). A pixel in the original image (either 1 or 0) will be considered 1 only if all the pixels under the kernel is 1, otherwise it is eroded (made to zero). -- (Source).

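To extract lines, we define a kernel, kernel_line, as np.ones((1, 5)), i.e. [1, 1, 1, 1, 1]. This kernel slides across the image and erodes every pixel that has a 0 anywhere under the kernel.

More specifically, when the kernel is applied to a pixel, it covers the two pixels to its left and the two pixels to its right.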

 [X X Y X X]
      ^
      |
Applied to Y, `kernel_line` captures Y's neighbors. If any of them is
0, Y will be set to 0.

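Horizontal lines are preserved under this kernel, while pixels without horizontal neighbors disappear. This is how we capture the lines with the following line of code.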

clean_lines = cv2.erode(cleaned, kernel_line, iterations=6)
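To make the sliding-window rule concrete, here is a small NumPy-only sketch of a single horizontal erosion pass on one row of 0/1 pixels. It is an illustration of the idea rather than OpenCV's implementation; the helper `erode_row` and the sample row are made up, and treating out-of-image pixels as 1 is meant to mirror OpenCV's default border behavior for erosion.

```python
import numpy as np

def erode_row(row, width=5):
    """Binary erosion of a 1-D 0/1 array with a flat window of odd `width`.

    A pixel stays 1 only if every pixel in the window centered on it is 1.
    Pixels outside the array count as 1, so the borders themselves do not erode.
    """
    half = width // 2
    padded = np.pad(row, half, constant_values=1)
    windows = np.lib.stride_tricks.sliding_window_view(padded, width)
    return windows.min(axis=1)

# A 3-pixel run (like a letter stroke) and a 7-pixel run (like an underline)
row = np.array([0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0], dtype=np.uint8)
print(erode_row(row))  # [0 0 0 0 0 0 0 0 1 1 1 0 0 0]
```

The short run disappears entirely, while the long run loses two pixels at each end and survives.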

Q: How do we avoid extracting the strokes inside e, E, t, T, and -?

We combine erosion and dilation with the iterations parameter.

clean_lines = cv2.erode(cleaned, kernel_line, iterations=6)

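You may have noticed the iterations=6 part. The effect of this parameter is to make the flat parts of e, E, t, T, and - disappear. This is because each time we apply the same operation, the ends of these strokes shrink. (With the same kernel, only the boundary pixels meet a 0 and therefore become 0.) We use this trick to make the strokes in these characters vanish.

However, this has a side effect: the long underlines we want to remove also shrink. We can grow them back with dilate!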

clean_lines = cv2.dilate(clean_lines, kernel_line, iterations=6)

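In contrast to erosion, which shrinks the foreground, dilation makes it larger. We still use the same kernel, kernel_line, but now the target pixel becomes 1 if any pixel under the kernel is 1. Applying this, the boundaries grow back. (The parts in e, E, t, T, and - do not grow back, provided we choose the parameters carefully so that they vanish completely during erosion.)

With this additional trick, we can successfully get rid of the lines without hurting e, E, t, T, and -.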

Answer 3

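All the answers so far seem to use morphological operations. Here is something a bit different. It should give fairly good results if the lines are horizontal.

I used a part of your sample image, shown below, for this.

Load the image, convert it to grayscale, and invert it.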

import cv2
import numpy as np
import matplotlib.pyplot as plt

im = cv2.imread('sample.jpg')
gray = 255 - cv2.cvtColor(im, cv2.COLOR_BGR2GRAY)

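Inverted grayscale image:

If you scan a row of this inverted image, you will see that its profile looks different depending on whether a line is present or not.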

plt.figure(1)
plt.plot(gray[18, :] > 16, 'g-')
plt.axis([0, gray.shape[1], 0, 1.1])
plt.figure(2)
plt.plot(gray[36, :] > 16, 'r-')
plt.axis([0, gray.shape[1], 0, 1.1])

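The green profile is a row without an underline, the red one a row with an underline. If you take the average of each profile, you will see that the red one has a higher average.

So, using this approach, you can detect the underlines and remove them.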

for row in range(gray.shape[0]):
    avg = np.average(gray[row, :] > 16)
    if avg > 0.9:
        cv2.line(im, (0, row), (gray.shape[1]-1, row), (0, 0, 255))
        cv2.line(gray, (0, row), (gray.shape[1]-1, row), (0, 0, 0), 1)

cv2.imshow("gray", 255 - gray)
cv2.imshow("im", im)
cv2.waitKey(0)  # keep the windows open until a key is pressed
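The row scan above can also be written without an explicit loop. The following NumPy sketch is one possible vectorized form; the function name `find_underline_rows` and the synthetic array are illustrative, while the default thresholds (16 and 0.9) are the ones used in this answer.

```python
import numpy as np

def find_underline_rows(gray_inv, pixel_thresh=16, row_frac=0.9):
    """Return the indices of rows in which more than `row_frac` of the
    pixels of the inverted grayscale image exceed `pixel_thresh`."""
    coverage = (gray_inv > pixel_thresh).mean(axis=1)
    return np.flatnonzero(coverage > row_frac)

# Synthetic inverted image: one text-like row and one underline with a gap
demo = np.zeros((8, 50), np.uint8)
demo[2, ::3] = 200     # sparse "text" pixels, about a third of the row
demo[5, :] = 180       # underline across the full width
demo[5, 20:23] = 0     # small break in the underline (47 of 50 pixels remain)

rows = find_underline_rows(demo)
cleaned = demo.copy()
cleaned[rows, :] = 0   # blank out the detected rows
```

As with the loop version, this removes whole rows, so any letter parts that cross a detected row are cut as well.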

Here are the detected underlines, marked in red, and the cleaned image.

tesseract output of the cleaned image:

Convthed as th(
shot once in the
she stepped fr<
brother-in-lawii
collect on life in
applied for man
to the scheme i|

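The reason for using only part of the image should be clear by now. Since the personally identifiable information has been removed from the original image, the threshold would not have worked on it. But that should not be a problem when you apply this to your own processing. Sometimes you may have to adjust the thresholds (16, 0.9).

The result does not look very good, with parts of the letters removed and some faint lines still remaining. I will update if I can improve it a bit more.

UPDATE:

Did some improvements: cleaning up and linking the missing parts of the letters. I have commented the code, so I believe the process is clear. You can also check the resulting intermediate images to see how it works. The results are a bit better.

tesseract output of the cleaned image: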

Convicted as th(
shot once in the
she stepped fr<
brother-in-law. ‘
collect on life ix
applied for man
to the scheme i|

tesseract output of the cleaned image:

)r-hire of 29-year-old .
revolver in the garage ‘
red that the victim‘s h
{2000 to kill her. mum
250.000. Before the kil
If$| 50.000 each on bin
to police.

Python code:

import cv2
import numpy as np
import matplotlib.pyplot as plt

im = cv2.imread('sample2.jpg')
gray = 255 - cv2.cvtColor(im, cv2.COLOR_BGR2GRAY)
# prepare a mask using Otsu threshold, then copy from original. this removes some noise
__, bw = cv2.threshold(cv2.dilate(gray, None), 128, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU)
gray = cv2.bitwise_and(gray, bw)
# make copy of the low-noise underlined image
grayu = gray.copy()
imcpy = im.copy()
# scan each row and remove lines
for row in range(gray.shape[0]):
    avg = np.average(gray[row, :] > 16)
    if avg > 0.9:
        cv2.line(im, (0, row), (gray.shape[1]-1, row), (0, 0, 255))
        cv2.line(gray, (0, row), (gray.shape[1]-1, row), (0, 0, 0), 1)

cont = gray.copy()
graycpy = gray.copy()
# after contour processing, the residual will contain small contours
residual = gray.copy()
# find contours
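# [-2:] keeps this working whether findContours returns 2 or 3 values (depends on the OpenCV version)
contours, hierarchy = cv2.findContours(cont, cv2.RETR_CCOMP, cv2.CHAIN_APPROX_SIMPLE)[-2:]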
for i in range(len(contours)):
    # find the boundingbox of the contour
    x, y, w, h = cv2.boundingRect(contours[i])
    if 10 < h:
        cv2.drawContours(im, contours, i, (0, 255, 0), -1)
        # if boundingbox height is higher than threshold, remove the contour from residual image
        cv2.drawContours(residual, contours, i, (0, 0, 0), -1)
    else:
        cv2.drawContours(im, contours, i, (255, 0, 0), -1)
        # if boundingbox height is less than or equal to threshold, remove the contour from the gray image
        cv2.drawContours(gray, contours, i, (0, 0, 0), -1)

# now the residual only contains small contours. open it to remove thin lines
st = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (3, 3))
residual = cv2.morphologyEx(residual, cv2.MORPH_OPEN, st, iterations=1)
# prepare a mask for residual components
__, residual = cv2.threshold(residual, 0, 255, cv2.THRESH_BINARY)

cv2.imshow("gray", gray)
cv2.imshow("residual", residual)   

# combine the residuals. we still need to link the residuals
combined = cv2.bitwise_or(cv2.bitwise_and(graycpy, residual), gray)
# link the residuals
st = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (1, 7))
linked = cv2.morphologyEx(combined, cv2.MORPH_CLOSE, st, iterations=1)
cv2.imshow("linked", linked)
# prepare a mask from linked image
__, mask = cv2.threshold(linked, 0, 255, cv2.THRESH_BINARY)
# copy region from low-noise underlined image
clean = 255 - cv2.bitwise_and(grayu, mask)
cv2.imshow("clean", clean)
cv2.imshow("im", im)
cv2.waitKey(0)  # keep the windows open until a key is pressed

Answer 4

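Since most of the lines to detect in your source image are long horizontal lines, the approach is similar to another answer of mine.

This is the source image:

Here are my two main steps to remove the long horizontal lines:

Do a morph-close with a long line kernel on the gray image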

kernel = np.ones((1,40), np.uint8)
morphed = cv2.morphologyEx(gray, cv2.MORPH_CLOSE, kernel)

This gives the morphed image, which contains the long lines:

Invert the morphed image, and add to the source image:

dst = cv2.add(gray, (255-morphed))

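This gives the image with the long lines removed:

Simple enough, right? There are still some small line segments left, but I think they have little effect on OCR. Note that almost all characters keep their original shape, except g, j, p, q, y, and Q, which may look a bit different. But modern OCR tools such as Tesseract (with LSTM technology) are well able to deal with such simple confusion.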

0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ

The complete code, which saves the line-removed image as line_removed.png:

#!/usr/bin/python3
# 2018.01.21 16:33:42 CST

import cv2
import numpy as np

## Read
img = cv2.imread("img04.jpg")
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

## (1) Create long line kernel, and do morph-close-op
kernel = np.ones((1,40), np.uint8)
morphed = cv2.morphologyEx(gray, cv2.MORPH_CLOSE, kernel)
cv2.imwrite("line_detected.png", morphed)


## (2) Invert the morphed image, and add to the source image:
dst = cv2.add(gray, (255-morphed))
cv2.imwrite("line_removed.png", dst)

Update @ 2018.01.23 13:15:15 CST:

Tesseract is a powerful OCR tool. Today I installed tesseract-4.0 and pytesseract. Then I ran OCR with pytesseract on my result, line_removed.png.

import cv2       
import pytesseract
img = cv2.imread("line_removed.png")
print(pytesseract.image_to_string(img, lang="eng"))

This is the result, which looks fine to me.

Convicted as the triggerman in the murder—for—hire of 29—year—old .

shot once in the head with a 357 Magnum revolver in the garage of her home at ..
she stepped from her car. Police discovered that the victim‘s husband,
brother—in—law, _ ______ paid _ ,000 to kill her, apparently so .. _
collect on life insurance policies totaling 0,000. Before the killing, .

applied for additional life insurance policies of 0,000 each on himself and his wife
to the scheme in three different statements to police.

was

and
could
had also

. confessed
