How to extract the pixels of a specific color for OCR?

Question

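I want to run some small images/sprites through OCR (probably Tesseract) and extract numbers or words from them. I know these numbers/words will be a specific color (say, white on a noisy/colored background).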

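While reading about preprocessing images for OCR, it occurred to me that removing everything that is not white from the image would help a lot.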

I have both imagemagick and vips available, but I don't know where to start, which operations to use, or what to search for.

Answer 1

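I'm no expert on this, but maybe try changing all pixels whose RGB values fall below a certain threshold to black, or removing them? As I said, I don't know much about this, but I don't see why it wouldn't work.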
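To make the idea concrete, here is a minimal pure-Python sketch. It assumes the image is already loaded as rows of 8-bit (R, G, B) tuples; the helper name `keep_bright` and the threshold of 200 are made up for illustration and would need tuning on real images.

```python
def keep_bright(pixels, threshold=200):
    """Return a copy of `pixels` where every pixel that has any channel
    below `threshold` is replaced with black.

    `pixels` is a list of rows, each row a list of (R, G, B) tuples.
    The threshold value is an assumption, not a recommended setting.
    """
    black = (0, 0, 0)
    return [
        [px if min(px) >= threshold else black for px in row]
        for row in pixels
    ]


# Tiny 2x2 example: two near-white pixels, two colored ones.
image = [
    [(255, 255, 255), (120, 30, 200)],
    [(250, 248, 251), (255, 0, 0)],
]
result = keep_bright(image)
```

Only the near-white pixels survive; everything else goes to black. On a real image you would do the same thing with an image library rather than Python lists.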

Answer 2

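If the image is synthetic and uncompressed, you can test the RGB values for strict equality. Otherwise, apply a threshold to the distance between RGB triples (Euclidean or Manhattan, for example).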
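A rough sketch of the distance test, again on plain lists of (R, G, B) tuples. The function names and the cutoff of 30 are illustrative assumptions; a suitable cutoff depends on how noisy or compressed the image is.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two RGB triples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Sum of absolute per-channel differences between two RGB triples."""
    return sum(abs(x - y) for x, y in zip(a, b))


def mask_near(pixels, target, max_dist, distance=euclidean):
    """Return a boolean mask that is True where a pixel lies within
    `max_dist` of `target` under the chosen distance function."""
    return [[distance(px, target) <= max_dist for px in row] for row in pixels]


white = (255, 255, 255)
row = [[(250, 250, 250), (200, 255, 255), (255, 255, 255)]]
mask = mask_near(row, white, max_dist=30)
```

Strict equality is the special case `max_dist=0`. Passing `distance=manhattan` switches the metric; it is cheaper to compute but gives larger numbers for the same pair of colors, so the cutoff has to be adjusted.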

If you want to allow variation in lightness but not in color, you can convert to HLS and compare H and S.
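One way to sketch the HLS comparison is with Python's standard `colorsys` module. The helper name `same_hue_sat` and both tolerances are assumptions chosen for the example. Note that for white and grey pixels the saturation is 0 and the hue carries no information, so this test is most useful when the text has an actual color.

```python
import colorsys


def same_hue_sat(px, target, hue_tol=0.02, sat_tol=0.1):
    """Return True if `px` and `target` (8-bit RGB triples) have about the
    same hue and saturation, ignoring lightness. Tolerances are on the
    0..1 scale used by colorsys and are illustrative only."""
    h1, _, s1 = colorsys.rgb_to_hls(*(c / 255 for c in px))
    h2, _, s2 = colorsys.rgb_to_hls(*(c / 255 for c in target))
    dh = abs(h1 - h2)
    dh = min(dh, 1 - dh)  # hue is circular, so 0.99 and 0.01 are close
    return dh <= hue_tol and abs(s1 - s2) <= sat_tol


target = (255, 200, 0)                      # bright orange-yellow
darker = same_hue_sat((128, 100, 0), target)   # same color, lower lightness
other = same_hue_sat((0, 200, 255), target)    # a different hue
```

Here the darker variant of the target color matches while the blue pixel does not.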

Answer 3

If we make a sample image like this:

magick -size 300x100 xc: +noise random -gravity center -fill white -pointsize 48 -annotate 0 "Hello" captcha.png

then you can fill anything that is not white with black:

magick captcha.png -fill black +opaque white result.png

If you want to accept colors close to white as white, you can add some "fuzz":

magick captcha.png -fuzz 10% -fill black +opaque white result.png

Answer 4

Background removal techniques were discussed on the libvips tracker a few months ago:

https://github.com/libvips/libvips/issues/1567

Here is the filter:

#!/usr/bin/python3

import sys 
import pyvips

image = pyvips.Image.new_from_file(sys.argv[1], access="sequential")

# aim for 250 for paper with low freq. removal
# ink seems to be slightly blueish
paper = 250
ink = [150, 160, 170]

# remove low frequencies .. don't need huge accuracy
low_freq = image.gaussblur(20, precision="integer")
image = image - low_freq + paper

# pull the ink down
ink_target = 30
scale = [(paper - ink_target) / (paper - i) for i in ink]
offset = [ink_target - i * s for i, s in zip(ink, scale)]
image = image * scale + offset

# find distance to white of each pixel ... small distances go to white
white = [100, 0, 0]
image = image.colourspace("lab")
d = image.dE76(white)
image = (d < 12).ifthenelse(white, image)

# boost saturation (scale ab)
image = image * [1, 2, 2]

image.write_to_file(sys.argv[2])

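It removes low frequencies (paper folds and so on), stretches the contrast range, finds pixels close to white in CIELAB and moves them to white, and boosts the saturation.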
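The contrast stretch may be the least obvious step, so here is a small standalone check of the arithmetic using the same `paper`, `ink` and `ink_target` values as the filter. The `stretch` helper is only for illustration; in the filter the same per-channel linear map is applied to the whole image at once.

```python
paper = 250
ink = [150, 160, 170]
ink_target = 30

# Per-channel linear map v -> v * scale + offset, chosen so that
# the ink colour lands on ink_target while paper stays at paper.
scale = [(paper - ink_target) / (paper - i) for i in ink]
offset = [ink_target - i * s for i, s in zip(ink, scale)]


def stretch(pixel):
    """Apply the per-channel linear map to one RGB pixel."""
    return [v * s + o for v, s, o in zip(pixel, scale, offset)]


ink_out = stretch(ink)            # each channel comes out at about 30
paper_out = stretch([paper] * 3)  # each channel stays at about 250
```

So the bluish ink is pulled down to a neutral dark grey, the paper level is left alone, and everything in between is spread over a wider range.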

You will probably need to tweak it a bit for your use case. If you need more advice, post some sample images.

Tags: ocr, tesseract, imagemagick, image-processing, vips