Using regex to correct OCR output identical chars (capital I vs. 1, etc.)

Question

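I have a trained OCR model that reads specific fonts. Some of these fonts have identical-looking characters, such as 1 and capital I, so occasionally, when the lexicon prediction fails, I get I's where 1's should be and 1's where I's should be.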

In my case, I know there should never be...

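a 1 within a string; e.g., 1NDEPENDENCE DAY
an I within an integer; e.g., 45I OZ
an I next to certain special characters such as %, + and -; e.g., I% OFF, TEMP: -I DEGREES
a solitary I--these will all be 1's; e.g., TIME: I TO 5 PM
consecutive I's; e.g., II A.M.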

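Here is my first attempt at addressing some of these cases, but I'm sure there is a more efficient way to do it. Maybe by looping through a list of regular expressions with re.sub()?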

import re

ocr_output = "TIME: I TO 5 PM, I% OFF, TEMP: -I DEGREES, II07 OZ, 1NDEPENDENCE DAY, II A.M."

# A digit, + or - followed by I: that I should be a 1.
while True:
    x = re.search(r"[\d+-]I", ocr_output)
    if x:
        ocr_output = ocr_output[:x.start()+1] + '1' + ocr_output[x.start() + 2:]
    else:
        break

# An I followed by a digit, % or -: that I should be a 1.
while True:
    x = re.search(r"I[\d%-]", ocr_output)
    if x:
        ocr_output = ocr_output[:x.start()] + '1' + ocr_output[x.start() + 1:]
    else:
        break

# A capital letter followed by 1: that 1 should be an I.
while True:
    x = re.search(r"[A-Z]1", ocr_output)
    if x:
        ocr_output = ocr_output[:x.start()+1] + 'I' + ocr_output[x.start() + 2:]
    else:
        break

# A 1 followed by a capital letter: that 1 should be an I.
while True:
    x = re.search(r"1[A-Z]", ocr_output)
    if x:
        ocr_output = ocr_output[:x.start()] + 'I' + ocr_output[x.start() + 1:]
    else:
        break
    
print(ocr_output)

>>> TIME: I TO 5 PM, 1% OFF, TEMP: -1 DEGREES, 1107 OZ, INDEPENDENCE DAY, II A.M.

What more elegant solutions can you think of to correct my OCR output for these cases? I'm working in Python. Thanks!

Answer 1

Here is what I came up with:

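import re

def preprocess_ocr_output(text: str) -> str:
    output = text
    # 1 followed by a word character -> I
    output = re.sub(r"1(?![\s%])(?=\w+)", "I", output)
    # 1 preceded by a word character -> I
    output = re.sub(r"(?<=\w)(?<![\s+\-])1", "I", output)
    # I followed by a digit or % -> 1
    output = re.sub(r"I(?!\s)(?=[\d%])", "1", output)
    # I preceded by +, - or a digit -> 1
    output = re.sub(r"(?<=[+\-\d])(?<!\s)I", "1", output)
    return output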

Regarding "a solitary I--these will all be 1's; e.g., TIME: I TO 5 PM":

I don't think you can rescue the free-standing "I" -> "1" case without causing problems elsewhere...
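
To make that trade-off concrete, here is a minimal sketch of what a solitary-I rule might look like. The name fix_solitary_i and the pattern itself are assumptions for illustration and are not part of the function above. It treats an I with no letter or digit on either side as a 1, which handles the example from the question but also rewrites a genuine pronoun "I".

```python
import re

# Hypothetical rule (an assumption, not from the answer above):
# an "I" with no letter or digit directly before or after it becomes "1".
SOLITARY_I = re.compile(r"(?<![A-Za-z0-9])I(?![A-Za-z0-9])")

def fix_solitary_i(text: str) -> str:
    """Replace every free-standing capital I with the digit 1."""
    return SOLITARY_I.sub("1", text)

print(fix_solitary_i("TIME: I TO 5 PM"))  # the case from the question
print(fix_solitary_i("I AM HERE"))        # a real pronoun is rewritten as well
```

Whether this is acceptable depends on the data: if the text being read never contains the word "I", the rule may be safe to add as a final step.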

Answer 2

I would use these regexes for the general case. Don't forget to precompile the regexes for performance reasons.

import re

I_regex = re.compile(r"(?<=[%I0-9\-+])I|I(?=[%I0-9])")
One_regex = re.compile(r"(?<=[A-Z])1|1(?=[A-Z])|1(?=[a-z])")
def preprocess(text):
    output = I_regex.sub('1', text)
    output = One_regex.sub('I', output)
    return output

Output:

>>> preprocess('TIME: I TO 5 PM, I% OFF, TEMP: -I DEGREES, II07 OZ, 1NDEPENDENCE DAY, II A.M.')
'TIME: I TO 5 PM, 1% OFF, TEMP: -1 DEGREES, 1107 OZ, INDEPENDENCE DAY, 11 A.M.'
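
The question's idea of looping over a list of regular expressions with re.sub() could be sketched roughly as follows. The names RULES and apply_rules are hypothetical; the two patterns are copied from the code above. Keeping the rules in one ordered list may make it easier to add or reorder cases later.

```python
import re

# Ordered (compiled pattern, replacement) pairs; RULES is a hypothetical name.
# The patterns are the two regexes from the answer above.
RULES = [
    (re.compile(r"(?<=[%I0-9\-+])I|I(?=[%I0-9])"), "1"),
    (re.compile(r"(?<=[A-Z])1|1(?=[A-Z])|1(?=[a-z])"), "I"),
]

def apply_rules(text, rules=RULES):
    """Apply each substitution in order, feeding the result into the next rule."""
    for pattern, replacement in rules:
        text = pattern.sub(replacement, text)
    return text

print(apply_rules("45I OZ, 1NDEPENDENCE DAY"))
```

The order matters here: the I -> 1 rule runs first, so the 1 -> I rule sees the already corrected digits.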

python

regex

ocr