How to replace ambiguous characters in words following a specific pattern
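I am using tesseract OCR to extract some text from different documents, then I process the extracted text with Regex to see if it matches a specific pattern. Unfortunately, OCR extraction makes common mistakes on ambiguous characters, such as 5: S, 1: I, 0: O, 2: Z, 4: A, 8: B, etc. These mistakes are so common that substituting the ambiguous characters would make the text match the pattern perfectly.
Is there a way to post-process the OCR extraction and replace ambiguous characters (provided in advance) by following a specific pattern?
Expected output (and what I could think of so far):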
# Example: I am extracting car plate numbers that always follow the pattern [A-Z]{2}\d{5}.
# Patterns might differ for other examples, but will always be some alphanumeric combination.
# Complex patterns may be ignored with some warning like "unable to parse".
import re

def post_process(pattern, text, ambiguous_dict):
    # get text[0], check pattern
    # in this case, it should be a letter; if not, try to replace from dict, if yes, pass
    # continue with the next letters until a match is found or the whole text has been looped over
    match = None  # placeholder: the matching logic described above is what I am asking for
    if match:
        return match
    else:
        # some error message
        return None

ambiguous_dict = {'2': 'Z', 'B': '8'}
# My plate photo text: AZ45287
# Noise is fairly easy to filter out by filtering on tesseract confidence level, although not ideal,
# so if a function cannot be made that would find a match through the noise,
# the noise can be ignored in favor of a simpler function that can just find a match.
ocr_output = "someNoise A2452B7 no1Ze"
# 2 in position 1 is replaced by Z, B is replaced by 8. It would be acceptable if the function will
# while '2' on pos 5 should remain a 2 as per pattern
# do this iteratively for each element of ocr_output until pattern is matched or return None
# Any other functionally similar (recursive, generator, other) approach is also acceptable.
result = post_process(r"[A-Z]{2}\d{5}", ocr_output, ambiguous_dict)
if result:
    print(result)  # AZ45287
else:  # result is None
    print("failed to clean output")
I hope I explained my problem well, but feel free to request additional info.
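As always with OCR, it is hard to come up with a solution that is 100% safe and working. What you can do in this case is add the "corrupt" characters to the regex and then "normalize" the matches using a dictionary of replacements.
That means you cannot just use [A-Z]{2}\d{5}, because there can be a 2 among the first two uppercase letters and a B among the five digits. So here you need to change the pattern to ([A-Z2]{2})([\dB]{5}). Note the capturing parentheses that create two subgroups. To normalize each group you need two separate replacements, since you presumably do not want to replace digits with letters in the numeric part (\d{5}), nor letters with digits in the letter part ([A-Z]{2}).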
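To see why the widened character classes matter, here is a minimal check on the sample string from the question. It only uses the two patterns already mentioned; the variable names strict and tolerant are made up for this illustration.

```python
import re

ocr_output = "someNoise A2452B7 no1Ze"

# The strict pattern finds nothing, because OCR turned Z into 2 and 8 into B.
strict = re.search(r"[A-Z]{2}\d{5}", ocr_output)

# The tolerant pattern accepts the known confusions and keeps the two
# parts in separate groups so each can be normalized on its own.
tolerant = re.search(r"([A-Z2]{2})([\dB]{5})", ocr_output)

print(strict)             # None
print(tolerant.groups())  # ('A2', '452B7')
```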
So, here is how it can be implemented in Python:
import re

def post_process(pattern, text, ambiguous_dict_1, ambiguous_dict_2):
    # Collect every match of the tolerant pattern in the OCR text
    matches = list(re.finditer(pattern, text))
    if matches:
        # Normalize each capture group with its own translation table
        return [f"{x.group(1).translate(ambiguous_dict_1)}{x.group(2).translate(ambiguous_dict_2)}" for x in matches]
    else:
        return None

ambiguous_dict_1 = {ord('2'): 'Z'}  # For the first group
ambiguous_dict_2 = {ord('B'): '8'}  # For the second group
ocr_output = "someNoise A2452B7 no1Ze"
result = post_process(r"([A-Z2]{2})([\dB]{5})", ocr_output, ambiguous_dict_1, ambiguous_dict_2)
if result:
    print(result)  # a list with one normalized string per match
else:  # result is None
    print("failed to clean output")
# => ['AZ45287']
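The ambiguous_dict_1 dictionary contains the digit-to-letter replacements, and ambiguous_dict_2 contains the letter-to-digit replacements. Note that str.translate expects the keys to be code points, hence the ord() calls.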
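If the plate format changes between documents, the same idea can be generalized so that the tolerant pattern is generated instead of written by hand. The following is only a rough sketch under stated assumptions: the (kind, length) segment format, the helper build_tolerant_pattern, and the two confusion tables are invented for this example (the pairs come from the list in the question), and it handles only simple letter/digit runs, not arbitrary regexes.

```python
import re

# Hypothetical confusion tables built from the pairs listed in the question.
DIGIT_TO_LETTER = {"5": "S", "1": "I", "0": "O", "2": "Z", "4": "A", "8": "B"}
LETTER_TO_DIGIT = {letter: digit for digit, letter in DIGIT_TO_LETTER.items()}


def build_tolerant_pattern(segments):
    """Build a regex with one capture group per (kind, length) segment.

    kind is "L" for uppercase letters or "D" for digits (a made-up mini-format).
    """
    parts = []
    for kind, length in segments:
        if kind == "L":
            extra = "".join(DIGIT_TO_LETTER)  # digits that may stand in for letters
            parts.append(f"([A-Z{extra}]{{{length}}})")
        elif kind == "D":
            extra = "".join(LETTER_TO_DIGIT)  # letters that may stand in for digits
            parts.append(f"([0-9{extra}]{{{length}}})")
        else:
            raise ValueError(f"unable to parse segment kind: {kind!r}")
    return re.compile("".join(parts))


def post_process(segments, text):
    """Return the normalized matches found in text, or None if there are none."""
    pattern = build_tolerant_pattern(segments)
    tables = {
        "L": str.maketrans(DIGIT_TO_LETTER),  # letter segments: digits become letters
        "D": str.maketrans(LETTER_TO_DIGIT),  # digit segments: letters become digits
    }
    results = []
    for match in pattern.finditer(text):
        fixed = [
            group.translate(tables[kind])
            for (kind, _), group in zip(segments, match.groups())
        ]
        results.append("".join(fixed))
    return results or None


print(post_process([("L", 2), ("D", 5)], "someNoise A2452B7 no1Ze"))  # ['AZ45287']
```

The more confusions the tables contain, the more readily noise will match, so it is probably worth keeping them limited to the mistakes you actually observe.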