Regex Error and Improving Driving Licence Data Extraction

Question

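I am trying to extract the name, licence number, date of issue, and validity from an image that I processed with Pytesseract. I find regular expressions quite confusing, but I have still gone through some documentation and code on the web.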

This is how far I got:

import pytesseract
import cv2
import re

from PIL import Image
import numpy as np
import datetime

from dateutil.relativedelta import relativedelta

def driver_license(filename):  
    """
    This function will handle the core OCR processing of images.
    """
    
    i = cv2.imread(filename)
    newdata=pytesseract.image_to_osd(i)
    angle = re.search(r'(?<=Rotate: )\d+', newdata).group(0)
    angle = int(angle)
    i = Image.open(filename)
    if angle != 0:
       #with Image.open("ro2.jpg") as i:
        rot_angle = 360 - angle
        i = i.rotate(rot_angle, expand="True")
        i.save(filename)
    
    i = cv2.imread(filename)
    # Convert to gray
    i = cv2.cvtColor(i, cv2.COLOR_BGR2GRAY)

    # Apply dilation and erosion to remove some noise
    kernel = np.ones((1, 1), np.uint8)
    i = cv2.dilate(i, kernel, iterations=1)
    i = cv2.erode(i, kernel, iterations=1)
    
    txt = pytesseract.image_to_string(i)
    print(txt)
        
    text = []
    data = {
        'firstName': None,
        'lastName': None,
        'age': None,
        'documentNumber': None
    }
    
    c = 0
    print(txt)
    
    #Splitting lines
    lines = txt.split('\n')
    
    for lin in lines:
        c = c + 1
        s = lin.strip()
        s = s.replace('\n','')
        if s:
            s = s.rstrip()
            s = s.lstrip()
            text.append(s)

            try:
                if re.match(r".*Name|.*name|.*NAME", s):           
                    name = re.sub('[^a-zA-Z]+', ' ', s)
                    name = name.replace('Name', '')
                    name = name.replace('name', '')
                    name = name.replace('NAME', '')
                    name = name.replace(':', '')
                    name = name.rstrip()
                    name = name.lstrip()
                    nmlt = name.split(" ")
                    data['firstName'] = " ".join(nmlt[:len(nmlt)-1])
                    data['lastName'] = nmlt[-1]
                if re.search(r"[a-zA-Z][a-zA-Z]-\d{13}", s):
                    data['documentNumber'] = re.search(r'[a-zA-Z][a-zA-Z]-\d{13}', s)
                    data['documentNumber'] = data['documentNumber'].group().replace('-', '')
                    if not data['firstName']:
                        name = lines[c]           
                        name = re.sub('[^a-zA-Z]+', ' ', name)
                        name = name.rstrip()
                        name = name.lstrip()
                        nmlt = name.split(" ")
                        data['firstName'] = " ".join(nmlt[:len(nmlt)-1])
                        data['lastName'] = nmlt[-1]
                if re.search(r"[a-zA-Z][a-zA-Z]\d{2} \d{11}", s):
                    data['documentNumber'] = re.search(r'[a-zA-Z][a-zA-Z]\d{2} \d{11}', s)
                    data['documentNumber'] = data['documentNumber'].group().replace(' ', '')
                    if not data['firstName']:
                        name = lines[c]           
                        name = re.sub('[^a-zA-Z]+', ' ', name)
                        name = name.rstrip()
                        name = name.lstrip()
                        nmlt = name.split(" ")
                        data['firstName'] = " ".join(nmlt[:len(nmlt)-1])
                        data['lastName'] = nmlt[-1]
                if re.match(r".*DOB|.*dob|.*Dob", s):         
                    yob = re.sub('[^0-9]+', ' ', s)
                    yob = re.search(r'\d\d\d\d', yob)
                    data['age'] = datetime.datetime.now().year - int(yob.group())
            except:
                pass

    print(data)

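I also need to extract the validity and the issue date, but I can't get anywhere near that. Also, I have seen that using regex shortens the code a lot, so is there a better, more optimal way to do this?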

My input data is a string that looks something like this:

Transport Department Government of NCT of Delhi
Licence to Drive Vehicles Throughout India

Licence No. : DL-0820100052000 (P) R
N : PARMINDER PAL SINGH GILL

: SHRI DARSHAN SINGH GILL

DOB: 10/05/1966 BG: U
Address :

104 SHARDA APPTT WEST ENCLAVE
PITAMPURA DELHI 110034

  

Auth to Drive Date of Issue
M.CYL. 24/02/2010
LMV-NT 24/02/2010

(Holder's Sig natu re)

Issue Date : 20/05/2016
Validity(NT) : 19/05/2021 : c
Validity(T) : NA Issuing Authority
InvCarrNo : NA NWZ-I, WAZIRPUR

Or like this:

in

Transport Department Government of NCT of Delhi
Licence to Drive Vehicles Throughout India

2

   
    
   

Licence No. : DL-0320170595326 () WN
Name : AZAZ AHAMADSIDDIQUIE
s/w/D : SALAHUDDIN ALI
____... DOB: 26/12/1992 BG: O+
\ \ Address:
—.~J ~—; ROO NO-25 AMK BOYS HOSTEL, J.
— NAGAR, DELHI 110025
Auth to Drive Date of Issue
M.CYL. 12/12/2017
4 wt 4
Iseue Date: 12/12/2017 a
falidity(NT) < 2037
Validity(T) +: NA /
Inv CarrNo : NA te sntian sana

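Note: in the second example you won't get the validity; I will optimise the OCR for that later. Any proper guide that helps me write somewhat simpler regex would be great.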

Answer 1

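You can use this pattern: (?<=KEY\s*:\s*)\b[^\n]+ and replace KEY with Issue Date, Licence No., or any of the other fields in the question. Note that this pattern needs the third-party regex library, because the standard re module does not accept a variable-width lookbehind.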

Code:

import regex

text1 = """
Transport Department Government of NCT of Delhi
Licence to Drive Vehicles Throughout India

Licence No. : DL-0820100052000 (P) R
N : PARMINDER PAL SINGH GILL

: SHRI DARSHAN SINGH GILL

DOB: 10/05/1966 BG: U
Address :

104 SHARDA APPTT WEST ENCLAVE
PITAMPURA DELHI 110034



Auth to Drive Date of Issue
M.CYL. 24/02/2010
LMV-NT 24/02/2010

(Holder's Sig natu re)

Issue Date : 20/05/2016
Validity(NT) : 19/05/2021 : c
Validity(T) : NA Issuing Authority
InvCarrNo : NA NWZ-I, WAZIRPUR
"""

for key in (r'Issue Date', r'Licence No\.', r'N', r'Validity\(NT\)'):
    print(regex.findall(fr"(?<={key}\s*:\s*)\b[^\n]+", text1, regex.IGNORECASE))

Output:

['20/05/2016']
['DL-0820100052000 (P) R']
['PARMINDER PAL SINGH GILL']
['19/05/2021 : c']
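
If installing the regex module is not an option, a similar result can probably be had with the standard re module by moving the label out of the lookbehind and into the match itself, then reading the value from a capturing group. The following is only a sketch under that assumption: the helper name field and the shortened sample string are made up for illustration, and it assumes every label starts its own line.

```python
import re

# Shortened copy of the OCR text from the question (illustration only).
sample = """Licence No. : DL-0820100052000 (P) R
N : PARMINDER PAL SINGH GILL
DOB: 10/05/1966 BG: U
Issue Date : 20/05/2016
Validity(NT) : 19/05/2021 : c
Validity(T) : NA Issuing Authority
"""

def field(key, text):
    """Return whatever follows `key :` on the same line, or None if absent.

    `field` is a hypothetical helper, not part of any library.
    """
    # (?m) lets ^ and $ match at every line, (?i) ignores case.
    # re.escape() makes the label literal, so "." and "(" need no manual escaping.
    # [ \t]* is used instead of \s* so an empty value cannot spill onto the next line.
    pattern = rf"(?mi)^[ \t]*{re.escape(key)}[ \t]*:[ \t]*(.+)$"
    m = re.search(pattern, text)
    return m.group(1).strip() if m else None

for key in ("Issue Date", "Licence No.", "N", "Validity(NT)"):
    print(key, "->", field(key, sample))
```

Because the label is anchored to the start of a line, a one-letter key such as N is less likely to match in the middle of another word than it is with a bare lookbehind.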

Answer 2

You can also use re with a single regex, based on alternation, that captures both your keys and their values:

import re
text = "Transport Department Government of NCT of Delhi\nLicence to Drive Vehicles Throughout India\n\nLicence No. : DL-0820100052000 (P) R\nN : PARMINDER PAL SINGH GILL\n\n: SHRI DARSHAN SINGH GILL\n\nDOB: 10/05/1966 BG: U\nAddress :\n\n104 SHARDA APPTT WEST ENCLAVE\nPITAMPURA DELHI 110034\n\n\n\nAuth to Drive Date of Issue\nM.CYL. 24/02/2010\nLMV-NT 24/02/2010\n\n(Holder's Sig natu re)\n\nIssue Date : 20/05/2016\nValidity(NT) : 19/05/2021 : c\nValidity(T) : NA Issuing Authority\nInvCarrNo : NA NWZ-I, WAZIRPUR"
search_phrases = ['Issue Date', 'Licence No.', 'N', 'Validity(NT)']
reg = r"\b({})\s*:\W*(.+)".format( "|".join(sorted(map(re.escape, search_phrases), key=len, reverse=True)) )
print(re.findall(reg, text, re.IGNORECASE))

Output of this snippet:

[('Licence No.', 'DL-0820100052000 (P) R'), ('N', 'PARMINDER PAL SINGH GILL'), ('Issue Date', '20/05/2016'), ('Validity(NT)', '19/05/2021 : c')]

The regex is

\b(Validity\(NT\)|Licence\ No\.|Issue\ Date|N)\s*:\W*(.+)

Details:

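map(re.escape, search_phrases) - escapes all special characters in the search phrases so that they are treated as literal text in the regex (otherwise, . would match any character, ? would not match a literal ? character, etc.)
sorted(..., key=len, reverse=True) - sorts the search phrases by length in descending order (so that longer matches are tried first)
"|".join(...) - creates the alternation pattern, a|b|c|...
r"\b({})\s*:\W*(.+)".format( ... ) - creates the final regex.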
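
To see why the escaping and the length sort matter, here is a small sketch that can be run on its own. The test strings are made up for illustration; only re functions from the standard library are used.

```python
import re

label = "Validity(NT) : 19/05/2021"

# Unescaped, the parentheses form a capturing group, so the pattern
# looks for the literal text "ValidityNT" and misses the real label.
unescaped = re.search("Validity(NT)", label)
escaped = re.search(re.escape("Validity(NT)"), label)
print(unescaped)        # no match
print(escaped.group())  # the label as written

# Unescaped, "." matches any character, so a wrong label slips through.
dot_unescaped = re.fullmatch("Licence No.", "Licence Nox")
dot_escaped = re.fullmatch(re.escape("Licence No."), "Licence Nox")

# Alternation tries branches left to right and stops at the first that
# succeeds, so a short phrase listed first can shadow a longer one.
first_listed = re.match("N|No", "No").group()
ordered = "|".join(sorted(["N", "No"], key=len, reverse=True))
longest_first = re.match(ordered, "No").group()
print(first_listed, longest_first)
```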

Regex details

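\b - a word boundary (NOTE: if your matches always occur at the start of a line, replace it with (?m)^)
(Validity\(NT\)|Licence\ No\.|Issue\ Date|N) - Group 1: one of the search phrases
\s* - zero or more whitespace characters
: - a colon
\W* - zero or more non-word characters
(.+) - (capturing) Group 2: one or more characters other than line break characters, as many as possible.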
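
Since the question also asks for the issue date and the validity as usable values, the pairs returned by re.findall can be turned into a dict and the noisy values cleaned up afterwards. This is a rough sketch rather than a finished solution: the helper name parse_date is hypothetical, DOB is added to the search phrases for illustration, and the dd/mm/yyyy format and the two-letters-plus-13-digits licence format are assumptions taken from the sample data.

```python
import re
from datetime import datetime

# Shortened copy of the OCR text from the question (illustration only).
text = """\
Licence No. : DL-0820100052000 (P) R
N : PARMINDER PAL SINGH GILL
DOB: 10/05/1966 BG: U
Issue Date : 20/05/2016
Validity(NT) : 19/05/2021 : c
Validity(T) : NA Issuing Authority
"""

search_phrases = ['Issue Date', 'Licence No.', 'N', 'Validity(NT)', 'DOB']
alternation = "|".join(sorted(map(re.escape, search_phrases), key=len, reverse=True))
reg = r"\b({})\s*:\W*(.+)".format(alternation)

# Each match is a (key, value) tuple, so dict() builds a lookup table.
fields = dict(re.findall(reg, text, re.IGNORECASE))

def parse_date(value):
    """Return the first dd/mm/yyyy date in a noisy OCR value, or None."""
    m = re.search(r"\b\d{2}/\d{2}/\d{4}\b", value or "")
    if not m:
        return None
    try:
        return datetime.strptime(m.group(), "%d/%m/%Y").date()
    except ValueError:  # e.g. OCR produced 45/13/2016
        return None

issue_date = parse_date(fields.get('Issue Date'))
valid_until = parse_date(fields.get('Validity(NT)'))  # trailing " : c" is ignored

# Assumed licence format: two letters, a dash, 13 digits.
m = re.search(r"[A-Z]{2}-\d{13}", fields.get('Licence No.', ''))
licence_no = m.group().replace('-', '') if m else None

print(fields['N'], licence_no, issue_date, valid_until)
```

Returning None for anything that does not parse keeps values like NA from raising, which matters when the OCR output is as uneven as in the second sample.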
