Regex Error and Improvement: Driving Licence Data Extraction

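I am trying to extract the name, licence number, date of issue and validity from images that I have processed with Pytesseract. I find regex quite confusing, but I have still gone through some documentation and code on the web.

This is what I have so far: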

import pytesseract
import cv2
import re

from PIL import Image
import numpy as np
import datetime

from dateutil.relativedelta import relativedelta

def driver_license(filename):  
    """
    This function will handle the core OCR processing of images.
    """
    
    i = cv2.imread(filename)
    newdata=pytesseract.image_to_osd(i)
    angle = re.search(r'(?<=Rotate: )\d+', newdata).group(0)
    angle = int(angle)
    i = Image.open(filename)
    if angle != 0:
        rot_angle = 360 - angle
        # expand=True grows the canvas so the rotated image is not cropped
        i = i.rotate(rot_angle, expand=True)
        i.save(filename)
    
    i = cv2.imread(filename)
    # Convert to gray
    i = cv2.cvtColor(i, cv2.COLOR_BGR2GRAY)

    # Apply dilation and erosion to remove some noise
    kernel = np.ones((1, 1), np.uint8)
    i = cv2.dilate(i, kernel, iterations=1)
    i = cv2.erode(i, kernel, iterations=1)
    
    txt = pytesseract.image_to_string(i)
    print(txt)
        
    text = []
    data = {
        'firstName': None,
        'lastName': None,
        'age': None,
        'documentNumber': None
    }
    
    c = 0  # 1-based position of the current line, so lines[c] is the line after it
    
    #Splitting lines
    lines = txt.split('\n')
    
    for lin in lines:
        c = c + 1
        s = lin.strip()  # strip() already removes leading/trailing whitespace and newlines
        if s:
            text.append(s)

            try:
                if re.match(r".*Name|.*name|.*NAME", s):           
                    name = re.sub('[^a-zA-Z]+', ' ', s)
                    name = name.replace('Name', '')
                    name = name.replace('name', '')
                    name = name.replace('NAME', '')
                    name = name.replace(':', '')
                    name = name.rstrip()
                    name = name.lstrip()
                    nmlt = name.split(" ")
                    data['firstName'] = " ".join(nmlt[:len(nmlt)-1])
                    data['lastName'] = nmlt[-1]
                if re.search(r"[a-zA-Z][a-zA-Z]-\d{13}", s):
                    data['documentNumber'] = re.search(r'[a-zA-Z][a-zA-Z]-\d{13}', s)
                    data['documentNumber'] = data['documentNumber'].group().replace('-', '')
                    if not data['firstName']:
                        name = lines[c]           
                        name = re.sub('[^a-zA-Z]+', ' ', name)
                        name = name.rstrip()
                        name = name.lstrip()
                        nmlt = name.split(" ")
                        data['firstName'] = " ".join(nmlt[:len(nmlt)-1])
                        data['lastName'] = nmlt[-1]
                if re.search(r"[a-zA-Z][a-zA-Z]\d{2} \d{11}", s):
                    data['documentNumber'] = re.search(r'[a-zA-Z][a-zA-Z]\d{2} \d{11}', s)
                    data['documentNumber'] = data['documentNumber'].group().replace(' ', '')
                    if not data['firstName']:
                        name = lines[c]           
                        name = re.sub('[^a-zA-Z]+', ' ', name)
                        name = name.rstrip()
                        name = name.lstrip()
                        nmlt = name.split(" ")
                        data['firstName'] = " ".join(nmlt[:len(nmlt)-1])
                        data['lastName'] = nmlt[-1]
                if re.match(r".*DOB|.*dob|.*Dob", s):         
                    yob = re.sub('[^0-9]+', ' ', s)
                    yob = re.search(r'\d\d\d\d', yob)
                    data['age'] = datetime.datetime.now().year - int(yob.group())
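            except (AttributeError, IndexError, ValueError):
                # Skip lines where a pattern did not match or there is no following line
                pass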

    print(data)
    

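I also need to extract the validity and the issue date, but I have not been able to get anywhere near that. Also, I have seen regex used to shorten code a lot, so is there a better, more optimal way to do this?

My input data is a string somewhat like this: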

Transport Department Government of NCT of Delhi
Licence to Drive Vehicles Throughout India

Licence No. : DL-0820100052000 (P) R
N : PARMINDER PAL SINGH GILL

: SHRI DARSHAN SINGH GILL

DOB: 10/05/1966 BG: U
Address :

104 SHARDA APPTT WEST ENCLAVE
PITAMPURA DELHI 110034

  

Auth to Drive Date of Issue
M.CYL. 24/02/2010
LMV-NT 24/02/2010

(Holder's Sig natu re)

Issue Date : 20/05/2016
Validity(NT) : 19/05/2021 : c
Validity(T) : NA Issuing Authority
InvCarrNo : NA NWZ-I, WAZIRPUR

Or like this:

in

Transport Department Government of NCT of Delhi
Licence to Drive Vehicles Throughout India

2

   
    
   

Licence No. : DL-0320170595326 () WN
Name : AZAZ AHAMADSIDDIQUIE
s/w/D : SALAHUDDIN ALI
____... DOB: 26/12/1992 BG: O+
\ \ Address:
—.~J ~—; ROO NO-25 AMK BOYS HOSTEL, J.
— NAGAR, DELHI 110025
Auth to Drive Date of Issue
M.CYL. 12/12/2017
4 wt 4
Iseue Date: 12/12/2017 a
falidity(NT) < 2037
Validity(T) +: NA /
Inv CarrNo : NA te sntian sana

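Note: in the second example you will not get the validity; I will tune the OCR for that later. Any proper guide that could help me with regex in simpler terms would be great.

You can use this pattern: (?<=KEY\s*:\s*)\b[^\n]+ and replace KEY with Issue Date, Licence No. or any of the other fields in question. Note that this pattern requires the regex library, because the built-in re module only accepts fixed-width lookbehinds.

Code: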

import regex

text1 = """
Transport Department Government of NCT of Delhi
Licence to Drive Vehicles Throughout India

Licence No. : DL-0820100052000 (P) R
N : PARMINDER PAL SINGH GILL

: SHRI DARSHAN SINGH GILL

DOB: 10/05/1966 BG: U
Address :

104 SHARDA APPTT WEST ENCLAVE
PITAMPURA DELHI 110034



Auth to Drive Date of Issue
M.CYL. 24/02/2010
LMV-NT 24/02/2010

(Holder's Sig natu re)

Issue Date : 20/05/2016
Validity(NT) : 19/05/2021 : c
Validity(T) : NA Issuing Authority
InvCarrNo : NA NWZ-I, WAZIRPUR
"""

# Raw strings keep the regex escapes (\. and \() from being read as string escapes
for key in (r'Issue Date', r'Licence No\.', r'N', r'Validity\(NT\)'):
    print(regex.findall(fr"(?<={key}\s*:\s*)\b[^\n]+", text1, regex.IGNORECASE))

Output:

['20/05/2016']
['DL-0820100052000 (P) R']
['PARMINDER PAL SINGH GILL']
['19/05/2021 : c']
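
If you would rather not install a third-party package, the same idea can be approximated with the standard library. This is only a minimal sketch: it moves the key out of the lookbehind and captures the value in a group instead. The helper name find_value and the shortened sample string are assumptions made for illustration, not part of the original code.

```python
import re

sample = "Issue Date : 20/05/2016\nValidity(NT) : 19/05/2021 : c\n"

# The stdlib re module only accepts fixed-width lookbehinds,
# so the pattern used with the regex library fails to compile here.
try:
    re.compile(r"(?<=Issue Date\s*:\s*)\b[^\n]+")
    lookbehind_supported = True
except re.error:
    lookbehind_supported = False

def find_value(key, text):
    """Return what follows `key :` on the same line, or None (hypothetical helper)."""
    # The key is matched normally and the value is captured in group 1.
    m = re.search(rf"{re.escape(key)}\s*:\s*\b([^\n]+)", text, re.IGNORECASE)
    return m.group(1) if m else None

print(lookbehind_supported)
print(find_value("Issue Date", sample))
print(find_value("Validity(NT)", sample))
```

Because re.escape is applied inside the helper, keys can be passed as plain text ("Validity(NT)") rather than pre-escaped patterns.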

You can also use re with a single regex, based on an alternation, that captures both your keys and their values:

import re
text = "Transport Department Government of NCT of Delhi\nLicence to Drive Vehicles Throughout India\n\nLicence No. : DL-0820100052000 (P) R\nN : PARMINDER PAL SINGH GILL\n\n: SHRI DARSHAN SINGH GILL\n\nDOB: 10/05/1966 BG: U\nAddress :\n\n104 SHARDA APPTT WEST ENCLAVE\nPITAMPURA DELHI 110034\n\n\n\nAuth to Drive Date of Issue\nM.CYL. 24/02/2010\nLMV-NT 24/02/2010\n\n(Holder's Sig natu re)\n\nIssue Date : 20/05/2016\nValidity(NT) : 19/05/2021 : c\nValidity(T) : NA Issuing Authority\nInvCarrNo : NA NWZ-I, WAZIRPUR"
search_phrases = ['Issue Date', 'Licence No.', 'N', 'Validity(NT)']
reg = r"\b({})\s*:\W*(.+)".format( "|".join(sorted(map(re.escape, search_phrases), key=len, reverse=True)) )
print(re.findall(reg, text, re.IGNORECASE))

The output of this short online Python demo:

[('Licence No.', 'DL-0820100052000 (P) R'), ('N', 'PARMINDER PAL SINGH GILL'), ('Issue Date', '20/05/2016'), ('Validity(NT)', '19/05/2021 : c')]

The regex is

\b(Validity\(NT\)|Licence\ No\.|Issue\ Date|N)\s*:\W*(.+)

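See its online demo.

Details:

  • map(re.escape, search_phrases) - escapes all special characters in the search phrases so that they are used as literal text in the regex (otherwise, . would match any character, ? would not match a literal ? character, etc.)
  • sorted(..., key=len, reverse=True) - sorts the search phrases by length in descending order (so that longer matches are tried first)
  • "|".join(...) - creates the alternation pattern, a|b|c|...
  • r"\b({})\s*:\W*(.+)".format( ... ) - creates the final regex.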
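
The steps above can be wrapped in a small function that returns a dict, which is usually easier to work with than a list of tuples. This is only a sketch: the function name extract_fields and the shortened sample are assumptions for illustration.

```python
import re

def extract_fields(text, phrases):
    """Build the alternation pattern from `phrases` and return {key: value}."""
    # Escape each phrase, then put longer phrases first so they win over shorter ones.
    escaped = sorted(map(re.escape, phrases), key=len, reverse=True)
    pattern = r"\b({})\s*:\W*(.+)".format("|".join(escaped))
    # Each match is a (key, value) tuple; later duplicates overwrite earlier ones.
    return {key: value.strip() for key, value in re.findall(pattern, text, re.IGNORECASE)}

sample = (
    "Licence No. : DL-0820100052000 (P) R\n"
    "N : PARMINDER PAL SINGH GILL\n"
    "Issue Date : 20/05/2016\n"
    "Validity(NT) : 19/05/2021 : c\n"
)
fields = extract_fields(sample, ["Issue Date", "Licence No.", "N", "Validity(NT)"])
print(fields)
```

Note that the dict keys are the key text as it appears in the input, so with re.IGNORECASE they follow the casing the OCR produced.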

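Regex details

  • \b - a word boundary (NOTE: if your matches occur at the start of a line, replace it with (?m)^)
  • (Validity\(NT\)|Licence\ No\.|Issue\ Date|N) - Group 1: one of the search phrases
  • \s* - zero or more whitespace characters
  • : - a colon
  • \W* - zero or more non-word characters
  • (.+) - (capturing) Group 2: one or more characters other than line break characters, as many as possible.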
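
Since Group 2 takes everything up to the end of the line, the validity value comes back as '19/05/2021 : c'. A possible refinement, sketched here under the assumption that dates always appear as dd/mm/yyyy, is to anchor the keys with (?m)^ and let Group 2 accept only a date, which can then be parsed with datetime.strptime. The names DATE_FIELDS and dates and the shortened sample are illustrative assumptions.

```python
import re
from datetime import datetime

sample = (
    "Auth to Drive Date of Issue\n"
    "M.CYL. 24/02/2010\n"
    "Issue Date : 20/05/2016\n"
    "Validity(NT) : 19/05/2021 : c\n"
    "Validity(T) : NA Issuing Authority\n"
)

# (?m)^ anchors each key at the start of a line; group 2 only accepts dd/mm/yyyy,
# so trailing OCR noise such as " : c" is left out of the capture.
DATE_FIELDS = re.compile(
    r"(?m)^(Issue Date|Validity\(NT\))\s*:\W*(\d{2}/\d{2}/\d{4})",
    re.IGNORECASE,
)

# Convert each captured string into a datetime.date object.
dates = {
    key: datetime.strptime(value, "%d/%m/%Y").date()
    for key, value in DATE_FIELDS.findall(sample)
}
print(dates)
```

Lines whose value is not a date, such as "Validity(T) : NA", simply produce no entry, so the caller can treat a missing key as "not available".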