Unable to extract date of birth from a given format

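I have a set of text files from which I need to extract dates of birth. The code below extracts the date of birth from most of the files, but it fails when the date appears in the formats shown below. How can I extract the DOB? The data is highly inconsistent.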

Data:

data="""
Thomas, John - DOB/Sex:    12/23/1955                                     11/15/2014   11:53 AM"
Jacob's Date of birth is 9/15/1963
Name:Annie; DOB:10/30/1970
"""

Code:

import re
pattern = re.compile(r'.*DOB.*((?:\d{1,2})(?:(?:\/|-)\d{1,2})(?:(?:\/|-)\d{2,4})).*',re.I)

matches=pattern.findall(data)

for match in matches:
    print(match)

Expected output:

12/23/1955

import re
string = "DOB/Sex:    12/23/1955            11/15/2014   11:53 AM"
print(re.findall(r'.*?DOB.*?:\s+([\d/]+)', string))

Output:

['12/23/1955']

import re

data="""
Thomas, John - DOB/Sex:    12/23/1955                                     11/15/2014   11:53 AM"
Jacob's Date of birth is 9/15/1963
Name:Annie; DOB:10/30/1970
"""

pattern = re.compile(r'.*?\b(?:DOB|Date of birth)\b.*?(\d{1,2}[/-]\d{1,2}[/-](?:\d\d){1,2})',re.I)

matches=pattern.findall(data)

for match in matches:
    print(match)    

Output:

12/23/1955
9/15/1963
10/30/1970

Explanation:

.*?             : 0 or more of any character except newline, as few as possible (lazy)
\b              : word boundary
(?:             : start non-capturing group
  DOB           : the literal text "DOB"
 |              : OR
  Date of birth : the literal text "Date of birth"
)               : end group
\b              : word boundary
.*?             : 0 or more of any character except newline, as few as possible (lazy)
(               : start capture group 1
    \d{1,2}     : 1 or 2 digits
    [/-]        : slash or dash
    \d{1,2}     : 1 or 2 digits
    [/-]        : slash or dash
    (?:         : start non-capturing group
        \d\d    : 2 digits
    ){1,2}      : end group; it may appear once or twice (i.e. 2 or 4 digits)
)               : end capture group 1
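
If it helps readability, the same breakdown can be kept next to the pattern itself with `re.VERBOSE` (`re.X`). This is only a sketch: the name `dob_pattern` is made up for illustration, the month/day labels in the comments are an assumption based on the sample data, and the leading `.*?` is left out because `findall` already scans for the label on its own.

```python
import re

data = """
Thomas, John - DOB/Sex:    12/23/1955                                     11/15/2014   11:53 AM"
Jacob's Date of birth is 9/15/1963
Name:Annie; DOB:10/30/1970
"""

dob_pattern = re.compile(r"""
    \b(?:DOB|Date\ of\ birth)\b   # label; spaces are escaped because re.X ignores bare whitespace
    .*?                           # lazily skip to the first date after the label
    (                             # group 1: the date
        \d{1,2} [/-]              # 1 or 2 digits (month in the sample data), then / or -
        \d{1,2} [/-]              # 1 or 2 digits (day in the sample data), then / or -
        (?:\d\d){1,2}             # 2 or 4 digit year
    )
""", re.I | re.X)

dates = dob_pattern.findall(data)
for d in dates:
    print(d)
```

With the sample data this prints the same three dates as the one-line pattern. Note that in verbose mode an unescaped space inside "Date of birth" would be silently dropped, which is why the spaces are written as `\ `.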