Can I use python 're' to parse complex human names?

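One of my main pain points is understanding names, specifically picking apart household names and titles. I have an 80% solution built on a truly massive regex that I put together this morning. I probably shouldn't be proud of it (but I am anyway, in a kind of sick way). It correctly matches the following examples: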

John Jeffries
John Jeffries, M.D.
John Jeffries, MD
John Jeffries and Jim Smith
John and Jim Jeffries
John Jeffries & Jennifer Wilkes-Smith, DDS, MD
John Jeffries, CPA & Jennifer Wilkes-Smith, DDS, MD
John Jeffries, C.P.A & Jennifer Wilkes-Smith, DDS, MD
John Jeffries, C.P.A., MD & Jennifer Wilkes-Smith, DDS, MD
John Jeffries M.D. and Jennifer Holmes CPA
John Jeffries M.D. & Jennifer Holmes CPA

The regex matcher looks like this:

(?P<first_name>\S*\s*)?(?!and\s|&\s)(?P<last_name>[\w-]*\s*)(?P<titles1>,?\s*(?!and\s|&\s)[\w\.]*,*\s*(?!and\s|&\s)[\w\.]*)?(?P<connector>\sand\s|\s*&*\s*)?(?!and\s|&\s)(?P<first_name2>\S*\s*)(?P<last_name2>[\w-]*\s*)?(?P<titles2>,?\s*[\w\.]*,*\s*[\w\.]*)?

(WTF, right?)

For convenience: http://www.pyregex.com/

So, for the example:

'John Jeffries, C.P.A., MD & Jennifer Wilkes-Smith, DDS, MD'

the regex produces a group dict that looks like this:

connector: &
first_name: John
first_name2: Jennifer
last_name: Jeffries
last_name2: Wilkes-Smith
titles1: , C.P.A., MD
titles2: , DDS, MD
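
For reference, here is a minimal sketch of how this group dict can be reproduced with the standard `re` module. The pattern is the one above, split across adjacent string literals only for readability. The names `PATTERN`, `text`, and `groups` and the `.strip()` call are additions for illustration (the raw groups capture surrounding whitespace, so they are stripped here to match the listing).

```python
import re

# The pattern from the question, split into adjacent raw-string pieces.
PATTERN = re.compile(
    r"(?P<first_name>\S*\s*)?"
    r"(?!and\s|&\s)(?P<last_name>[\w-]*\s*)"
    r"(?P<titles1>,?\s*(?!and\s|&\s)[\w\.]*,*\s*(?!and\s|&\s)[\w\.]*)?"
    r"(?P<connector>\sand\s|\s*&*\s*)?"
    r"(?!and\s|&\s)(?P<first_name2>\S*\s*)"
    r"(?P<last_name2>[\w-]*\s*)?"
    r"(?P<titles2>,?\s*[\w\.]*,*\s*[\w\.]*)?"
)

text = "John Jeffries, C.P.A., MD & Jennifer Wilkes-Smith, DDS, MD"
match = PATTERN.match(text)

# Strip each group; unmatched optional groups come back as None.
groups = {key: (value or "").strip() for key, value in match.groupdict().items()}

for key in sorted(groups):
    print(f"{key}: {groups[key]}")
```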

I need help with the final step, which has been nagging at me: understanding possible middle names.

Examples include:

'John Jimmy Jeffries, C.P.A., MD & Jennifer Wilkes-Smith, DDS, MD'
'John Jeffries, C.P.A., MD & Jennifer Jenny Wilkes-Smith, DDS, MD'

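Is this possible? Is there a better way to do this without machine learning? Maybe I could use nameparser (which I discovered after going down the regex rabbit hole) together with some way of determining whether a string contains more than one name? The regex above matches 99.9% of my cases, so I feel it's worth finishing.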

TLDR: I can't figure out if I can use some sort of lookahead or lookbehind to make sure that the possible middle name only matches if there is a last name after it.
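
To make the lookahead idea concrete, here is a rough sketch for a single person (the string would first have to be split on "and"/"&"). Everything in it is a hypothetical illustration rather than a tested solution: `TITLE`, `NAME`, `PERSON`, and `parse_person` are made-up names, and it assumes a title is a run of capital letters with optional dots, so a middle initial such as "Q." would be mistaken for a title.

```python
import re

# Assumption: a title is capital letters, each optionally followed by a dot.
TITLE = r"(?:[A-Z]\.?)+"
# A name token is a word that is not, as a whole, a title.
NAME = r"(?!" + TITLE + r"(?:[\s,]|$))[\w-]+"

PERSON = re.compile(
    r"(?P<first>" + NAME + r")"
    # The middle name only matches if another name token follows it.
    r"(?:\s+(?P<middle>" + NAME + r")(?=\s+" + NAME + r"))?"
    r"(?:\s+(?P<last>" + NAME + r"))?"
    r"(?P<titles>(?:,?\s*" + TITLE + r")*)\s*$"
)

def parse_person(text):
    """Return the named groups for one person, or None if nothing matches."""
    match = PERSON.match(text.strip())
    return match.groupdict() if match else None

print(parse_person("John Jimmy Jeffries, C.P.A., MD"))
print(parse_person("John Jeffries MD"))
```

The positive lookahead `(?=\s+NAME)` is what keeps "Jeffries" from being consumed as a middle name in "John Jeffries MD": the token after it is a title, not a name, so the optional middle group is skipped.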

Note: I don't need to parse titles such as Mr., Mrs., Ms., etc., but I imagine they could be added in the same way as middle names.

Solution Notes: First, follow Richard's advice and don't do this. Second, investigate NLTK or use/contribute to nameparser for a more robust solution if necessary.

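Regular expressions like this are the work of the Dark One.

Who will be able to understand what is going on when they look at your code later? Will you?

How will you test all the possible edge cases?

Why did you choose a regex in the first place? If the tool you are using is this hard to work with, that is a sign another tool might serve you better.

Try this: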

import re

examples = [
  "John Jeffries",
  "John Jeffries, M.D.",
  "John Jeffries, MD",
  "John Jeffries and Jim Smith",
  "John and Jim Jeffries",
  "John Jeffries & Jennifer Wilkes-Smith, DDS, MD",
  "John Jeffries, CPA & Jennifer Wilkes-Smith, DDS, MD",
  "John Jeffries, C.P.A & Jennifer Wilkes-Smith, DDS, MD",
  "John Jeffries, C.P.A., MD & Jennifer Wilkes-Smith, DDS, MD",
  "John Jeffries M.D. and Jennifer Holmes CPA",
  "John Jeffries M.D. & Jennifer Holmes CPA",
  'John Jimmy Jeffries, C.P.A., MD & Jennifer Wilkes-Smith, DDS, MD',
  'John Jeffries, C.P.A., MD & Jennifer Jenny Wilkes-Smith, DDS, MD'
]

def IsTitle(inp):
  #A title is a run of capital letters, each optionally followed by "."
  #(MD, M.D., C.P.A); mixed-case titles such as PhD are not recognized
  return re.match(r'^([A-Z]\.?)+$',inp.strip()) is not None

def ParseName(name):
  #Titles are separated from each other and from names with ","
  #We don't need these, so we remove them
  name = name.replace(',',' ')
  #Split name and titles on spaces, combining adjacent spaces
  name = name.split()
  #Build an output object
  ret_name = {"first":None, "middle":None, "last":None, "titles":[]}
  #First string is always a first name
  ret_name['first'] = name[0]
  if len(name)>2: #John Johnson Smith / John Smith MD
    if IsTitle(name[2]): #John Smith MD
      ret_name['last']   = name[1]
      ret_name['titles'] = name[2:]
    else:                #John Johnson Smith, MD, DDS
      ret_name['middle'] = name[1]
      ret_name['last']   = name[2]
      ret_name['titles'] = name[3:]
  elif len(name) == 2:   #John Johnson
    ret_name['last'] = name[1]
  return ret_name

def CombineNames(inp):
  #"John and Jim Jeffries": the first person borrows the second person's last name
  if len(inp)>1 and not inp[0]['last']:
    inp[0]['last'] = inp[1]['last']

def ParseString(inp):
  inp = inp.replace("&"," and ")     #Names are combined with "&" or "and"
  inp = re.split(r"\s+and\s+",inp)   #Split names apart
  inp = [ParseName(n) for n in inp]  #Build a real list (map() is lazy in Python 3)
  CombineNames(inp)
  return inp

for e in examples:
  print(e)
  print(ParseString(e))

Output (dictionary key order may differ depending on the Python version):

John Jeffries
[{'middle': None, 'titles': [], 'last': 'Jeffries', 'first': 'John'}]
John Jeffries, M.D.
[{'middle': None, 'titles': ['M.D.'], 'last': 'Jeffries', 'first': 'John'}]
John Jeffries, MD
[{'middle': None, 'titles': ['MD'], 'last': 'Jeffries', 'first': 'John'}]
John Jeffries and Jim Smith
[{'middle': None, 'titles': [], 'last': 'Jeffries', 'first': 'John'}, {'middle': None, 'titles': [], 'last': 'Smith', 'first': 'Jim'}]
John and Jim Jeffries
[{'middle': None, 'titles': [], 'last': 'Jeffries', 'first': 'John'}, {'middle': None, 'titles': [], 'last': 'Jeffries', 'first': 'Jim'}]
John Jeffries & Jennifer Wilkes-Smith, DDS, MD
[{'middle': None, 'titles': [], 'last': 'Jeffries', 'first': 'John'}, {'middle': None, 'titles': ['DDS', 'MD'], 'last': 'Wilkes-Smith', 'first': 'Jennifer'}]
John Jeffries, CPA & Jennifer Wilkes-Smith, DDS, MD
[{'middle': None, 'titles': ['CPA'], 'last': 'Jeffries', 'first': 'John'}, {'middle': None, 'titles': ['DDS', 'MD'], 'last': 'Wilkes-Smith', 'first': 'Jennifer'}]
John Jeffries, C.P.A & Jennifer Wilkes-Smith, DDS, MD
[{'middle': None, 'titles': ['C.P.A'], 'last': 'Jeffries', 'first': 'John'}, {'middle': None, 'titles': ['DDS', 'MD'], 'last': 'Wilkes-Smith', 'first': 'Jennifer'}]
John Jeffries, C.P.A., MD & Jennifer Wilkes-Smith, DDS, MD
[{'middle': None, 'titles': ['C.P.A.', 'MD'], 'last': 'Jeffries', 'first': 'John'}, {'middle': None, 'titles': ['DDS', 'MD'], 'last': 'Wilkes-Smith', 'first': 'Jennifer'}]
John Jeffries M.D. and Jennifer Holmes CPA
[{'middle': None, 'titles': ['M.D.'], 'last': 'Jeffries', 'first': 'John'}, {'middle': None, 'titles': ['CPA'], 'last': 'Holmes', 'first': 'Jennifer'}]
John Jeffries M.D. & Jennifer Holmes CPA
[{'middle': None, 'titles': ['M.D.'], 'last': 'Jeffries', 'first': 'John'}, {'middle': None, 'titles': ['CPA'], 'last': 'Holmes', 'first': 'Jennifer'}]
John Jimmy Jeffries, C.P.A., MD & Jennifer Wilkes-Smith, DDS, MD
[{'middle': 'Jimmy', 'titles': ['C.P.A.', 'MD'], 'last': 'Jeffries', 'first': 'John'}, {'middle': None, 'titles': ['DDS', 'MD'], 'last': 'Wilkes-Smith', 'first': 'Jennifer'}]
John Jeffries, C.P.A., MD & Jennifer Jenny Wilkes-Smith, DDS, MD
[{'middle': None, 'titles': ['C.P.A.', 'MD'], 'last': 'Jeffries', 'first': 'John'}, {'middle': 'Jenny', 'titles': ['DDS', 'MD'], 'last': 'Wilkes-Smith', 'first': 'Jennifer'}]

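The whole thing took less than fifteen minutes, the logic is clear at every stage, and the program can be debugged piece by piece. One-liners are cute, but clarity and testability should take priority.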
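
As an illustration of what testing one stage in isolation might look like, here is a small table-driven check of the title rule. `is_title` is a stand-alone copy of the `IsTitle` rule above written for this sketch, and the `CASES` table is made up for the example. It also surfaces an edge case worth knowing about: a mixed-case title such as "PhD" is not recognized by this rule.

```python
import re

def is_title(token):
    # Same rule as IsTitle above: capital letters, each optionally followed by "."
    return re.match(r'^([A-Z]\.?)+$', token.strip()) is not None

# Hypothetical test table: token -> whether it should count as a title.
CASES = {
    "MD": True,
    "M.D.": True,
    "C.P.A": True,
    "Jeffries": False,
    "Wilkes-Smith": False,
    "": False,
    "PhD": False,  # edge case: mixed-case titles slip through as names
}

for token, expected in CASES.items():
    assert is_title(token) == expected, token
```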