我可以使用 python 're' 来解析复杂的人名吗?
Can I use python 're' to parse complex human names?
所以我的主要痛点之一是名字理解和拼凑家喻户晓的名字和头衔。我有一个 80% 的解决方案,其中包含一个非常庞大的正则表达式,我今天早上放在一起,我可能不应该为此感到自豪(但无论如何我都以一种病态的方式)正确匹配以下示例:
John Jeffries
John Jeffries, M.D.
John Jeffries, MD
John Jeffries and Jim Smith
John and Jim Jeffries
John Jeffries & Jennifer Wilkes-Smith, DDS, MD
John Jeffries, CPA & Jennifer Wilkes-Smith, DDS, MD
John Jeffries, C.P.A & Jennifer Wilkes-Smith, DDS, MD
John Jeffries, C.P.A., MD & Jennifer Wilkes-Smith, DDS, MD
John Jeffries M.D. and Jennifer Holmes CPA
John Jeffries M.D. & Jennifer Holmes CPA
正则表达式匹配器如下所示:
(?P<first_name>\S*\s*)?(?!and\s|&\s)(?P<last_name>[\w-]*\s*)(?P<titles1>,?\s*(?!and\s|&\s)[\w\.]*,*\s*(?!and\s|&\s)[\w\.]*)?(?P<connector>\sand\s|\s*&*\s*)?(?!and\s|&\s)(?P<first_name2>\S*\s*)(?P<last_name2>[\w-]*\s*)?(?P<titles2>,?\s*[\w\.]*,*\s*[\w\.]*)?
(wtf 对吗?)
为方便起见:http://www.pyregex.com/
因此,例如:
'John Jeffries, C.P.A., MD & Jennifer Wilkes-Smith, DDS, MD'
正则表达式生成的组字典如下所示:
connector: &
first_name: John
first_name2: Jennifer
last_name: Jeffries
last_name2: Wilkes-Smith
titles1: , C.P.A., MD
titles2: , DDS, MD
我需要帮助来完成最后一步,它一直困扰着我,理解可能的中间名。
示例包括:
'John Jimmy Jeffries, C.P.A., MD & Jennifer Wilkes-Smith, DDS, MD'
'John Jeffries, C.P.A., MD & Jennifer Jenny Wilkes-Smith, DDS, MD'
这可能吗?有没有更好的方法在没有机器学习的情况下做到这一点?也许我可以使用 nameparser (在我进入正则表达式兔子洞后发现)而不是某种方式来确定是否有多个名称?以上符合我99.9%的情况,所以我觉得值得完成。
TLDR: I can't figure out if I can use some sort of lookahead or lookbehind to make sure that the possible middle name only matches if
there is a last name after it.
注意:我不需要解析 Mr. Mrs. Ms. 等头衔,但我想可以像添加中间名一样添加。
Solution Notes: First, follow Richard's advice and don't do this. Second, investigate NLTK or use/contribute to nameparser for a more robust solution if necessary.
像这样的正则表达式是 Dark One 的杰作。
谁在稍后查看您的代码时能够理解发生了什么?你会吗?
您将如何测试所有可能的边缘情况?
您为什么选择使用正则表达式?如果您正在使用的工具很难使用,则表明也许其他工具会更好。
试试这个:
import re
examples = [
"John Jeffries",
"John Jeffries, M.D.",
"John Jeffries, MD",
"John Jeffries and Jim Smith",
"John and Jim Jeffries",
"John Jeffries & Jennifer Wilkes-Smith, DDS, MD",
"John Jeffries, CPA & Jennifer Wilkes-Smith, DDS, MD",
"John Jeffries, C.P.A & Jennifer Wilkes-Smith, DDS, MD",
"John Jeffries, C.P.A., MD & Jennifer Wilkes-Smith, DDS, MD",
"John Jeffries M.D. and Jennifer Holmes CPA",
"John Jeffries M.D. & Jennifer Holmes CPA",
'John Jimmy Jeffries, C.P.A., MD & Jennifer Wilkes-Smith, DDS, MD',
'John Jeffries, C.P.A., MD & Jennifer Jenny Wilkes-Smith, DDS, MD'
]
def IsTitle(inp):
return re.match('^([A-Z]\.?)+$',inp.strip())
def ParseName(name):
#Titles are separated from each other and from names with ","
#We don't need these, so we remove them
name = name.replace(',',' ')
#Split name and titles on spaces, combining adjacent spaces
name = name.split()
#Build an output object
ret_name = {"first":None, "middle":None, "last":None, "titles":[]}
#First string is always a first name
ret_name['first'] = name[0]
if len(name)>2: #John Johnson Smith/PhD
if IsTitle(name[2]): #John Smith PhD
ret_name['last'] = name[1]
ret_name['titles'] = name[2:]
else: #John Johnson Smith, PhD, MD
ret_name['middle'] = name[1]
ret_name['last'] = name[2]
ret_name['titles'] = name[3:]
elif len(name) == 2: #John Johnson
ret_name['last'] = name[1]
return ret_name
def CombineNames(inp):
if not inp[0]['last']:
inp[0]['last'] = inp[1]['last']
def ParseString(inp):
inp = inp.replace("&","and") #Names are combined with "&" or "and"
inp = re.split("\s+and\s+",inp) #Split names apart
inp = map(ParseName,inp)
CombineNames(inp)
return inp
for e in examples:
print e
print ParseString(e)
输出:
John Jeffries
[{'middle': None, 'titles': [], 'last': 'Jeffries', 'first': 'John'}]
John Jeffries, M.D.
[{'middle': None, 'titles': ['M.D.'], 'last': 'Jeffries', 'first': 'John'}]
John Jeffries, MD
[{'middle': None, 'titles': ['MD'], 'last': 'Jeffries', 'first': 'John'}]
John Jeffries and Jim Smith
[{'middle': None, 'titles': [], 'last': 'Jeffries', 'first': 'John'}, {'middle': None, 'titles': [], 'last': 'Smith', 'first': 'Jim'}]
John and Jim Jeffries
[{'middle': None, 'titles': [], 'last': 'Jeffries', 'first': 'John'}, {'middle': None, 'titles': [], 'last': 'Jeffries', 'first': 'Jim'}]
John Jeffries & Jennifer Wilkes-Smith, DDS, MD
[{'middle': None, 'titles': [], 'last': 'Jeffries', 'first': 'John'}, {'middle': None, 'titles': ['DDS', 'MD'], 'last': 'Wilkes-Smith', 'first': 'Jennifer'}]
John Jeffries, CPA & Jennifer Wilkes-Smith, DDS, MD
[{'middle': None, 'titles': ['CPA'], 'last': 'Jeffries', 'first': 'John'}, {'middle': None, 'titles': ['DDS', 'MD'], 'last': 'Wilkes-Smith', 'first': 'Jennifer'}]
John Jeffries, C.P.A & Jennifer Wilkes-Smith, DDS, MD
[{'middle': None, 'titles': ['C.P.A'], 'last': 'Jeffries', 'first': 'John'}, {'middle': None, 'titles': ['DDS', 'MD'], 'last': 'Wilkes-Smith', 'first': 'Jennifer'}]
John Jeffries, C.P.A., MD & Jennifer Wilkes-Smith, DDS, MD
[{'middle': None, 'titles': ['C.P.A.', 'MD'], 'last': 'Jeffries', 'first': 'John'}, {'middle': None, 'titles': ['DDS', 'MD'], 'last': 'Wilkes-Smith', 'first': 'Jennifer'}]
John Jeffries M.D. and Jennifer Holmes CPA
[{'middle': None, 'titles': ['M.D.'], 'last': 'Jeffries', 'first': 'John'}, {'middle': None, 'titles': ['CPA'], 'last': 'Holmes', 'first': 'Jennifer'}]
John Jeffries M.D. & Jennifer Holmes CPA
[{'middle': None, 'titles': ['M.D.'], 'last': 'Jeffries', 'first': 'John'}, {'middle': None, 'titles': ['CPA'], 'last': 'Holmes', 'first': 'Jennifer'}]
John Jimmy Jeffries, C.P.A., MD & Jennifer Wilkes-Smith, DDS, MD
[{'middle': 'Jimmy', 'titles': ['C.P.A.', 'MD'], 'last': 'Jeffries', 'first': 'John'}, {'middle': None, 'titles': ['DDS', 'MD'], 'last': 'Wilkes-Smith', 'first': 'Jennifer'}]
John Jeffries, C.P.A., MD & Jennifer Jenny Wilkes-Smith, DDS, MD
[{'middle': None, 'titles': ['C.P.A.', 'MD'], 'last': 'Jeffries', 'first': 'John'}, {'middle': 'Jenny', 'titles': ['DDS', 'MD'], 'last': 'Wilkes-Smith', 'first': 'Jennifer'}]
整个过程不到十五分钟,而且每个阶段逻辑清晰,程序可以分段调试。虽然单行代码很可爱,但应优先考虑清晰度和可测试性。
所以我的主要痛点之一是名字理解和拼凑家喻户晓的名字和头衔。我有一个 80% 的解决方案,其中包含一个非常庞大的正则表达式,我今天早上放在一起,我可能不应该为此感到自豪(但无论如何我都以一种病态的方式)正确匹配以下示例:
John Jeffries
John Jeffries, M.D.
John Jeffries, MD
John Jeffries and Jim Smith
John and Jim Jeffries
John Jeffries & Jennifer Wilkes-Smith, DDS, MD
John Jeffries, CPA & Jennifer Wilkes-Smith, DDS, MD
John Jeffries, C.P.A & Jennifer Wilkes-Smith, DDS, MD
John Jeffries, C.P.A., MD & Jennifer Wilkes-Smith, DDS, MD
John Jeffries M.D. and Jennifer Holmes CPA
John Jeffries M.D. & Jennifer Holmes CPA
正则表达式匹配器如下所示:
(?P<first_name>\S*\s*)?(?!and\s|&\s)(?P<last_name>[\w-]*\s*)(?P<titles1>,?\s*(?!and\s|&\s)[\w\.]*,*\s*(?!and\s|&\s)[\w\.]*)?(?P<connector>\sand\s|\s*&*\s*)?(?!and\s|&\s)(?P<first_name2>\S*\s*)(?P<last_name2>[\w-]*\s*)?(?P<titles2>,?\s*[\w\.]*,*\s*[\w\.]*)?
(wtf 对吗?)
为方便起见:http://www.pyregex.com/
因此,例如:
'John Jeffries, C.P.A., MD & Jennifer Wilkes-Smith, DDS, MD'
正则表达式生成的组字典如下所示:
connector: &
first_name: John
first_name2: Jennifer
last_name: Jeffries
last_name2: Wilkes-Smith
titles1: , C.P.A., MD
titles2: , DDS, MD
我需要帮助来完成最后一步,它一直困扰着我,理解可能的中间名。
示例包括:
'John Jimmy Jeffries, C.P.A., MD & Jennifer Wilkes-Smith, DDS, MD'
'John Jeffries, C.P.A., MD & Jennifer Jenny Wilkes-Smith, DDS, MD'
这可能吗?有没有更好的方法在没有机器学习的情况下做到这一点?也许我可以使用 nameparser (在我进入正则表达式兔子洞后发现)而不是某种方式来确定是否有多个名称?以上符合我99.9%的情况,所以我觉得值得完成。
TLDR: I can't figure out if I can use some sort of lookahead or lookbehind to make sure that the possible middle name only matches if there is a last name after it.
注意:我不需要解析 Mr. Mrs. Ms. 等头衔,但我想可以像添加中间名一样添加。
Solution Notes: First, follow Richard's advice and don't do this. Second, investigate NLTK or use/contribute to nameparser for a more robust solution if necessary.
像这样的正则表达式是 Dark One 的杰作。
谁在稍后查看您的代码时能够理解发生了什么?你会吗?
您将如何测试所有可能的边缘情况?
您为什么选择使用正则表达式?如果您正在使用的工具很难使用,则表明也许其他工具会更好。
试试这个:
import re
examples = [
"John Jeffries",
"John Jeffries, M.D.",
"John Jeffries, MD",
"John Jeffries and Jim Smith",
"John and Jim Jeffries",
"John Jeffries & Jennifer Wilkes-Smith, DDS, MD",
"John Jeffries, CPA & Jennifer Wilkes-Smith, DDS, MD",
"John Jeffries, C.P.A & Jennifer Wilkes-Smith, DDS, MD",
"John Jeffries, C.P.A., MD & Jennifer Wilkes-Smith, DDS, MD",
"John Jeffries M.D. and Jennifer Holmes CPA",
"John Jeffries M.D. & Jennifer Holmes CPA",
'John Jimmy Jeffries, C.P.A., MD & Jennifer Wilkes-Smith, DDS, MD',
'John Jeffries, C.P.A., MD & Jennifer Jenny Wilkes-Smith, DDS, MD'
]
def IsTitle(inp):
return re.match('^([A-Z]\.?)+$',inp.strip())
def ParseName(name):
#Titles are separated from each other and from names with ","
#We don't need these, so we remove them
name = name.replace(',',' ')
#Split name and titles on spaces, combining adjacent spaces
name = name.split()
#Build an output object
ret_name = {"first":None, "middle":None, "last":None, "titles":[]}
#First string is always a first name
ret_name['first'] = name[0]
if len(name)>2: #John Johnson Smith/PhD
if IsTitle(name[2]): #John Smith PhD
ret_name['last'] = name[1]
ret_name['titles'] = name[2:]
else: #John Johnson Smith, PhD, MD
ret_name['middle'] = name[1]
ret_name['last'] = name[2]
ret_name['titles'] = name[3:]
elif len(name) == 2: #John Johnson
ret_name['last'] = name[1]
return ret_name
def CombineNames(inp):
if not inp[0]['last']:
inp[0]['last'] = inp[1]['last']
def ParseString(inp):
inp = inp.replace("&","and") #Names are combined with "&" or "and"
inp = re.split("\s+and\s+",inp) #Split names apart
inp = map(ParseName,inp)
CombineNames(inp)
return inp
for e in examples:
print e
print ParseString(e)
输出:
John Jeffries
[{'middle': None, 'titles': [], 'last': 'Jeffries', 'first': 'John'}]
John Jeffries, M.D.
[{'middle': None, 'titles': ['M.D.'], 'last': 'Jeffries', 'first': 'John'}]
John Jeffries, MD
[{'middle': None, 'titles': ['MD'], 'last': 'Jeffries', 'first': 'John'}]
John Jeffries and Jim Smith
[{'middle': None, 'titles': [], 'last': 'Jeffries', 'first': 'John'}, {'middle': None, 'titles': [], 'last': 'Smith', 'first': 'Jim'}]
John and Jim Jeffries
[{'middle': None, 'titles': [], 'last': 'Jeffries', 'first': 'John'}, {'middle': None, 'titles': [], 'last': 'Jeffries', 'first': 'Jim'}]
John Jeffries & Jennifer Wilkes-Smith, DDS, MD
[{'middle': None, 'titles': [], 'last': 'Jeffries', 'first': 'John'}, {'middle': None, 'titles': ['DDS', 'MD'], 'last': 'Wilkes-Smith', 'first': 'Jennifer'}]
John Jeffries, CPA & Jennifer Wilkes-Smith, DDS, MD
[{'middle': None, 'titles': ['CPA'], 'last': 'Jeffries', 'first': 'John'}, {'middle': None, 'titles': ['DDS', 'MD'], 'last': 'Wilkes-Smith', 'first': 'Jennifer'}]
John Jeffries, C.P.A & Jennifer Wilkes-Smith, DDS, MD
[{'middle': None, 'titles': ['C.P.A'], 'last': 'Jeffries', 'first': 'John'}, {'middle': None, 'titles': ['DDS', 'MD'], 'last': 'Wilkes-Smith', 'first': 'Jennifer'}]
John Jeffries, C.P.A., MD & Jennifer Wilkes-Smith, DDS, MD
[{'middle': None, 'titles': ['C.P.A.', 'MD'], 'last': 'Jeffries', 'first': 'John'}, {'middle': None, 'titles': ['DDS', 'MD'], 'last': 'Wilkes-Smith', 'first': 'Jennifer'}]
John Jeffries M.D. and Jennifer Holmes CPA
[{'middle': None, 'titles': ['M.D.'], 'last': 'Jeffries', 'first': 'John'}, {'middle': None, 'titles': ['CPA'], 'last': 'Holmes', 'first': 'Jennifer'}]
John Jeffries M.D. & Jennifer Holmes CPA
[{'middle': None, 'titles': ['M.D.'], 'last': 'Jeffries', 'first': 'John'}, {'middle': None, 'titles': ['CPA'], 'last': 'Holmes', 'first': 'Jennifer'}]
John Jimmy Jeffries, C.P.A., MD & Jennifer Wilkes-Smith, DDS, MD
[{'middle': 'Jimmy', 'titles': ['C.P.A.', 'MD'], 'last': 'Jeffries', 'first': 'John'}, {'middle': None, 'titles': ['DDS', 'MD'], 'last': 'Wilkes-Smith', 'first': 'Jennifer'}]
John Jeffries, C.P.A., MD & Jennifer Jenny Wilkes-Smith, DDS, MD
[{'middle': None, 'titles': ['C.P.A.', 'MD'], 'last': 'Jeffries', 'first': 'John'}, {'middle': 'Jenny', 'titles': ['DDS', 'MD'], 'last': 'Wilkes-Smith', 'first': 'Jennifer'}]
整个过程不到十五分钟,而且每个阶段逻辑清晰,程序可以分段调试。虽然单行代码很可爱,但应优先考虑清晰度和可测试性。