Extract names between Academic Degree Variances using Regex Python

Question

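This code has trouble extracting the full name from between academic titles and degrees, for example Dr. Richard, MM or Dra. Bobby Richard Klaus, MM or Richard, MM. The title is not always Dr.; it can be any of Dr., Dra., Prof., Drs, Prof. Dr., M.Ag and ME.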

The output should look like this

Expected result

Complete Names	Names (?)
Dr. RICHARD, MM	Richard
Dra. BOBBY Richard Klaus, MM	Bobby Richard Klaus
Richard, MM	Richard

But the result I actually get is this

Actual result

Complete Names	Names
Dr. Richard, MM	Richard
Dra. Bobby Richard Klaus, MM	Richard Klaus
Richard, MM	Richard, MM

With this code

import re

def extract_names(text):
    # normalize capitalization
    text = re.sub(r"(_|-)+", " ", text).title()
    # find name between whitespace and comma
    text = re.findall(r"\s[A-Z]\w+(?:\s[A-Z]\w+?)?\s(?:[A-Z]\w+?)?[\s\.\,\;\:]", text)
    text = ' '.join(text[0].split(","))
    return text

There is also a second problem, this error

     11    text = ' '.join(text[0].split(","))
     12    return text
     13 # def extract_names(text):

IndexError: list index out of range

Answer 1

You can use

ads = r'(?:Dr[sa]?|Prof|M\.Ag|M[EM])\.?'
result = re.sub(fr'^(?:\s*{ads})+\s*|\s*,(?:\s*{ads})+$', '', text, flags=re.I)

See the regex demo.

The (?:Dr[sa]?|Prof|M\.Ag|M[EM])\.? pattern matches Dr, Drs, Dra, Prof, M.Ag, ME or MM, optionally followed by a dot.
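
As a quick check of what the ads pattern accepts on its own, here is a minimal sketch. The sample tokens are made up for illustration, and re.fullmatch is used here only to test each token in isolation; it is not part of the answer's solution.

```python
import re

# Title/degree pattern from the answer above.
ads = r'(?:Dr[sa]?|Prof|M\.Ag|M[EM])\.?'

# Hypothetical sample tokens, chosen only for illustration.
samples = ["Dr", "Dr.", "Drs", "Dra.", "Prof.", "M.Ag", "ME", "MM", "Richard"]

# fullmatch requires the whole token to be a title, so "Richard" is rejected.
matched = [s for s in samples if re.fullmatch(ads, s, flags=re.I)]
print(matched)
```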

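The main pattern, ^(?:\s*{ads})+\s*|\s*,(?:\s*{ads})+$, matches

^(?:\s*{ads})+\s* - start of string, then one or more sequences of zero or more whitespaces and the ads pattern, and then zero or more whitespaces
| - or
\s*, - zero or more whitespaces and a comma
(?:\s*{ads})+ - one or more repetitions of zero or more whitespaces and the ads pattern
$ - end of string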
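
Putting the pieces together, the following is a minimal sketch of how the substitution could be wrapped in a function and run on the sample inputs from the question. The function name extract_name, the constant names, and the extra "Prof. Dr. Richard" input are assumptions made for this example; the .title() and underscore/hyphen clean-up are carried over from the question's code rather than from the answer.

```python
import re

# Patterns taken from the answer; the constant names are illustrative.
ADS = r'(?:Dr[sa]?|Prof|M\.Ag|M[EM])\.?'
STRIP_TITLES = re.compile(fr'^(?:\s*{ADS})+\s*|\s*,(?:\s*{ADS})+$', flags=re.I)

def extract_name(text):
    """Strip leading titles and trailing ', degree' parts, then title-case."""
    # Same clean-up as the question: runs of "_" or "-" become a space.
    text = re.sub(r"[_-]+", " ", text)
    # sub() returns the input unchanged when nothing matches,
    # so there is no list to index and no IndexError.
    return STRIP_TITLES.sub('', text).strip().title()

for raw in ["Dr. RICHARD, MM", "Dra. BOBBY Richard Klaus, MM",
            "Richard, MM", "Prof. Dr. Richard"]:
    print(f"{raw!r} -> {extract_name(raw)!r}")
```

One caveat with this sketch: matching is case-insensitive and the title alternatives are not followed by a word boundary, so a name that starts with the same letters as a title (for example "Meredith" after "Dr.") may lose its first letters. Adding \b right after the alternation group, before \.?, is one possible guard.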

python

regex

extract

regex-group