Python regular expressions: match a word only if it is preceded by a comma and a space, or if it is the first word

Given a string like this:

'Rob and Amber Mariano, Heather Robinson, Jane and John Smith, Kiwan and Nichols Brady John, Jimmy Nichols, Melanie Carbone, Jim Green and Nancy Brown, Todd and Sana Clegg with Tatiana Perkin'

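I want to identify couples or other family members who might be referred to as "John and Jane Doe", and exclude cases such as "Jim Green and Nancy Brown".

I only want to identify the following: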

Rob and Amber Mariano, Jane and John Smith, Kiwan and Nichols Brady John, Todd and Sana Clegg

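The groups in the regular expression below seem to capture most of the cases I want, but I can't exclude "Jim Green".

I want to require that the first word is a name, and that it is either at the start of the string or preceded only by a comma and a space.

But for some reason my expression doesn't work. I expected ([^|,\s']?) to take care of that, but it doesn't seem to.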

([^|\,\s]?)([A-Z][a-zA-Z]+)(\s*and\s*)([A-Z][a-zA-Z]+)(\s[A-Z][a-zA-Z]+)(\s[A-Z][a-zA-Z]+)?
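A quick check suggests why the prefix group does not restrict anything. Inside square brackets, `^` negates the class and `|` is a literal character, so `[^|\,\s]?` means "optionally one character that is not a pipe, comma, or whitespace". Because it is optional, it can match nothing at all, and the search is free to start at "Green". The sketch below runs the pattern from the question unchanged; the variable names are only illustrative.

```python
import re

# The pattern from the question, split across two lines for readability
question_pattern = re.compile(
    r"([^|\,\s]?)([A-Z][a-zA-Z]+)(\s*and\s*)([A-Z][a-zA-Z]+)"
    r"(\s[A-Z][a-zA-Z]+)(\s[A-Z][a-zA-Z]+)?"
)

m = question_pattern.search("Jim Green and Nancy Brown")

# The optional prefix group matched the empty string ...
print(repr(m.group(1)))
# ... so the overall match simply starts in the middle of the entry
print(m.group(0))
```

An empty first group together with a match that begins at "Green" shows that the prefix never forces the match to sit at the start of the string or right after a comma.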

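Let's break the answer into 2 simple steps.

  1. Split the whole string into a list of name entries.
  2. Keep the entries that match the requested couple pattern.

We are interested in couple names that follow this pattern:

<Name1> and <Name2> <Last-name> <May-or-may-not-be-words-separated-by-spaces>.

But we only want the <Name1> and <Name2> <Last-name> part of each matching string. Now that we have defined what we want to do, here is the code for it.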

import re

testStr = """Rob and Amber Mariano, Heather Robinson, 
Jane and John Smith, Kiwan and Nichols Brady John, 
Jimmy Nichols, Melanie Carbone, Jim Green and Nancy Brown, 
Todd and Sana Clegg with Tatiana Perkin
"""

# Pattern definition for the match (raw string so that \w and \s reach the regex engine untouched)
regExpr = re.compile(r"^(\w+\sand\s\w+\s\w+)(\s\w+)*")

# Remove whitespaces introduced at the beginning due to splitting
coupleList = [s.strip() for s in testStr.split(',')]

# Find all strings that have a matching string, for rest match() returns None
matchedList = [regExpr.match(s) for s in coupleList]

# Select the first group, which extracts the needed part from every matched string
result = [m.group(1) for m in matchedList if m is not None]
print(result)

Try this... it works as expected:

(,\s|^)([A-Z][a-z]+\sand\s[A-Z][a-z]+(\s[A-Z][a-z]+)+)

Test script:

import re
a = re.findall(r"(,\s|^)([A-Z][a-z]+\sand\s[A-Z][a-z]+(\s[A-Z][a-z]+)+)", "Rob and Amber Mariano, Heather Robinson, Jane and John Smith, Kiwan and Nichols Brady John, Jimmy Nichols, Melanie Carbone, Jim Green and Nancy Brown, Todd and Sana Clegg with Tatiana Perkin")
print(a)

Output:

[('', 'Rob and Amber Mariano', ' Mariano'), (', ', 'Jane and John Smith', ' Smith'), (', ', 'Kiwan and Nichols Brady John', ' John'), (', ', 'Todd and Sana Clegg', ' Clegg')]
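The tuples appear because `re.findall` returns every capturing group of the pattern. If only the couple names are needed, one option is to make the prefix and the repeated surname group non-capturing with `(?:...)`. This is a small variation on the pattern above rather than a different approach; the variable names are illustrative.

```python
import re

text = ("Rob and Amber Mariano, Heather Robinson, Jane and John Smith, "
        "Kiwan and Nichols Brady John, Jimmy Nichols, Melanie Carbone, "
        "Jim Green and Nancy Brown, Todd and Sana Clegg with Tatiana Perkin")

# Same pattern, but only the couple name is a capturing group
pattern = r"(?:,\s|^)([A-Z][a-z]+\sand\s[A-Z][a-z]+(?:\s[A-Z][a-z]+)+)"

couples = re.findall(pattern, text)
print(couples)
# ['Rob and Amber Mariano', 'Jane and John Smith', 'Kiwan and Nichols Brady John', 'Todd and Sana Clegg']
```

With a single capturing group, `re.findall` returns a flat list of strings instead of a list of tuples.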

A bit late, but this is possibly the simplest regex:

import re

regex = r"(?:, |^)(\w+\sand\s\w+\s\w+)"

test_str = "Rob and Amber Mariano, Heather Robinson, Jane and John Smith, Kiwan and Nichols Brady, John, Jimmy Nichols, Melanie Carbone, Jim Green and Nancy Brown, Todd and Sana Clegg with Tatiana Perkin"

matches = re.finditer(regex, test_str, re.MULTILINE)

# Print the captured couple name (group 1) of every match
for match in matches:
    print(match.group(1))

Output:

Rob and Amber Mariano
Jane and John Smith
Kiwan and Nichols Brady
Todd and Sana Clegg
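Since this pattern has exactly one capturing group, the same names can also be collected into a list with `re.findall`, which may be handier than printing inside a loop. A minimal sketch, reusing the regex and test string from this answer (the name `names` is illustrative):

```python
import re

regex = r"(?:, |^)(\w+\sand\s\w+\s\w+)"

test_str = ("Rob and Amber Mariano, Heather Robinson, Jane and John Smith, "
            "Kiwan and Nichols Brady, John, Jimmy Nichols, Melanie Carbone, "
            "Jim Green and Nancy Brown, Todd and Sana Clegg with Tatiana Perkin")

# With one capturing group, findall returns the text of that group for each match
names = re.findall(regex, test_str)
print(names)
# ['Rob and Amber Mariano', 'Jane and John Smith', 'Kiwan and Nichols Brady', 'Todd and Sana Clegg']
```

Note that `\w+` accepts any word, capitalized or not, so this version relies only on the position of "and" and on the comma-or-start prefix.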