Python regex: greedy pattern returning multiple empty matches

Question

This pattern is simply meant to grab everything in a string up to the first potential sentence boundary in the data:

[^\.?!\r\n]*

Output:

>>> import re
>>> pattern = re.compile(r"([^\.?!\r\n]*)")
>>> matches = pattern.findall("Australians go hard!!!") # Actual source snippet, not a personal comment about Australians. :-)
>>> print(matches)
['Australians go hard', '', '', '', '']

From the Python documentation:

re.findall(pattern, string, flags=0)

Return all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found. If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result unless they touch the beginning of another match.

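Now, if the string is scanned left-to-right and the * operator is greedy, it makes sense that the first match returned is the whole string up to the exclamation marks. However, after that portion has been consumed, I do not see how the pattern produces an empty match exactly four times, presumably by scanning the string leftward after the "d". I do understand that the * operator means this pattern can match the empty string; I just don't see how it would do that more than once between the trailing "d" of the letters and the leading "!" of the punctuation.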

Adding the ^ anchor has this effect:

>>> pattern = re.compile(r"^([^\.?!\r\n]*)")
>>> matches = pattern.findall("Australians go hard!!!")
>>> print(matches)
['Australians go hard']

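Since this eliminates the empty-string matches, it would seem to indicate that those empty matches were occurring before the leading "A" of the string. But that would seem to contradict the documentation's statement that matches are returned in the order found (a match before the leading "A" should have come first), and, once again, the count of exactly four empty matches puzzles me.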

Answer 1

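The * quantifier allows the pattern to capture a zero-length substring. In your original version of the code (without the ^ anchor in front), the additional matches are: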

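The zero-length string between the end of "hard" and the first "!"
The zero-length string between the first and second "!"
The zero-length string between the second and third "!"
The zero-length string between the third "!" and the end of the text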
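
The four positions listed above can be checked by asking for match offsets instead of just the matched text. This is a minimal sketch, assuming Python 3 and the same pattern and input as in the question; the variable names `text` and `spans` are only illustrative.

```python
import re

pattern = re.compile(r"([^\.?!\r\n]*)")
text = "Australians go hard!!!"

# Collect (start, end, matched text) for every match, in the order found.
spans = [(m.start(), m.end(), m.group(1)) for m in pattern.finditer(text)]

for start, end, matched in spans:
    print(start, end, repr(matched))
# 0 19 'Australians go hard'
# 19 19 ''
# 20 20 ''
# 21 21 ''
# 22 22 ''
```

Each empty match has equal start and end offsets: one sits in front of each "!" (indices 19, 20, 21) and one at the very end of the text (index 22). None of them occurs before the leading "A".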

If you like, you can slice and dice this further here.

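Now, adding the ^ anchor to the front ensures that only one substring can match the pattern, since the beginning of the input text occurs exactly once.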
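
As a rough illustration of the anchored case, the sketch below (again assuming Python 3; `anchored`, `anchored_spans`, and `first` are illustrative names) shows that only the match starting at index 0 survives. It also shows `re.match`, which only attempts a match at the start of the string and so may be a simpler fit when just the first chunk is wanted.

```python
import re

text = "Australians go hard!!!"
anchored = re.compile(r"^([^\.?!\r\n]*)")

# Without re.MULTILINE, ^ matches only at index 0, so every later
# start position (including the empty ones) is rejected.
anchored_spans = [(m.start(), m.end()) for m in anchored.finditer(text)]
print(anchored_spans)  # [(0, 19)]

# re.match tries the pattern only at the beginning of the string,
# so no explicit ^ is needed to get a single result.
first = re.match(r"[^\.?!\r\n]*", text)
print(first.group(0))  # Australians go hard
```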

python

regex

pattern-matching