Regex is always greedy, even when I give it non-capturing parentheses?

Question

I have strings like this:

strings = [
'title : Booking things author J smith',
'title : Unbe God author:  K. sweet'
]

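The strings may or may not have a colon between "title" and the title itself, and between "author" and the author's name. But they will always contain the words "title" and "author".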

I want to capture the titles like this: Booking things, and Unbe God.

I have two regexes:

regex1 = r'(?:title\s*:?\s*)[\w\s]+(?=author)'
regex2 = r'(?<=title)(?:\s*:?\s*)[\w\s]+(?=author)'  # bad because regex is greedy?

The results are:
Regex 1:

import re
re.findall(regex1, strings[0], flags=re.I)
['title : Booking things ']

Regex 2:

import re
re.findall(regex2, strings[0], flags=re.I)
[' : Booking things ']

For the first one, regex1, I thought that the non-capturing (?:) would tell it not to capture the word title. How do I tell it not to capture the word title without using a lookbehind?

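In the second one, regex2, I did use a look-behind, but then I ran into a similar problem. How do I tell it not to capture the :, but still look behind for the word title? I am also working around the fact that a look-behind has to be fixed-width.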

Answer 1

I thought that the non-capturing (?:) would tell it not to capture the word title

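Non-capturing groups still consume text. They only match (grab the text and add it to the match result) but do not capture (that is, store a part of the matched value in a specific numbered or named buffer). To check for presence or absence without consuming anything, there are only lookarounds (or anchors).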
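
The difference between consuming and capturing can be seen in a minimal sketch like the one below. It reuses the first sample string and regex1 from the question; the variable names `text` and `m` are illustrative only.

```python
import re

text = 'title : Booking things author J smith'

# The non-capturing group still consumes "title : ", so the prefix
# ends up in the overall match (group 0).
m = re.search(r'(?:title\s*:?\s*)[\w\s]+(?=author)', text)
print(repr(m.group(0)))

# No capturing groups were defined, so there is nothing stored separately.
print(m.groups())

# The lookahead (?=author) only checks what follows; "author" is not consumed.
print(m.group(0).endswith('author'))
```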

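Evidently, you want to discard the prefix title : from the match. You cannot use a lookbehind for that, because variable-width lookbehinds (ones with quantifiers inside) are not allowed in the Python re module. The usual workaround is to use a capturing group around the pattern you need to get.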
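
As a quick check of the fixed-width restriction, the sketch below tries to compile a lookbehind that contains quantifiers. The pattern in `variable_width` is only an assumption about what one might attempt; it does not appear in the question.

```python
import re

# Hypothetical attempt: put the whole optional prefix inside a lookbehind.
variable_width = r'(?<=title\s*:?\s*)[\w\s]+(?=author)'

try:
    re.compile(variable_width)
    compiled = True
except re.error as exc:
    # The re module rejects lookbehinds whose width is not fixed.
    compiled = False
    print('rejected:', exc)

# A lookbehind of fixed width, such as (?<=title), compiles without error.
fixed = re.compile(r'(?<=title)(?:\s*:?\s*)[\w\s]+(?=author)')
```

This is why regex2 in the question compiles but cannot skip the optional colon and whitespace: only the fixed-width word title fits inside the lookbehind.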

You can set a capturing group around the [\w\s]+ subpattern to capture that value into Group 1:

import re
strings = [
'title : Booking things author J smith',
'title : Unbe God author:  K. sweet'
]
for x in strings:
    m = re.search(r"(?:title\s*:?\s*)([\w\s]+)(?=author)", x)
    if m:
        print(m.group(1))

Output of the sample demo:

Booking things 
Unbe God
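
Since the question calls re.findall rather than re.search, it may help to see the same pattern used that way. When a pattern contains exactly one capturing group, findall returns the text of that group instead of the whole match. The names `pattern` and `results` below are illustrative.

```python
import re

strings = [
    'title : Booking things author J smith',
    'title : Unbe God author:  K. sweet',
]

# Same pattern as above, compiled once; re.I mirrors the flag used in the question.
pattern = re.compile(r'(?:title\s*:?\s*)([\w\s]+)(?=author)', flags=re.I)

# findall yields the contents of group 1 for each match, without the prefix.
results = [pattern.findall(s) for s in strings]
print(results)
```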

Note that if you want to strip the trailing whitespace from the captured text, use a slightly adjusted regex:

(?:title\s*:?\s*)([\w\s]+?)\s+(?=author)
                         ^

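See the regex demo. The ? makes the [\w\s]+ subpattern lazy, so it matches as few characters as possible before 1 or more whitespace characters (\s+) that are followed by the literal character sequence author.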
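
A short sketch of the lazy variant applied to both sample strings follows; `lazy` and `trimmed` are illustrative names, and the code assumes every input actually matches.

```python
import re

strings = [
    'title : Booking things author J smith',
    'title : Unbe God author:  K. sweet',
]

# +? stops as early as possible; \s+ then absorbs the whitespace before "author",
# so the trailing space stays out of group 1.
lazy = re.compile(r'(?:title\s*:?\s*)([\w\s]+?)\s+(?=author)')

trimmed = [lazy.search(s).group(1) for s in strings]
print(trimmed)
```

Calling .strip() on the result of the greedy version would give the same titles here; the lazy pattern simply keeps the trimming inside the regex.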

python

regex

regex-lookarounds