Surprising, but correct behavior of a greedy subexpression in a positive lookbehind assertion

Question

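Note:

The observed behavior is correct, but may be surprising at first; it was to me, and I think it may be to others as well - though probably not to those intimately familiar with regex engines.
The repeatedly suggested duplicate, Regex lookahead, lookbehind and atomic groups, contains general information about look-around assertions, but does not address the specific misconception at hand, as discussed in more detail in the comments below.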

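Using a greedy, by definition variable-width subexpression inside a positive lookbehind assertion can exhibit surprising behavior.

The examples use PowerShell for convenience, but the behavior applies to the .NET regex engine in general:

This command works as I intuitively expect: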

# OK:  
#     The subexpression matches greedily from the start up to and
#     including the last "_", and, by including the matched string ($&) 
#     in the replacement string, effectively inserts "|" there - and only there.
PS> 'a_b_c' -replace '^.+_', '$&|'
a_b_|c

The following command, which uses a positive lookbehind assertion, (?<=...), is seemingly equivalent - but isn't:

# CORRECT, but SURPRISING:
#     Use a positive lookbehind assertion to *seemingly* match
#     only up to and including the last "_", and insert a "|" there.
PS> 'a_b_c' -replace '(?<=^.+_)', '|'
a_|b_|c  # !! *multiple* insertions were performed

Why isn't it equivalent? Why are multiple insertions performed?

Answer 1

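tl;dr:

Inside a lookbehind assertion, a greedy subexpression in effect behaves non-greedily (in global matching, in addition to acting greedily), because every prefix string of the input string is considered.

My problem was that I hadn't considered that, with a lookbehind assertion, each and every character position in the input string must be checked: at each position, the text preceding that position is tested against the subexpression inside the lookbehind assertion.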
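The per-position check described above can be simulated by hand. The following is only an illustrative sketch, not a description of how the .NET engine is implemented: for each character position it takes the text to the left and tests it against the lookbehind's subexpression. The variable names and the trailing \z (which forces the subexpression to end exactly at the position under consideration) are assumptions added for the simulation.

```powershell
$text = 'a_b_c'

# For each position 0..Length, take the prefix to the left of that position
# and test whether the lookbehind's subexpression matches that prefix in full.
$hits = foreach ($i in 0..$text.Length) {
    $prefix = $text.Substring(0, $i)
    if ($prefix -match '^.+_\z') { $i }
}

# Positions at which the lookbehind assertion would succeed.
$hits
```

For 'a_b_c' this should report positions 2 and 4, that is, the positions right after each "_".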

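Combined with the invariably global replacement that PowerShell's -replace operator performs (that is, all possible matches are replaced), this resulted in multiple insertions:

That is, the greedy, anchored subexpression ^.+_ legitimately matched twice when considering the text to the left of the character position currently being examined:

First, when a_ was the text to the left.
Then, when a_b_ was the text to the left.

Therefore, two insertions of | resulted.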
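To see the two matches directly, the matches can be enumerated with the .NET [regex]::Matches() method. This is a small sketch for inspection only; the variable name $found is an assumption made for illustration. Each match is zero-length, because a lookbehind assertion consumes no characters.

```powershell
# Enumerate all matches of the lookbehind-only pattern.
$found = [regex]::Matches('a_b_c', '(?<=^.+_)')

# Number of matches, then index and (empty) value of each match.
$found.Count
$found | ForEach-Object { '{0}: "{1}"' -f $_.Index, $_.Value }
```

This should show two empty matches, at indices 2 and 4, which is where -replace inserts the "|".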

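By contrast, without a lookbehind assertion, the greedy expression ^.+_ by definition matches only once, through to the last _, because it is applied only to the entire input string.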
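If a single insertion after the last "_" is wanted while still using a lookbehind assertion, one possible approach (not part of the original answer, and assuming single-line input) is to add a negative lookahead that rejects any position with another "_" to its right. The variable name $result is hypothetical.

```powershell
# (?<=^.+_) succeeds after every "_" that is preceded by at least one character;
# (?!.*_) then rejects positions that still have a "_" somewhere to their right,
# so only the position after the last "_" remains.
$result = 'a_b_c' -replace '(?<=^.+_)(?!.*_)', '|'
$result
```

With the sample input this should yield a_b_|c, the same as the non-lookbehind command with $&.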

.net

regex

regex-greedy

regex-lookarounds