为什么 /\w+:/ 和 /\S+:/ 处理回溯的方式不同？

Question

我使用 regex101 分析了这两个正则表达式。我觉得/\S+:/的回溯是对的。但我无法理解这种差异。我错了吗？

Answer 1

虽然这似乎是特定于实现的（RegexBuddy 未显示此行为），但可以解释如下：

\w 无法匹配 :，但 \S 可以。因此，\S+: 需要在确保 get 无法匹配之前检查输入字符串的更多变体。

更优化的正则表达式引擎将更快地排除不可能的匹配（例如，当正则表达式包含匹配的当前部分中不存在的文字字符时），但显然 regex101 使用的引擎没有做那。

Answer 2

这是一个名为 auto-possessification 的 pcre 优化。

来自 http://pcre.org/pcre.txt:

PCRE's "auto-possessification" optimization usually applies to character repeats at the end of a pattern (as well as internally). For example, the pattern "a\d+" is compiled as if it were "a\d++" because there is no point even considering the possibility of backtracking into the repeated digits.

和

This is an optimization that, for example, turns a+b into a++b in order to avoid backtracks into a+ that can never be successful.

由于 : 不包含在 \w 中，您的模式被解释为 \w++:（第二个 + 防止回溯，see possessive quantifiers）。避免了额外的回溯状态，因为没有可能匹配的其他状态。

另一方面，: 包含在 \S 中，因此此优化不适用于第二种情况。

PCRETEST

您可以使用 pcretest (there's a Windows version you can download here).

查看差异

模式/\w+:/需要11步并输出：

/\w+:/
--->get accept:
 +0 ^               \w+
 +3 ^  ^            :
 +0  ^              \w+
 +3  ^ ^            :
 +0   ^             \w+
 +3   ^^            :
 +0    ^            \w+
 +0     ^           \w+
 +3     ^     ^     :
 +4     ^      ^    .*
 +6     ^      ^    
 0: accept:

但是，如果我们使用控制动词 (*NO_AUTO_POSSESS)，它会禁用此优化，模式 /(*NO_AUTO_POSSESS)\w+:/ 需要 14 个步骤 并输出：

/(*NO_AUTO_POSSESS)\w+:/
--->get accept:
+18 ^               \w+
+21 ^  ^            :
+21 ^ ^             :
+21 ^^              :
+18  ^              \w+
+21  ^ ^            :
+21  ^^             :
+18   ^             \w+
+21   ^^            :
+18    ^            \w+
+18     ^           \w+
+21     ^     ^     :
+22     ^      ^    .*
+24     ^      ^    
 0: accept:

^{- 比\S+少1步，符合预期，因为\w+不匹配:.}

~~很遗憾regex101不支持这个动词~~

更新： regex101 现在支持这个动词，这里是 link 要比较的 3 个案例：

/\S+:/ (14 步) - https://regex101.com/r/cw7hGh/1/debugger
/\w+:/（10 步）- https://regex101.com/r/cw7hGh/2/debugger
/(*NO_AUTO_POSSESS)\w+:/ (13 步) - https://regex101.com/r/cw7hGh/3/debugger

regex101 调试器：

为什么 /\w+:/ 和 /\S+:/ 处理回溯的方式不同？

Why do /\w+:/ and /\S+:/ handle backtracking differently?

regex

pcre

backtracking

PCRETEST