What does it mean that there is faster failure with atomic grouping?

Note: this question is a bit long because it includes a section from the book.

I was reading about atomic groups in Mastering Regular Expressions.

The book states that atomic groups lead to faster failure. Quoting the relevant section:

Faster failures with atomic grouping. Consider ^\w+: applied to Subject. We can see, just by looking at it, that it will fail because the text doesn’t have a colon in it, but the regex engine won’t reach that conclusion until it actually goes through the motions of checking.

So, by the time : is first checked, the \w+ will have marched to the end of the string. This results in a lot of states — one skip me state for each match of \w by the plus (except the first, since plus requires one match). When then checked at the end of the string, : fails, so the regex engine backtracks to the most recently saved state:

at which point the : fails again, this time trying to match t. This backtrack-test fail cycle happens all the way back to the oldest state:

After the attempt from the final state fails, overall failure can finally be announced. All that backtracking is a lot of work that after just a glance we know to be unnecessary. If the colon can’t match after the last letter, it certainly can’t match one of the letters the + is forced to give up!

So, knowing that none of the states left by \w+, once it’s finished, could possibly lead to a match, we can save the regex engine the trouble of checking them: ^(?>\w+):. By adding the atomic grouping, we use our global knowledge of the regex to enhance the local working of \w+ by having its saved states (which we know to be useless) thrown away. If there is a match, the atomic grouping won’t have mattered, but if there’s not to be a match, having thrown away the useless states lets the regex come to that conclusion more quickly.


I tried these regexes here. ^\w+: takes 4 steps, and ^(?>\w+): takes 6 steps (with internal engine optimizations disabled).


My questions

  1. The second paragraph of the section above says:

So, by the time : is first checked, the \w+ will have marched to the end of the string. This results in a lot of states — one skip me state for each match of \w by the plus (except the first, since plus requires one match).When then checked at the end of the string, : fails, so the regex engine backtracks to the most recently saved state:

at which point the : fails again, this time trying to match t. This backtrack-test fail cycle happens all the way back to the oldest state:

But on this site I don't see any backtracking. Why?

Is some optimization still going on internally (even after disabling it)?

  2. Can the number of steps a regex takes decide whether it performs better than another regex?

The debugger on that site seems to gloss over the details of backtracking. RegexBuddy does a better job. Here's what it shows for ^\w+:

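After \w+ consumes all the letters, it tries to match : and fails. Then it gives back one character, tries : again, and fails again. And so on, until there is nothing left to give back. Fifteen steps in total. Now look at the atomic version (^(?>\w+):):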

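After failing to match : the first time, it gives back all the letters at once, as if they were a single character. Five steps in total, two of which are entering and leaving the group. Using a possessive quantifier (^\w++:) eliminates even those: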

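As for your second question: yes, the step count from a regex debugger is useful, especially if you are just learning regex. Every regex flavor has at least a few optimizations that let even badly written regexes perform adequately, but a debugger (especially a flavor-neutral one like RegexBuddy's) makes it obvious when you are doing something wrong.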
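To make the backtracking described above concrete, here is a rough Python sketch that simulates only these two patterns by hand. It is not how any real engine is implemented, and the names is_word_char and match_word_plus_colon are made up for illustration. It counts just the attempts to match the colon, so its numbers will not line up with the step counts a debugger reports, which also count things like backtracks and group entry and exit.

```python
def is_word_char(ch: str) -> bool:
    # Rough stand-in for \w (an assumption; real flavors differ in detail).
    return ch.isalnum() or ch == "_"


def match_word_plus_colon(text: str, atomic: bool = False):
    r"""Simulate ^\w+: (or ^(?>\w+): when atomic=True) at the start of text.

    Returns (matched, colon_attempts).
    """
    # Greedy \w+: consume as many word characters as possible.
    end = 0
    while end < len(text) and is_word_char(text[end]):
        end += 1
    if end == 0:
        return False, 0  # \w+ requires at least one character

    # Without the atomic group, every character after the first leaves a
    # saved state to fall back to. The atomic group discards those states
    # once \w+ has finished, so only the final position remains.
    positions = [end] if atomic else list(range(end, 0, -1))

    colon_attempts = 0
    for pos in positions:
        colon_attempts += 1
        if pos < len(text) and text[pos] == ":":
            return True, colon_attempts
    return False, colon_attempts


print(match_word_plus_colon("Subject"))               # (False, 7)
print(match_word_plus_colon("Subject", atomic=True))  # (False, 1)
print(match_word_plus_colon("Subject: hi"))           # (True, 1)
```

On the failing input the plain version tries the colon seven times (once at the end, then once per saved state), while the atomic version tries it once. When there is a match, both behave the same, which is the point the book makes.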