为什么 `\s+` 在这个 Perl 正则表达式中比 `\s\s*` 快很多?

Why is `\s+` so much faster than `\s\s*` in this Perl regex?

为什么用 \s+ 替换 \s*(甚至 \s\s*)会导致此输入的加速?

use Benchmark qw(:all);
$x=(" " x 100000) . "_\n";
$count = 100;
timethese($count, {
    '/\s\s*\n/' => sub { $x =~ /\s\s*\n/ },
    '/\s+\n/' => sub { $x =~ /\s+\n/ },
});

Link to online version

我注意到我的代码中有一个缓慢的正则表达式 s/\s*\n\s*/\n/g - 当给定一个 450KB 的输入文件时,该文件包含大量空格和一些非空格,最后一个换行符 - 正则表达式悬而未决。

我凭直觉用 s/\s+\n/\n/g; s/\n\s+/\n/g; 替换了正则表达式,一切正常。

但为什么会快这么多?使用 re Debug => "EXECUTE" 后,我注意到 \s+ 版本仅在一次迭代中以某种方式优化为 运行:http://pastebin.com/0Ug6xPiQ

Matching REx "\s*\n" against "       _%n"
Matching stclass ANYOF{i}[\x09\x0a\x0c\x0d ][{non-utf8-latin1-all}{unicode_all}] against "       _%n" (9 bytes)
   0 <> <       _%n>         |  1:STAR(3)
                                  SPACE can match 7 times out of 2147483647...
                                  failed...
   1 < > <      _%n>         |  1:STAR(3)
                                  SPACE can match 6 times out of 2147483647...
                                  failed...
   2 <  > <     _%n>         |  1:STAR(3)
                                  SPACE can match 5 times out of 2147483647...
                                  failed...
   3 <   > <    _%n>         |  1:STAR(3)
                                  SPACE can match 4 times out of 2147483647...
                                  failed...
   4 <    > <   _%n>         |  1:STAR(3)
                                  SPACE can match 3 times out of 2147483647...
                                  failed...
   5 <     > <  _%n>         |  1:STAR(3)
                                  SPACE can match 2 times out of 2147483647...
                                  failed...
   6 <      > < _%n>         |  1:STAR(3)
                                  SPACE can match 1 times out of 2147483647...
                                  failed...
   8 <       _> <%n>         |  1:STAR(3)
                                  SPACE can match 1 times out of 2147483647...
   8 <       _> <%n>         |  3:  EXACT <\n>(5)
   9 <       _%n> <>         |  5:  END(0)
Match successful!
Matching REx "\s+\n" against "       _%n"
Matching stclass SPACE against "       _" (8 bytes)
   0 <> <       _%n>         |  1:PLUS(3)
                                  SPACE can match 7 times out of 2147483647...
                                  failed...

我知道如果换行符不存在,Perl 5.10+ 将立即使正则表达式失败(没有 运行ning 它)。我怀疑它正在使用换行符的位置来减少它所做的搜索量。对于上面的所有情况,它似乎巧妙地减少了涉及的回溯(通常 /\s*\n/ 对一串空格将花费指数时间)。任何人都可以深入了解为什么 \s+ 版本要快得多吗?

另请注意,\s*? 不提供任何加速。

我不确定正则表达式引擎的内部结构,但它似乎无法识别 \s+ 在某种程度上 相同\s\s*,因为在第二个中它匹配 space,然后尝试匹配越来越多的 space,而在第一个中,它立即断定没有匹配项。

使用 use re qw( Debug ); 的输出使用更短的字符串清楚地显示了这一点:

test_re.pl

#!/usr/bin/env perl
use re qw(debug);

$x=(" " x 10) . "_\n";
print '-'x50 . "\n";
$x =~ /\s+\n/;
print '-'x50 . "\n";
$x =~ /\s\s*\n/;
print '-'x50 . "\n";

输出

Compiling REx "\s+\n"
Final program:
    1: PLUS (3)
    2:   SPACE (0)
    3: EXACT <\n> (5)
    5: END (0)
floating "%n" at 1..2147483647 (checking floating) stclass SPACE plus minlen 2
Compiling REx "\s\s*\n"
Final program:
    1: SPACE (2)
    2: STAR (4)
    3:   SPACE (0)
    4: EXACT <\n> (6)
    6: END (0)
floating "%n" at 1..2147483647 (checking floating) stclass SPACE minlen 2
--------------------------------------------------
Guessing start of match in sv for REx "\s+\n" against "          _%n"
Found floating substr "%n" at offset 11...
    start_shift: 1 check_at: 11 s: 0 endpos: 11
Does not contradict STCLASS...
Guessed: match at offset 0
Matching REx "\s+\n" against "          _%n"
Matching stclass SPACE against "          _" (11 bytes)
   0 <> <          >         |  1:PLUS(3)
                                  SPACE can match 10 times out of 2147483647...
                                  failed...
Contradicts stclass... [regexec_flags]
Match failed
--------------------------------------------------
Guessing start of match in sv for REx "\s\s*\n" against "          _%n"
Found floating substr "%n" at offset 11...
    start_shift: 1 check_at: 11 s: 0 endpos: 11
Does not contradict STCLASS...
Guessed: match at offset 0
Matching REx "\s\s*\n" against "          _%n"
Matching stclass SPACE against "          _" (11 bytes)
   0 <> <          >         |  1:SPACE(2)
   1 < > <         _>        |  2:STAR(4)
                                  SPACE can match 9 times out of 2147483647...
                                  failed...
   1 < > <         _>        |  1:SPACE(2)
   2 <  > <        _>        |  2:STAR(4)
                                  SPACE can match 8 times out of 2147483647...
                                  failed...
   2 <  > <        _>        |  1:SPACE(2)
   3 <   > <       _%n>      |  2:STAR(4)
                                  SPACE can match 7 times out of 2147483647...
                                  failed...
   3 <   > <       _%n>      |  1:SPACE(2)
   4 <    > <      _%n>      |  2:STAR(4)
                                  SPACE can match 6 times out of 2147483647...
                                  failed...
   4 <    > <      _%n>      |  1:SPACE(2)
   5 <     > <     _%n>      |  2:STAR(4)
                                  SPACE can match 5 times out of 2147483647...
                                  failed...
   5 <     > <     _%n>      |  1:SPACE(2)
   6 <      > <    _%n>      |  2:STAR(4)
                                  SPACE can match 4 times out of 2147483647...
                                  failed...
   6 <      > <    _%n>      |  1:SPACE(2)
   7 <       > <   _%n>      |  2:STAR(4)
                                  SPACE can match 3 times out of 2147483647...
                                  failed...
   7 <       > <   _%n>      |  1:SPACE(2)
   8 <        > <  _%n>      |  2:STAR(4)
                                  SPACE can match 2 times out of 2147483647...
                                  failed...
   8 <        > <  _%n>      |  1:SPACE(2)
   9 <         > < _%n>      |  2:STAR(4)
                                  SPACE can match 1 times out of 2147483647...
                                  failed...
   9 <         > < _%n>      |  1:SPACE(2)
  10 <          > <_%n>      |  2:STAR(4)
                                  SPACE can match 0 times out of 2147483647...
                                  failed...
Contradicts stclass... [regexec_flags]
Match failed
--------------------------------------------------
Freeing REx: "\s+\n"
Freeing REx: "\s\s*\n"

首先,即使生成的正则表达式不会保持相同的含义,让我们将正则表达式简化为 \s*0\s+0 并使用 (" " x 4) . "_0" 作为输入。对于持怀疑态度的人,您可以看到 here 延迟仍然存在。

现在让我们考虑以下代码:

$x = (" " x 4) . "_ 0";
$x =~ /\s*0/; # The slow line 
$x =~ /\s+0/; # The fast line

use re debugcolor; 挖掘一下,我们得到以下输出:

Guessing start of match in sv for REx "\s*0" against "    _0"
Found floating substr "0" at offset 5...
start_shift: 0 check_at: 5 s: 0 endpos: 6 checked_upto: 0
Does not contradict STCLASS...
Guessed: match at offset 0
Matching REx "\s*0" against "    _0"
Matching stclass ANYOF_SYNTHETIC[\x09-\x0d 0\x85\xa0][{unicode_all}] against "    _0" (6 bytes)
   0 <    _0>|  1:STAR(3)
                                  POSIXD[\s] can match 4 times out of 2147483647...
                                  failed...
   1 <    _0>|  1:STAR(3)
                                  POSIXD[\s] can match 3 times out of 2147483647...
                                  failed...
   2 <    _0>|  1:STAR(3)
                                  POSIXD[\s] can match 2 times out of 2147483647...
                                  failed...
   3 <    _0>|  1:STAR(3)
                                  POSIXD[\s] can match 1 times out of 2147483647...
                                  failed...
   5 <    _0>|  1:STAR(3)
                                  POSIXD[\s] can match 0 times out of 2147483647...
   5 <    _0>|  3:  EXACT <0>(5)
   6 <    _0>|  5:  END(0)
Match successful!

-----------------------

Guessing start of match in sv for REx "\s+0" against "    _0"
Found floating substr "0" at offset 5...
start_shift: 1 check_at: 5 s: 0 endpos: 5 checked_upto: 0
Does not contradict STCLASS...
Guessed: match at offset 0
Matching REx "\s+0" against "    _0"
Matching stclass POSIXD[\s] against "    _" (5 bytes)
   0 <    _0>|  1:PLUS(3)
                                  POSIXD[\s] can match 4 times out of 2147483647...
                                  failed...
Contradicts stclass... [regexec_flags]
Match failed

Perl 似乎 be optimized for failure。它将首先查找常量字符串(仅消耗 O(N))。在这里,它会寻找 0 : Found floating substr "0" at offset 5...

然后它将从正则表达式的 变量 部分开始,分别是 \s*\s+,针对整个最小字符串进行检查:

Matching REx "\s*0" against "    _0"
Matching stclass ANYOF_SYNTHETIC[\x09-\x0d 0\x85\xa0][{unicode_all}] against "    _0" (6 bytes)
Matching REx "\s+0" against "    _0"
Matching stclass POSIXD[\s] against "    _" (5 bytes) # Only 5 bytes because there should be at least 1 "\s" char

之后它会寻找满足stclass要求的第一个位置,这里是位置0。

  • \s*0:
    • 从0开始,找4个空格失败;
    • 从1开始,找3个空格失败;
    • 从2开始,找2个空格失败;
    • 从3开始,找1个空格失败;
    • 从 4 开始,找到 0 个空格然后 不会失败;
    • 找到确切的 0
  • \s+0:
    • 从0开始,找4个空格失败。由于不匹配最小空格数,正则表达式立即失败。

如果你想享受 Perl 正则表达式优化的乐趣,可以考虑以下正则表达式 / *\n/ * \n。乍一看,它们看起来一样,具有相同的含义......但是如果你 运行 它反对 (" " x 40000) . "_\n" 第一个将检查所有可能性,而第二个将寻找 " \n"并立即失败。

在普通的、未优化的正则表达式引擎中,两种正则表达式都可能导致灾难性的回溯,因为它们需要在模式颠簸时重试模式。但是,在上面的示例中,第二个不会因 Perl 而失败,因为它已被优化为 find floating substr "0%n"


你可以在Jeff Atwood's blog上看到另一个例子。

另请注意,问题与 \s 考虑无关,而是使用 xx* 而不是 x+ 的任何模式,请参阅 example with 0s and also regex explosive quantifiers

对于这样一个较短的例子,行为是 "findable",但是如果你开始玩复杂的模式,就不容易发现,例如:Regular expression hangs program (100% CPU usage)

\s+\n 要求 \n 之前的字符是 SPACE.

根据 use re qw(debug) 编译确定它需要一个已知数量 space 的直字符串,直到子字符串 \n,首先在输入。然后它检查固定长度的 space-only 子字符串与输入的剩余部分,在到达 _ 时失败。无论输入有多少 space,它都是一种检查的可能性。 (当有更多 _\n 时,每个调试输出都同样直接失败。)

从这个角度来看,这是您几乎可以预料到的优化,它利用了相当具体的搜索模式并幸运地获得了此输入。除了与其他引擎进行比较时,其他引擎显然不进行此类分析。

\s*\n 情况并非如此。一旦找到 \n 并且前一个字符不是 space,搜索就不会失败,因为 \s* 不允许任何内容(零字符)。也没有固定长度的子串,在回溯游戏中。

当模式开头有一个“加号”节点(例如\s+)并且该节点无法匹配时,正则表达式引擎会向前跳到失败点并重试;另一方面,对于 \s*,引擎一次只前进一个字符。

Yves Orton 很好地解释了这个优化here:

The start class optimisation has two modes, "try every valid start position" (doevery) and "flip flop mode" (!doevery) where it trys only the first valid start position in a sequence.

Consider /(\d+)X/ and the string "123456Y", now we know that if we fail to match X after matching "123456" then we will also fail to match after "23456" (assuming no evil tricks are in place, which disable the optimisation anyway), so we know we can skip forward until the check /fails/ and only then start looking for a real match. This is flip-flop mode.

/\s+/触发触发器模式; /\s*//\s\s*//\s\s+/ 不会。 此优化不能应用于像 \s* 这样的“星形”节点,因为它们可以匹配零个字符,因此序列中某一点的失败并不表示同一序列中稍后的失败。


您可以在每个正则表达式的调试输出中看到这一点。我用 ^ 突出显示了跳过的字符。比较这个(一次跳过四个字符):

$ perl -Mre=Debug,MATCH -e'"123 456 789 x" =~ /\d+x/'
   ...
   0 <> <123 456 78>         |  1:PLUS(3)
                                  POSIXD[\d] can match 3 times out of 2147483647...
                                  failed...
   4 <123 > <456 789 x>      |  1:PLUS(3)
      ^^^^
                                  POSIXD[\d] can match 3 times out of 2147483647...
                                  failed...
   8 <23 456 > <789 x>       |  1:PLUS(3)
         ^^^^
                                  POSIXD[\d] can match 3 times out of 2147483647...
                                  failed...

至此(一次跳过一两个字符):

$ perl -Mre=Debug,MATCH -e'"123 456 789 x" =~ /\d*x/'
   ...
   0 <> <123 456 78>         |  1:STAR(3)
                                  POSIXD[\d] can match 3 times out of 2147483647...
                                  failed...
   1 <1> <23 456 789>        |  1:STAR(3)
      ^
                                  POSIXD[\d] can match 2 times out of 2147483647...
                                  failed...
   2 <12> <3 456 789 >       |  1:STAR(3)
       ^
                                  POSIXD[\d] can match 1 times out of 2147483647...
                                  failed...
   4 <123 > <456 789 x>      |  1:STAR(3)
        ^^
                                  POSIXD[\d] can match 3 times out of 2147483647...
                                  failed...
   5 <123 4> <56 789 x>      |  1:STAR(3)
          ^
                                  POSIXD[\d] can match 2 times out of 2147483647...
                                  failed...
   6 <23 45> <6 789 x>       |  1:STAR(3)
          ^
                                  POSIXD[\d] can match 1 times out of 2147483647...
                                  failed...
   8 <23 456 > <789 x>       |  1:STAR(3)
           ^^
                                  POSIXD[\d] can match 3 times out of 2147483647...
                                  failed...
   9 <23 456 7> <89 x>       |  1:STAR(3)
             ^
                                  POSIXD[\d] can match 2 times out of 2147483647...
                                  failed...
  10 <23 456 78> <9 x>       |  1:STAR(3)
              ^
                                  POSIXD[\d] can match 1 times out of 2147483647...
                                  failed...
  12 <23 456 789 > <x>       |  1:STAR(3)
               ^^
                                  POSIXD[\d] can match 0 times out of 2147483647...
  12 <23 456 789 > <x>       |  3:  EXACT <x>(5)
  13 <23 456 789 x> <>       |  5:  END(0)

请注意,优化不适用于 /\s\s+/,因为 \s+ 不在模式的开头。 /\s\s+/(逻辑上等同于/\s{2,}/)和/\s\s*/(逻辑上等同于/\s+/)可能可以被优化;在 perl5-porters 上询问是否值得付出努力可能是有意义的。


如果您有兴趣,可以通过在编译时在正则表达式上设置 PREGf_SKIP 标志来启用“触发器模式”。请参阅 5.24.0 源代码中 regcomp.c and line 1585 in regexec.c 中第 7344 和 7405 行的代码。