Perl 正则表达式在（跳过）N 个字符之后开始匹配搜索

Question

可能是基本问题，但找不到答案：我有一个模式匹配正则表达式，我在一个大缓冲区中再次寻找 'G'、187 个字符和 'G'。这适用于 $s =~ m/(G.{187}G)/s。有时我想在搜索中添加 N 字节的偏移量（我不想从缓冲区的位置 0 开始）。我现在可以 $s =~ m/.{N}(G.{187}G)/s 但这听起来效率不高，因为我不想解析所有开始缓冲区（它可能很大）。我尝试使用 \G 但无法正确处理。

谢谢

Answer 1

据我了解，效率上不会有什么区别。根据我对有限自动机（正则表达式使用的）的理解，无论如何，每个字符都必须循环。（否则，它怎么知道你的比赛什么时候开始？）任何正则表达式（假设它首先被编译）将在线性时间内执行。

如果你想跳过特定数量的字符，首先从指定位置获取一个子字符串，然后将你的正则表达式应用于该子字符串可能是个不错的主意。

Answer 2

基准测试：跳过超过几百个字符 pos 比 .{N}

快得多

您可以使用 pos 设置“最后一场比赛”（之后）的位置。例如

$_ = q(abcdef); 

while (/(.)/g)  
{ 
    say "Got  at pos ", pos();        

    if (++$cnt == 1) { pos = 4 } 
}

打印

Got a at pos 1
Got e at pos 5
Got f at pos 6

您也可以根据需要在比赛前进行设置

$_ = q(abcdef); 

pos = 4; 

while (/(.)/g) { say "Got  at pos ", pos() }

与

Got e at pos 5
Got f at pos 6

什么回答了你的直接问题。 \G 对此没有任何影响。

但我不确定这是否比您拥有的更好，因为 . 是一个非常简单的“匹配”，对于 .{N} 正则表达式引擎优化器可能会索引到 N+1 搜索以下模式。

事实证明，这对要跳过的序列的长度很敏感（对于非常长的序列来说至关重要）。由于它有数百个字符，基准测试支持 .{N}，在我的测试中高达 25%。但是从大约 1_000 个跳过的字符开始，结果迅速反转。运行在笔记本电脑上使用 v5.16。

use warnings;
use strict;
use feature 'say';
use Benchmark qw(cmpthese);

my $N = 10_000; # 100; 
my $rep = 1;
# Insert  at the beginning a phrase that also matches, for a test
my $str = 'G'.'a'x$rep.'G' . 'b'x($N-2-$rep) . 'G'.'X'x$rep.'G';

sub posit {
    my ($str, $N, $rep) = @_;
    pos ($str) = $N;
    my ($res) = $str =~ /(G.{$rep}G)/g;
    return $res;
}   

sub match {
    my ($str, $N, $rep) = @_;
    my ($res) = $str =~ /.{$N}(G.{$rep}G)/;
    return $res;
}   

say "posit: ", posit($str,$N,$rep), " and match: ", match($str,$N,$rep);
say "Benchmark skipping first $N positions\n";

cmpthese(-10, {
    posit => sub { my $res = posit ($str, $N, $rep) },
    match => sub { my $res = match ($str, $N, $rep) },
});

请注意，我们必须在正则表达式中包含 /g 才能使 pos 生效。结果

posit: GXG and match: GXG
Benchmark skipping first 10000 positions

          Rate match posit
match 125252/s    --  -70%
posit 414886/s  231%    --

有趣的是，使用硬编码 GXG 字符串在所有情况下都会显着改变结果（并且总是有利于获胜者）。但是，这个参数可能会被传递和插入，所以我仍然使用 $rep。正如预期的那样，对其进行更改没有任何效果，因此我将其保持简短。

最后，问题指定了一个“大缓冲区”。当我将上面的 $N 设置为 100_000 时，我得到

Quantifier in {,} bigger than 32766 in regex; marked by ... 100000}(G.{1}G)/ at ...

所以对于非常大的缓冲区，您必须使用pos().

注意pos也有副作用，影响后续操作。请参阅文档。

Answer 3

我测试了两种方法的速度：

use feature qw(say);
use strict;
use warnings;

use Benchmark qw(:all);
use Getopt::Long qw(GetOptions);

GetOptions( "case=i" => \my $case ) or die("Error in command line arguments\n");

my $num_cases = 5;
sub case1 { return (187, 2000) };
sub case2 { return (5, 10000) };
sub case3 { return (5, 20000) };
sub case4 { return (10000, 20000) };
sub case5 { return (5, 40000) };

my @cases = ( $case );
if ( !defined $case ) {
    @cases = 1..$num_cases;
}
for my $case ( @cases ) {
    run_case( $case );
}

sub run_case {
    my ( $case ) = @_;

    my $case_coderef = \&{"case" . $case};
    my ( $M, $N ) = $case_coderef->();
    say "Running case $case: $M = $M, $N = $N";

    my $prefix = 'A' x $N;
    my $middle = 'B' x $M;
    my $match_str = 'G' . $middle . 'G';

    my $str = $prefix . $match_str;

    my %methods = map {; "method$_" => \&{"method" . $_} } 1..6;

    for my $meth (keys %methods) {
        my $res = eval {
            $methods{$meth}->($str, $M, $N)
        };
        if ( $@ ) {
            print "$@";
            say "Skipping method '$meth'..";
            delete $methods{$meth};
            next;
        }   
        die "Method '$meth' failed.\n" if $res ne $match_str;
    }

    my %code = map { $_ => eval ('sub { $methods{' . $_ . '}->($str, $M, $N) }') } sort keys %methods; 

    cmpthese(-5, \%code );
}

sub method1 {
    my ( $str, $M, $N ) = @_;
    $str =~ m/.{$N}(G.{$M}G)/;
    return ;
}

sub method2 {
    my ( $str, $M, $N ) = @_;
    pos( $str ) = $N;
    $str =~ m/\G(G.{$M}G)/;
    return ;
}

sub method3 {
    my ( $str, $M, $N ) = @_;
    pos( $str ) = $N;
    $str =~ m/(G.{$M}G)/g;
    return ;
}

sub method4 {
    my ( $str, $M, $N ) = @_;
    $str =~ m/.{$N}(G.{$M}G)/s;
    return ;
}

sub method5 {
    my ( $str, $M, $N ) = @_;
    pos( $str ) = $N;
    $str =~ m/\G(G.{$M}G)/s;
    return ;
}

sub method6 {
    my ( $str, $M, $N ) = @_;
    pos( $str ) = $N;
    $str =~ m/(G.{$M}G)/gs;
    return ;
}

输出:

Running case 1: $M = 187, $N = 2000
             Rate method1 method3 method2 method6 method5 method4
method1  696485/s      --    -37%    -39%    -44%    -46%    -57%
method3 1112322/s     60%      --     -3%    -10%    -13%    -32%
method2 1146132/s     65%      3%      --     -7%    -10%    -30%
method6 1234678/s     77%     11%      8%      --     -3%    -24%
method5 1278898/s     84%     15%     12%      4%      --    -21%
method4 1629148/s    134%     46%     42%     32%     27%      --
Running case 2: $M = 5, $N = 10000
             Rate method1 method2 method6 method5 method3 method4
method1  226784/s      --    -72%    -72%    -72%    -72%    -78%
method2  801020/s    253%      --     -0%     -2%     -3%    -23%
method6  802386/s    254%      0%      --     -1%     -3%    -23%
method5  814132/s    259%      2%      1%      --     -1%    -22%
method3  823653/s    263%      3%      3%      1%      --    -21%
method4 1046605/s    361%     31%     30%     29%     27%      --
Running case 3: $M = 5, $N = 20000
             Rate method1 method3 method2 method6 method5 method4
method1  122763/s      --    -90%    -90%    -90%    -90%    -92%
method3 1252858/s    921%      --     -0%     -1%     -2%    -23%
method2 1258330/s    925%      0%      --     -1%     -1%    -22%
method6 1265165/s    931%      1%      1%      --     -1%    -22%
method5 1274309/s    938%      2%      1%      1%      --    -21%
method4 1622943/s   1222%     30%     29%     28%     27%      --
Running case 4: $M = 10000, $N = 20000
             Rate method1 method3 method2 method6 method5 method4
method1   90687/s      --    -62%    -62%    -93%    -93%    -94%
method3  236835/s    161%      --     -1%    -81%    -81%    -86%
method2  238334/s    163%      1%      --    -81%    -81%    -85%
method6 1236025/s   1263%    422%    419%      --     -3%    -25%
method5 1270943/s   1301%    437%    433%      3%      --    -22%
method4 1638548/s   1707%    592%    588%     33%     29%      --
Running case 5: $M = 5, $N = 40000
Quantifier in {,} bigger than 32766 in regex; marked by <-- HERE in m/.{ <-- HERE 40000}(G.{5}G)/ at ./p.pl line 83.
Skipping method 'method4'..
Quantifier in {,} bigger than 32766 in regex; marked by <-- HERE in m/.{ <-- HERE 40000}(G.{5}G)/ at ./p.pl line 63.
Skipping method 'method1'..
             Rate method2 method3 method6 method5
method2 1253528/s      --     -1%     -1%     -2%
method3 1260746/s      1%      --     -0%     -1%
method6 1263378/s      1%      0%      --     -1%
method5 1278718/s      2%      1%      1%      --

我运行在我的笔记本电脑上使用 Intel(R) Core(TM) i7-7500U CPU @ 2.70GHz 运行 Ubuntu 16.10，结果显示使用 s 修饰符可以显着加快 method1 的速度（与 method4 相比）。但是当$N超过32766时，method1或method4就不能用了。

Perl 正则表达式在（跳过）N 个字符之后开始匹配搜索

Perl regex start match search after (skip) N char

regex

perl

skip

offset