How can I partition a line into code and comment using a single regex in Perl?

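I want to read through a text file and partition each line into the following three variables: the code (a1code), the boundary (a2boundary, the first unescaped %), and the comment (a3cmnt). Each variable must be defined, although it may be equal to the empty string.

The script below accomplishes this, but it takes several lines of code, two hashes, and a composite regex, i.e. two regexes joined by |. The composite seems necessary because the first clause

(?<a1code>.*?)(?<a2boundary>(?<!\\)%)(?<a3cmnt>.*)

does not match a line that is pure code with no comment. Is there a more elegant way, using a single regex and fewer steps? In particular, is there a way to dispense with the %match hash and somehow fill the %+ hash with all three variables in one step?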

#!/usr/bin/env perl
use strict; use warnings;
print join('', 'perl ', $^V, "\n",);
use Data::Dumper qw(Dumper); $Data::Dumper::Sortkeys = 1;

my $count=0;
while(<DATA>)
{
    $count++;
    print "$count\t";
    chomp;
    my %match=(
        a2boundary=>'',
        a3cmnt=>'',
    );
    print "|$_|\n";
    if($_=~/^(?<a1code>.*?)(?<a2boundary>(?<!\\)%)(?<a3cmnt>.*)|(?<a1code>.*)/)
    {
        print "from regex:\n";
        print Dumper \%+;
        %match=(%match,%+,);
    }
    else
    {
        die "no match? coding error, should never get here";
    }
    if(scalar keys %+ != scalar keys %match)
    {
        print "from multiple lines of code:\n";
        print Dumper \%match;
    }
    print "------------------------------------------\n";
}

__DATA__
This is 100\% text and below you find an empty line.

abba 5\% %comment 9\% %Borgia
%all comment
%

Result:

perl v5.34.0
1   |This is 100\% text and below you find an empty line.   |
from regex:
$VAR1 = {
          'a1code' => 'This is 100\% text and below you find an empty line.   '
        };
from multiple lines of code:
$VAR1 = {
          'a1code' => 'This is 100\% text and below you find an empty line.   ',
          'a2boundary' => '',
          'a3cmnt' => ''
        };
------------------------------------------
2   ||
from regex:
$VAR1 = {
          'a1code' => ''
        };
from multiple lines of code:
$VAR1 = {
          'a1code' => '',
          'a2boundary' => '',
          'a3cmnt' => ''
        };
------------------------------------------
3   |abba 5\% %comment 9\% %Borgia|
from regex:
$VAR1 = {
          'a1code' => 'abba 5\% ',
          'a2boundary' => '%',
          'a3cmnt' => 'comment 9\% %Borgia'
        };
------------------------------------------
4   |%all comment|
from regex:
$VAR1 = {
          'a1code' => '',
          'a2boundary' => '%',
          'a3cmnt' => 'all comment'
        };
------------------------------------------
5   |%|
from regex:
$VAR1 = {
          'a1code' => '',
          'a2boundary' => '%',
          'a3cmnt' => ''
        };
------------------------------------------

You can use the following:

my ($a1code, $a2boundary, $a3cmnt) =
   /
      ^
      (  (?: [^\\%]+ | \\. )* )
      (?: (%) (.*) )?
      \z
   /sx;

It does not treat the % in abc\\%def as escaped, since the preceding \ is itself escaped.

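For a successful match it needs no backtracking. It matches every input except one that ends in a lone (unpaired) backslash, where \\. has nothing left to consume.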

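On a match, $a1code is always a string. It can be zero characters long (when the input is the empty string or when % is the first character), or it can be the entire input string (when there is no unescaped %).

However, $a2boundary and $a3cmnt are defined only if there is an unescaped %. In other words, $a2boundary is equivalent to defined($a3cmnt) ? '%' : undef.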
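
Because the question asks for all three variables to be defined, one possible way to normalize the undefined captures is the defined-or operator // (available since Perl 5.10). The following is only a minimal sketch; the helper name split_line is hypothetical and does not appear in the question or the answers.

```perl
use strict;
use warnings;

# Hypothetical helper: returns (code, boundary, comment), all defined.
sub split_line {
    my ($line) = @_;
    my ($code, $boundary, $comment) = $line =~ /
        ^
        ( (?: [^\\%]+ | \\. )* )   # unescaped text and escaped characters
        (?: (%) (.*) )?            # optional: first unescaped % and the rest
        \z
    /sx;
    # The optional group leaves its captures undef when there is no
    # unescaped %. If the regex fails altogether (a line ending in a lone
    # backslash), all three are undef and the whole line is treated as code.
    return ($code // $line, $boundary // '', $comment // '');
}

for my $line ('abba 5\% %comment 9\% %Borgia', 'no comment here', '%') {
    print join('|', split_line($line)), "\n";
}
```

Treating a line that fails to match as pure code is an assumption made here for illustration; dying, as the question's script does, would be an equally reasonable choice.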

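Explanation: [^\\%]+ matches unescaped characters other than \ and %. \\. matches an escaped character. So (?: [^\\%]+ | \\. )* gives us the prefix, or the entire string if there is no unescaped %.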
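
To come back to the original request of filling %+ with all three names in one step, the same prefix pattern can be combined with named groups that are never optional. This is a sketch under assumptions, not something taken from the answer above: the helper name split_named is hypothetical, and the boundary is written as %? so that the group always participates in the match and is therefore always defined.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a hash holding all three named captures.
sub split_named {
    my ($line) = @_;
    $line =~ /
        ^
        (?<a1code>     (?: [^\\%]+ | \\. )* )  # code: unescaped text, escaped chars
        (?<a2boundary> %? )                    # boundary: '%' or empty
        (?<a3cmnt>     .* )                    # comment: whatever remains
        \z
    /sx or die "no match; should not happen";
    my %parts = %+;    # copy the named captures before they go out of scope
    return %parts;
}

for my $line ('abba 5\% %comment 9\% %Borgia', 'pure code', '') {
    my %parts = split_named($line);
    print join(', ', map { "$_=[$parts{$_}]" } sort keys %parts), "\n";
}
```

Every group can match the empty string, so %+ should always hold three keys. One side effect of this design: on a line that ends in a lone backslash, that backslash lands in a3cmnt while a2boundary stays empty.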

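What about a case like this\\%string, where the backslash before the percent sign is itself escaped?

Consider something like this: instead of trying to use a regular expression to split the string into three groups, it uses one only to find where the string should be split, and then substr does the actual splitting: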

#!/usr/bin/env perl
use strict;
use warnings;
use Data::Dumper;

sub splitter {
    my $line = shift;
    if ($line =~ /
       # Match either
       (?<!\\)% # A % not preceded by a backslash
       | # or
       (?<=[^\\])(?:\\\\)+\K% # Any even number of backslashes followed by a %
                 /x) {
        return (substr($line, 0, $-[0]), '%', substr($line, $+[0]));        
    } else {
        return ($line, '', '');
    }
}

while (<DATA>) {
    chomp;
    # Assign to an array instead of individual scalars for demonstration purposes
    my @vals = splitter $_;
    print Dumper(\@vals);
}   

__DATA__
This is 100\% text and below you find an empty line.

abba 5\% %comment 9\% %Borgia
%all comment
%
a tricky\%test % case
another \\%one % to mess with you
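
One detail of the lookbehind approach: (?<=[^\\]) needs a non-backslash character in front of the run of backslashes, so a line that begins with \\% is not split. A possible alternative, sketched here with the hypothetical name splitter_anchored, keeps the find-then-substr idea but walks the line from the start, consuming plain characters and backslash pairs until the first unescaped %. The possessive quantifiers (Perl 5.10 or later) are an added assumption to keep lines without a comment from backtracking.

```perl
use strict;
use warnings;

# Hypothetical variant of splitter: consume plain text and escape pairs
# from the start of the line, then require a literal %.
sub splitter_anchored {
    my ($line) = @_;
    if ($line =~ /^((?:[^\\%]++|\\.)*+)%/s) {
        my $pos = length $1;    # offset of the first unescaped %
        return (substr($line, 0, $pos), '%', substr($line, $pos + 1));
    }
    return ($line, '', '');     # no unescaped %: the whole line is code
}

# In single quotes, \\\\ denotes two literal backslashes.
for my $line ('a tricky\%test % case', '\\\\%starts with an escaped backslash') {
    print join('|', splitter_anchored($line)), "\n";
}
```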