How can I partition a line into code and comment using a single regex in Perl?

Question

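I want to read through a text file and partition each line into the following three variables. Each variable must be defined, although it may be the empty string.

$a1code: all characters up to, but not including, the first non-escaped percent sign. If there is no non-escaped percent sign, this is the entire line. As the example below shows, this can also be the empty string on a line where the following two variables are non-empty.
$a2boundary: the first non-escaped percent sign, if there is one.
$a3cmnt: any characters after the first non-escaped percent sign, if there is one.

The script below accomplishes this, but it takes several lines of code, two hashes, and a composite regex, that is, 2 regexes combined with |. The composite seems necessary because the first clause,

(?<a1code>.*?)(?<a2boundary>(?<!\\)%)(?<a3cmnt>.*)

does not match a line that is pure code with no comment. Is there a more elegant way, using a single regex and fewer steps? In particular, is there a way to dispense with the %match hash and somehow fill the %+ hash with all three variables in a single step?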

#!/usr/bin/env perl
use strict; use warnings;
print join('', 'perl ', $^V, "\n",);
use Data::Dumper qw(Dumper); $Data::Dumper::Sortkeys = 1;

my $count=0;
while(<DATA>)
{
    $count++;
    print "$count\t";
    chomp;
    my %match=(
        a2boundary=>'',
        a3cmnt=>'',
    );
    print "|$_|\n";
    if($_=~/^(?<a1code>.*?)(?<a2boundary>(?<!\\)%)(?<a3cmnt>.*)|(?<a1code>.*)/)
    {
        print "from regex:\n";
        print Dumper \%+;
        %match=(%match,%+,);
    }
    else
    {
        die "no match? coding error, should never get here";
    }
    if(scalar keys %+ != scalar keys %match)
    {
        print "from multiple lines of code:\n";
        print Dumper \%match;
    }
    print "------------------------------------------\n";
}

__DATA__
This is 100\% text and below you find an empty line.

abba 5\% %comment 9\% %Borgia
%all comment
%

Result:

perl v5.34.0
1   |This is 100\% text and below you find an empty line.   |
from regex:
$VAR1 = {
          'a1code' => 'This is 100\% text and below you find an empty line.   '
        };
from multiple lines of code:
$VAR1 = {
          'a1code' => 'This is 100\% text and below you find an empty line.   ',
          'a2boundary' => '',
          'a3cmnt' => ''
        };
------------------------------------------
2   ||
from regex:
$VAR1 = {
          'a1code' => ''
        };
from multiple lines of code:
$VAR1 = {
          'a1code' => '',
          'a2boundary' => '',
          'a3cmnt' => ''
        };
------------------------------------------
3   |abba 5\% %comment 9\% %Borgia|
from regex:
$VAR1 = {
          'a1code' => 'abba 5\% ',
          'a2boundary' => '%',
          'a3cmnt' => 'comment 9\% %Borgia'
        };
------------------------------------------
4   |%all comment|
from regex:
$VAR1 = {
          'a1code' => '',
          'a2boundary' => '%',
          'a3cmnt' => 'all comment'
        };
------------------------------------------
5   |%|
from regex:
$VAR1 = {
          'a1code' => '',
          'a2boundary' => '%',
          'a3cmnt' => ''
        };
------------------------------------------

Answer 1

You can use the following:

my ($a1code, $a2boundary, $a3cmnt) =
   /
      ^
      (  (?: [^\\%]+ | \\. )* )
      (?: (%) (.*) )?
      \z
   /sx;

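It does not treat the % in abc\\%def as escaped, because the preceding \ is itself escaped.

It does so without backtracking, and it always matches.

$a1code is always a string. It can be zero characters long (when the input is the empty string or when % is the first character), or the entire input string (when there is no unescaped %).

However, $a2boundary and $a3cmnt are only defined if there is an unescaped %. In other words, $a2boundary is equivalent to defined($a3cmnt) ? '%' : undef.

Explanation: [^\\%]+ matches unescaped characters other than \ and %. \\. matches an escaped character. So (?: [^\\%]+ | \\. )* gives us the prefix, or the whole string if there is no unescaped %.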
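
Since the question asks for all three variables to be defined, the undefined captures still need a default. Here is a minimal sketch of one way to do that, assuming Perl 5.10 or later for the // operator. The name partition_line is made up for illustration and is not part of the answer. One more assumption: a chomped line that ends in a lone, unpaired backslash does not appear to match the pattern at all, so the sketch returns such a line as pure code.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper (not from the answer): applies the regex shown above
# and replaces undefined captures with '' so all three values are defined.
sub partition_line {
    my ($line) = @_;
    my ($code, $boundary, $comment) = $line =~ /
        ^
        ( (?: [^\\%]+ | \\. )* )   # code: plain characters or backslash-escaped pairs
        (?: (%) (.*) )?            # optional: first unescaped % and the rest
        \z
    /sx;

    # Assumption: a line the pattern cannot match (for example one ending
    # in a dangling backslash) is treated as code only.
    return ($line, '', '') unless defined $code;

    return map { $_ // '' } ($code, $boundary, $comment);
}

# Sample lines taken from the question's DATA section.
for my $line ('This is 100\% text', '', 'abba 5\% %comment 9\% %Borgia', '%all comment', '%') {
    my ($code, $boundary, $comment) = partition_line($line);
    print "|$code|$boundary|$comment|\n";
}
```

The defaulting is kept outside the regex on purpose: the pattern stays exactly as in the answer, and the map only touches the two optional captures when they are undef.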

Answer 2

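What about cases like this\\%string, where the backslash before the percent sign is itself escaped?

Consider something like this, which, instead of trying to use a regular expression to split the string into three groups, uses one to find where the string should be split and then uses substr to do the actual splitting: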

#!/usr/bin/env perl
use strict;
use warnings;
use Data::Dumper;

sub splitter {
    my $line = shift;
    if ($line =~ /
       # Match either
       (?<!\\)% # A % not preceded by a backslash
       | # or
       (?<=[^\\])(?:\\\\)+\K% # Any even number of backslashes followed by a %
                 /x) {
        return (substr($line, 0, $-[0]), '%', substr($line, $+[0]));        
    } else {
        return ($line, '', '');
    }
}

while (<DATA>) {
    chomp;
    # Assign to an array instead of individual scalars for demonstration purposes
    my @vals = splitter $_;
    print Dumper(\@vals);
}   

__DATA__
This is 100\% text and below you find an empty line.

abba 5\% %comment 9\% %Borgia
%all comment
%
a tricky\\%test % case
another \\\%one % to mess with you
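
The splitter relies on the offsets Perl records in @- and @+ after a successful match, and on \K moving the reported start of the match past the backslashes. A minimal sketch of just that mechanism follows. The helper name percent_offsets and the sample string are made up for illustration, and only the second alternative of the regex is used.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper: returns the start and end offsets of a % that
# follows an even number of backslashes, or an empty list if none is found.
sub percent_offsets {
    my ($line) = @_;
    return unless $line =~ /(?<=[^\\])(?:\\\\)+\K%/;
    # \K drops everything matched so far from the reported match, so
    # $-[0] is the offset of the % itself, not of the first backslash.
    return ($-[0], $+[0]);
}

my $line = 'ab\\\\%cd';    # the string is: ab\\%cd
my ($start, $end) = percent_offsets($line);
print "match starts at $start and ends at $end\n";
print "before: ", substr($line, 0, $start), "\n";
print "after:  ", substr($line, $end), "\n";
```

Without \K, $-[0] would point at the first backslash and the backslashes would be cut out of the code part.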
