How can I partition a line into code and comment using a single regex in Perl?


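The script below does this, but it takes several lines of code, two hashes, and a compound regex, i.e. two regexes joined by |. The compound form seems necessary because the first clause does not match a line that is pure code with no comment. Is there a more elegant way, using a single regex and fewer steps? In particular, is there a way to dispense with the %match hash and somehow populate the %+ hash with all three variables in one step?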

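#!/usr/bin/env perl
use strict; use warnings;
print join('', 'perl ', $^V, "\n",);
use Data::Dumper qw(Dumper); $Data::Dumper::Sortkeys = 1;

my $count=0;
while(<DATA>) {
    chomp;
    $count++;
    print "$count\t";
    # Defaults for all three parts; overwritten below from %+
    my %match=(a1code=>'', a2boundary=>'', a3cmnt=>'');
    print "|$_|\n";
    # Compound regex, reconstructed from the description above (an assumption):
    # clause 1 needs an unescaped %, clause 2 catches pure-code lines
    if(/^(?<a1code>.*?)(?<a2boundary>(?<!\\)%)(?<a3cmnt>.*)$|^(?<a1code>.*)$/) {
        print "from regex:\n";
        print Dumper \%+;
    } else {
        die "no match? coding error, should never get here";
    }
    if(scalar keys %+ != scalar keys %match) {
        $match{$_}=$+{$_} for keys %+;
        print "from multiple lines of code:\n";
        print Dumper \%match;
    }
    print "------------------------------------------\n";
}
__DATA__
This is 100\% text and below you find an empty line.

abba 5\% %comment 9\% %Borgia
%all comment
%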


Output:

perl v5.34.0
1   |This is 100\% text and below you find an empty line.   |
from regex:
$VAR1 = {
          'a1code' => 'This is 100\\% text and below you find an empty line.   '
        };
from multiple lines of code:
$VAR1 = {
          'a1code' => 'This is 100\\% text and below you find an empty line.   ',
          'a2boundary' => '',
          'a3cmnt' => ''
        };
------------------------------------------
2   ||
from regex:
$VAR1 = {
          'a1code' => ''
        };
from multiple lines of code:
$VAR1 = {
          'a1code' => '',
          'a2boundary' => '',
          'a3cmnt' => ''
        };
------------------------------------------
3   |abba 5\% %comment 9\% %Borgia|
from regex:
$VAR1 = {
          'a1code' => 'abba 5\\% ',
          'a2boundary' => '%',
          'a3cmnt' => 'comment 9\\% %Borgia'
        };
------------------------------------------
4   |%all comment|
from regex:
$VAR1 = {
          'a1code' => '',
          'a2boundary' => '%',
          'a3cmnt' => 'all comment'
        };
------------------------------------------
5   |%|
from regex:
$VAR1 = {
          'a1code' => '',
          'a2boundary' => '%',
          'a3cmnt' => ''
        };
------------------------------------------


my ($a1code, $a2boundary, $a3cmnt) =
   /
      ^
      (  (?: [^\\%]+ | \\. )* )
      (?: (%) (.*) )?
   /sx;

It does not treat the % in abc\\%def as escaped, because the preceding \ is itself escaped.


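$a1code is always a string. It can be zero characters long (when the input is the empty string or when % is the first character), or it can be the entire input string (when there is no unescaped %).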

However, $a2boundary and $a3cmnt are defined only if there is an unescaped %. In other words, $a2boundary is equivalent to defined($a3cmnt) ? '%' : undef.
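If empty strings are wanted instead of undef, as in the question's %match hash, the two optional captures can be defaulted with the defined-or operator. The following is only a sketch under that assumption: the helper name partition_line is hypothetical, and it needs Perl 5.10 or later for //.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the answer): returns all three keys,
# with '' instead of undef when the line has no unescaped %.
sub partition_line {
    my ($line) = @_;
    my ($code, $boundary, $cmnt) = $line =~ /
        ^
        ( (?: [^\\%]+ | \\. )* )   # code: plain chars or backslash-escaped pairs
        (?: (%) (.*) )?            # optional boundary and comment
    /sx;
    return (
        a1code     => $code,
        a2boundary => $boundary // '',   # undef when there is no unescaped %
        a3cmnt     => $cmnt     // '',
    );
}

my %match = partition_line('This is 100\% text');
print "$_ => '$match{$_}'\n" for sort keys %match;
```

This keeps a single regex and fills the hash in one statement, at the cost of losing the distinction between "no comment" and "empty comment".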

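Explanation: [^\\%]+ matches unescaped characters other than \ and %. \\. matches an escaped character. So (?: [^\\%]+ | \\. )* gives us the prefix before the first unescaped %, or the whole string if there is no unescaped %.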
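The behaviour of the alternation can be checked by running the pattern over a few inputs. This is an illustrative sketch rather than part of the answer: split_line is a hypothetical wrapper, and the input list mixes the question's sample lines with the abc\\%def case.

```perl
use strict;
use warnings;

my $re = qr/
    ^
    ( (?: [^\\%]+ | \\. )* )
    (?: (%) (.*) )?
/sx;

# Hypothetical wrapper: returns (code, boundary, comment);
# boundary and comment are undef when there is no unescaped %.
sub split_line {
    my ($line) = @_;
    return $line =~ $re;
}

for my $line (
    'abba 5\% %comment 9\% %Borgia',
    'abc\\\\%def',              # the string abc\\%def (two backslashes)
    'This is 100\% text',
    '%all comment',
    '',
) {
    my ($code, $boundary, $cmnt) = split_line($line);
    printf "%-32s code=[%s] comment=[%s]\n",
        "[$line]", $code, $cmnt // 'undef';
}
```

For abc\\%def the two backslashes are consumed together by \\. as one escaped pair, so the % that follows is taken as the comment boundary.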


What about a case like this\\%string, where the backslash before the percent sign is itself escaped?

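Consider something like this, which, instead of trying to use a regex to split the string into three groups, uses one to find where the string should be split and then uses substr to do the actual splitting: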

#!/usr/bin/env perl
use strict;
use warnings;
use Data::Dumper;

sub splitter {
    my $line = shift;
    if ($line =~ /
       # Match either
       (?<!\\)% # A % not preceded by a backslash
       | # or
       (?<=[^\\])(?:\\\\)+\K% # Any even number of backslashes followed by a %
                 /x) {
        return (substr($line, 0, $-[0]), '%', substr($line, $+[0]));
    } else {
        return ($line, '', '');
    }
}

while (<DATA>) {
    # Assign to an array instead of individual scalars for demonstration purposes
    my @vals = splitter $_;
    print Dumper(\@vals);
}

__DATA__
This is 100\% text and below you find an empty line.

abba 5\% %comment 9\% %Borgia
%all comment
a tricky\%test % case
another \\%one % to mess with you