Perl multiline regex to separate comments within paragraphs

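The script below works, but it requires a kludge. By "kludge" I mean a line of code that makes the script do what I want, although I do not understand why the line is needed. Evidently I do not understand exactly what a multiline regex substitution ending in /mg is doing.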

Is there a more elegant way to accomplish the task?

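The script reads a file one paragraph at a time. It splits each paragraph into two subsets: $text and $cmnt. $text contains the left part of each line, i.e. from the first column up to the first %, if there is one, or up to the end of the line if there is not. $cmnt contains the rest.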

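Motivation: the files to be read are LaTeX markup, in which % announces the start of a comment. If we were reading a Perl script instead, we could change the value of $breaker to #. Once $text has been separated from $cmnt, one can perform matches that span lines, such as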

print "match" if ($text =~ /WOLF\s*DOG/s);

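See the line marked "kludge". Without that line, something funny happens after the final % in the record. If there are lines of $text (material not commented out by %) after the last commented line of the record, those lines end up both at the end of $cmnt and in $text.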

In the example below, this means that without the kludge, in record 2, "cat lion" is included both in $text, where it belongs, and in $cmnt.

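(The kludge causes an unnecessary % to appear at the end of every non-empty $cmnt. This is because the % pasted on by the kludge announces a final, fictitious, empty comment line.)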

According to https://perldoc.perl.org/perlre.html#Modifiers, the /m regex modifier means

Treat the string being matched against as multiple lines. That is, change "^" and "$" from matching the start of the string's first line and the end of its last line to matching the start and end of each line within the string.

Therefore, I would expect the second capture group ($2) in

s/^([^$breaker]*)($breaker.*?)$/$2/mg

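to start at the first %, extend to the end of the line, and stop there. So even without the kludge, $cmnt for record 2 should not include "cat lion"? But it evidently does, so I am misreading or missing some part of the documentation. I suspect it has something to do with the /g regex modifier?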

#!/usr/bin/perl
use strict; use warnings;
my $count_record = 0;
my $breaker = '%';
$/ = ''; # one paragraph at a time
while(<>)
{
    $count_record++; 
    my $text = $_; 
    my $cmnt;
    s/[\n]*\z/$breaker/; # kludge
    s/[\n]*\z/\n/; # guarantee each record ends with exactly one newline==LF==linefeed
    if ($text =~ s/^([^$breaker]*)($breaker.*?)$/$1/mg)  # non-greedy; keep only the text part ($1)
    {
        $cmnt    = $_;
        die "cmnt does not match" unless ($cmnt =~ s/^([^$breaker]*)($breaker.*?)$/$2/mg);  # non-greedy; keep only the comment part ($2)
    }
    }
    else
    {
        $cmnt    = ''; 
    }
    print "\nRECORD $count_record:\n";
    print "******** text==";
    print "\n|";
    print $text;
    print "|\n";
    print "******** cmnt==|";
    print $cmnt;
    print "|\n";
}

Example file to run it on:

dog wolf % flea 
DOG WOLF % FLEA 
DOG WOLLLLLLF % FLLLLLLEA 


% what was that?
 cat lion


no comments in this line




%The last paragraph of this file is nothing but a single-line comment.

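The regex modifiers /mg assume that the string they are applied to contains multiple lines (that is, it has \n characters inside it). Together they tell the regex to look at all lines in the string: /m lets ^ and $ match at the start and end of every line, and /g repeats the match over the whole string.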
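To make the effect of the two modifiers concrete, here is a minimal sketch. The sub name keep_comments and the sample strings are made up for illustration; the pattern is assumed to be the one from the question with $2 as the replacement. It suggests why " cat lion" survives in $cmnt: with /mg the substitution is attempted at every line start, but a line that contains no % simply does not match, so it stays in the string unchanged.

```perl
use strict;
use warnings;

# Hypothetical helper: keep only the comment part ($2) of each matching line.
sub keep_comments {
    my ($record, $breaker) = @_;
    # /m: ^ and $ match at every line boundary; /g: repeat over the whole string.
    (my $cmnt = $record) =~ s/^([^$breaker]*)($breaker.*?)$/$2/mg;
    return $cmnt;
}

# The line " cat lion" has no %, so nothing matches there and it is left alone.
print keep_comments("% what was that?\n cat lion\n", '%');

# With the kludge, a trailing % gives that last line something to match.
print keep_comments("% what was that?\n cat lion%", '%'), "\n";
```

Note that [^$breaker]* can also run across newlines, which is why uncommented lines before a comment line are swallowed into $1 and disappear from $cmnt, while uncommented lines after the last % are not.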

Please study the following code; it should simplify the solution to your problem.

use strict;
use warnings;
use feature 'say';

use Data::Dumper;

my $breaker = '%';
my @records = do { local $/ = ''; <DATA> };

for( @records ) {
    my %hash = ( /(.*?)$breaker(.*)/mg );
    next unless %hash;
    say Dumper(\%hash);
}

__DATA__
dog wolf % flea 
DOG WOLF % FLEA 
DOG WOLLLLLLF % FLLLLLLEA 


% what was that?
 cat lion


no comments in this line




%The last paragraph of this file is nothing but a single-line comment.

Output

$VAR1 = {
          'DOG WOLF ' => ' FLEA ',
          'dog wolf ' => ' flea ',
          'DOG WOLLLLLLF ' => ' FLLLLLLEA '
        };

$VAR1 = {
          '' => ' what was that?'
        };

$VAR1 = {
          '' => 'The last paragraph of this file is nothing but a single-line comment.'
        };
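One caveat with the hash: keys must be unique and hash order is not preserved, so two lines with the same text part (for example two comment-only lines, which both have the key '') would overwrite each other. A minimal sketch that collects ordered pairs instead; the sub name comment_pairs and the sample record are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: return [text, comment] pairs for every line
# of $record that contains $breaker, in the order the lines appear.
sub comment_pairs {
    my ($record, $breaker) = @_;
    my @pairs;
    # In scalar context, /g steps through the matches one at a time.
    while ($record =~ /^(.*?)\Q$breaker\E(.*)$/mg) {
        push @pairs, [$1, $2];
    }
    return @pairs;
}

my @pairs = comment_pairs("% first\n% second\ncat % flea\n", '%');
printf "text=[%s] comment=[%s]\n", @$_ for @pairs;
```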

You also have to remove the lines that do not contain a comment from $cmnt:
use feature qw(say);
use strict;
use warnings;

my $count_record = 0;
my $breaker = '%';
$/ = ''; # one paragraph at a time
while(<>)
{
    $count_record++;
    my $text = $_;
    my $cmnt;
    s/[\n]*\z/\n/; # guarantee each record ends with exactly one newline==LF==linefeed
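    if ($text =~ s/^([^$breaker]*)($breaker.*?)$/$1/mg)  # non-greedy; keep only the text part ($1)
    {
        $cmnt    = $_;
        $cmnt =~ s/^[^$breaker]*?$//mg;  # empty every line that contains no $breaker
        die "cmnt does not match" unless ($cmnt =~ s/^([^$breaker]*)($breaker.*?)$/$2/mg);  # non-greedy; keep only the comment part ($2)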
    }
    else
    {
        $cmnt    = '';
    }
    print "\nRECORD $count_record:\n";
    print "******** text==";
    print "\n|";
    print $text;
    print "|\n";
    print "******** cmnt==|";
    print $cmnt;
    print "|\n";
}

Output:

RECORD 1:
******** text==
|dog wolf 
DOG WOLF 
DOG WOLLLLLLF 

|
******** cmnt==|% flea 
% FLEA 
% FLLLLLLEA 
|

RECORD 2:
******** text==
|
 cat lion

|
******** cmnt==|% what was that?

|

RECORD 3:
******** text==
|no comments in this line

|
******** cmnt==||

RECORD 4:
******** text==
||
******** cmnt==|%The last paragraph of this file is nothing but a single-line comment.
|
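As a possible alternative that needs neither the kludge nor the extra cleanup pass, a record can be split into lines first, and each line split at its first $breaker. This is only a sketch: split_record is a hypothetical name, blank lines at the end of the record are dropped, and an escaped \% in LaTeX is not handled.

```perl
use strict;
use warnings;

# Hypothetical helper: split one record into its text part and comment part.
# Assumption: every $breaker starts a comment (an escaped \% is not handled).
sub split_record {
    my ($record, $breaker) = @_;
    my (@text, @cmnt);
    for my $line (split /\n/, $record) {
        # Limit 2: split only at the first $breaker on the line.
        my ($left, $right) = split /\Q$breaker\E/, $line, 2;
        push @text, $left // '';
        push @cmnt, $breaker . $right if defined $right;
    }
    return (join("\n", @text) . "\n", join('', map { "$_\n" } @cmnt));
}

my ($text, $cmnt) = split_record("% what was that?\n cat lion\n", '%');
print "text==|$text|\ncmnt==|$cmnt|\n";
```

Because each line is handled on its own, a line without $breaker contributes to $text only and never leaks into $cmnt.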

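My main confusion was a failure to distinguish between

  1. whether the entire record matches (here, a record may be a multi-line paragraph), and
  2. whether a line inside the record matches.

The following script incorporates insights from both of the answers that others provided, and includes extensive explanation.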
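The difference between the two questions can be shown in a few lines. This is a minimal sketch with a made-up sample record; the variable names are for illustration only.

```perl
use strict;
use warnings;

my $record = "one % first comment\ntwo\nthree % second comment\n";

# 1. Record-level test: true if % occurs anywhere in the record.
my $record_matches = $record =~ /%/ ? 1 : 0;

# 2. Line-level test: which individual lines contain %?
my @lines    = split /\n/, $record;
my @matching = grep { /%/ } @lines;

print "record matches: $record_matches\n";                        # record matches: 1
printf "%d of %d lines match\n", scalar @matching, scalar @lines; # 2 of 3 lines match
```

A successful s///mg on the record answers only the first question; the line "two" above is never touched by it.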

#!/usr/bin/perl
use strict; use warnings;
my $count_record = 0;
my $breaker = '%';

$/ = ''; # one paragraph at a time
while(<DATA>)
{
    $count_record++; 
    my $text = $_; 
    my $cmnt;
    s/[\n]*\z/\n/; # guarantee each record ends with exactly one newline==LF==linefeed
    print "RECORD $count_record:";
    print "\n|"; print $_; print "|\n";
    # https://perldoc.perl.org/perlre.html#Modifiers
    # the following regex:
    # ^                     /m: ^==start of line, not of record
    # ([^$breaker]*)        zero or more characters that are not $breaker
    # ($breaker.*?)         non-greedy: the first instance of $breaker, followed by everything after $breaker
    # $                     /m: $==end   of line, not of record
    #                       /g: "globally match the pattern repeatedly in the string"
    if ($text =~ s/^([^$breaker]*)($breaker.*?)$/$1/mg)
    {
        $cmnt    = $_; 
        # In at least one line of this record, the pattern above has matched.
        # But this does not mean every line matches. There may be any number of
        # lines inside the record that do not match /$breaker/; for these lines,
        # in spite of /g, there will be no match, and thus the exclusion of $1 and printing only of $2,
        # in the substitution below, will not take place. Thus, those particular lines must be deleted from $cmnt.
        # Thus:
        $cmnt =~ s/^[^$breaker]*?$/\n/mg; # remove entire line if it does not match /$breaker/
        # recall that /m guarantees that ^ and $ match the start and end of the line, not of the record.
        die "code error: cmnt does not match this record" unless ($cmnt =~ s/^([^$breaker]*)($breaker.*?)$/$2/mg);
        if ( $text =~ /\S/ )
        {
            print "|text|==\n|$text|\n";
        }
        else
        {
            print "NO text found\n";
        }
        print "|cmnt|==\n|$cmnt|\n";
    }
    else
    {
        print "NO comment found\n";
    }
}

__DATA__
one dogs% one comment %d**n lies %statistics
two %two comment
thuh-ree
fower
fi-yiv % (he means 5)
SIX 66 % ¿666==antichrist?
seven % the seventh seal, the seven days
ate
niner
ten

As Douglass said to Lincoln ... 

%Darryl Pinckney