Perl multiline regex to separate comments within paragraphs
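The script below works, but it needs a kludge. By "kludge" I mean a line of code that makes the script do what I want, although I do not understand why the line is needed. Evidently I do not understand exactly what a multiline regex substitution ending in /mg does.
Is there a more elegant way to accomplish the task?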
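The script reads a file one paragraph at a time. It splits each paragraph into two subsets: $text and $cmnt. $text contains the left part of every line, that is, from the first column up to the first % if there is one, or up to the end of the line if there is not. $cmnt contains the rest.
Motivation: the files to be read are LaTeX markup, in which % announces the start of a comment. If we were reading Perl scripts, we could change the value of $breaker to #. Once $text is separated from $cmnt, one can match across lines, for example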
print "match" if ($text =~ /WOLF\s*DOG/s);
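See the line marked "kludge". Without that line, something funny happens after the last % in a record: if there are lines of $text (material not commented out by %) after the record's last commented line, those lines end up both in $text and at the end of $cmnt.
In the example below, this means that without the kludge, "cat lion" in record 2 is included both in $text, where it belongs, and in $cmnt.
(The kludge causes an unwanted % to appear at the end of every non-empty $cmnt. That is because the kludge-pasted-on % announces a final, fictitious, empty comment line.)
According to https://perldoc.perl.org/perlre.html#Modifiers, the /m regex modifier means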
Treat the string being matched against as multiple lines. That is, change "^" and "$" from matching the start of the string's first line and the end of its last line to matching the start and end of each line within the string.
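Therefore I expected the second match in
s/^([^$breaker]*)($breaker.*?)$/$2/mg
to start at the first %, extend to the end of the line, and stop there. So even without the kludge, there should be no "cat lion" in the $cmnt of record 2? But evidently there is, so I am misreading or missing some part of the documentation. I suspect it has to do with the /g regex modifier?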
#!/usr/bin/perl
use strict; use warnings;
my $count_record = 0;
my $breaker = '%';
$/ = ''; # one paragraph at a time
while(<>)
{
    $count_record++;
    my $text = $_;
    my $cmnt;
    s/[\n]*\z/$breaker/; # kludge
    s/[\n]*\z/\n/; # guarantee each record ends with exactly one newline==LF==linefeed
    if ($text =~ s/^([^$breaker]*)($breaker.*?)$/$1/mg) # non-greedy; keep the part before the breaker
    {
        $cmnt = $_;
        die "cmnt does not match" unless ($cmnt =~ s/^([^$breaker]*)($breaker.*?)$/$2/mg); # non-greedy; keep the comment part
    }
    else
    {
        $cmnt = '';
    }
    print "\nRECORD $count_record:\n";
    print "******** text==";
    print "\n|";
    print $text;
    print "|\n";
    print "******** cmnt==|";
    print $cmnt;
    print "|\n";
}
Sample file to run it on:
dog wolf % flea
DOG WOLF % FLEA
DOG WOLLLLLLF % FLLLLLLEA

% what was that?
cat lion

no comments in this line

%The last paragraph of this file is nothing but a single-line comment.
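The regex modifiers /m and /g are meant for a string that contains multiple lines (that is, embedded \n characters): /m lets ^ and $ match at the start and end of every line, and /g repeats the match over the whole string. Together they make the regex look at all the lines in the string.
Please study the following code; it should simplify the solution to your problem.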
use strict;
use warnings;
use feature 'say';
use Data::Dumper;
my $breaker = '%';
my @records = do { local $/ = ''; <DATA> };
for( @records ) {
    my %hash = ( /(.*?)$breaker(.*)/mg );
    next unless %hash;
    say Dumper(\%hash);
}
__DATA__
dog wolf % flea
DOG WOLF % FLEA
DOG WOLLLLLLF % FLLLLLLEA

% what was that?
cat lion

no comments in this line

%The last paragraph of this file is nothing but a single-line comment.
Output
$VAR1 = {
'DOG WOLF ' => ' FLEA ',
'dog wolf ' => ' flea ',
'DOG WOLLLLLLF ' => ' FLLLLLLEA '
};
$VAR1 = {
'' => ' what was that?'
};
$VAR1 = {
'' => 'The last paragraph of this file is nothing but a single-line comment.'
};
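A hash keyed on the text part would lose a line whenever two lines share the same left-hand side, so a line-by-line split may be easier to adapt. The following is only a minimal sketch under stated assumptions: the record is a hypothetical hard-coded string rather than file input, and the names $left and $right are made up for illustration.

```perl
use strict;
use warnings;

my $breaker = '%';
# Hypothetical sample record (one paragraph); not read from a file.
my $record = "% what was that?\ncat lion\n";

my ( $text, $cmnt ) = ( '', '' );
for my $line ( split /\n/, $record ) {
    # A limit of 2 splits only at the first breaker on the line.
    my ( $left, $right ) = split /\Q$breaker\E/, $line, 2;
    $left = '' unless defined $left;    # guard against an empty line
    $text .= "$left\n";
    # $right is undefined when the line contains no breaker at all.
    $cmnt .= "$breaker$right\n" if defined $right;
}

print "text==|$text|\n";
print "cmnt==|$cmnt|\n";
```

Because each line is handled on its own, a line without a breaker can never leak into $cmnt, and no /m or /g modifier is involved.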
You also have to remove the lines that do not contain a comment from $cmnt:
use feature qw(say);
use strict;
use warnings;
my $count_record = 0;
my $breaker = '%';
$/ = ''; # one paragraph at a time
while(<>)
{
    $count_record++;
    my $text = $_;
    my $cmnt;
    s/[\n]*\z/\n/; # guarantee each record ends with exactly one newline==LF==linefeed
    if ($text =~ s/^([^$breaker]*)($breaker.*?)$/$1/mg) # non-greedy; keep the part before the breaker
    {
        $cmnt = $_;
        $cmnt =~ s/^[^$breaker]*?$//mg; # blank out lines that contain no breaker
        die "cmnt does not match" unless ($cmnt =~ s/^([^$breaker]*)($breaker.*?)$/$2/mg); # non-greedy; keep the comment part
    }
    else
    {
        $cmnt = '';
    }
    print "\nRECORD $count_record:\n";
    print "******** text==";
    print "\n|";
    print $text;
    print "|\n";
    print "******** cmnt==|";
    print $cmnt;
    print "|\n";
}
Output:
RECORD 1:
******** text==
|dog wolf
DOG WOLF
DOG WOLLLLLLF
|
******** cmnt==|% flea
% FLEA
% FLLLLLLEA
|
RECORD 2:
******** text==
|
cat lion
|
******** cmnt==|% what was that?
|
RECORD 3:
******** text==
|no comments in this line
|
******** cmnt==||
RECORD 4:
******** text==
||
******** cmnt==|%The last paragraph of this file is nothing but a single-line comment.
|
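To see why the extra pass is needed, it may help to run the substitution on a tiny record. This is a sketch under assumptions: the record is a made-up string, and $first_pass and $cleaned are illustrative names. Two effects show up. A negated class such as [^%] also matches "\n", so a plain line before a comment line is swallowed into $1. A plain line after the last % never takes part in any match, so s///g leaves it in place.

```perl
use strict;
use warnings;

my $breaker = '%';
# Hypothetical record: a plain line, a comment line, another plain line.
my $record = "cat lion\n% first comment\ndog wolf\n";

# Same substitution as in the script, keeping only the comment part ($2).
( my $first_pass = $record ) =~ s/^([^$breaker]*)($breaker.*?)$/$2/mg;

# "cat lion\n" was consumed by [^%]* and is gone;
# "dog wolf" was never matched, so it is still there.
print "first pass==|$first_pass|\n";

# The extra pass blanks every line that contains no breaker.
( my $cleaned = $first_pass ) =~ s/^[^$breaker]*?$//mg;
print "cleaned==|$cleaned|\n";
```

Writing the class as [^$breaker\n] would stop it from crossing line boundaries, which is probably closer to what the /m documentation leads one to expect.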
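My main confusion was failing to distinguish between
- whether the record as a whole matches (here, a record may be a multi-line paragraph), and
- whether an individual line within the record matches.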
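The two questions can be asked separately, as in this small sketch. The three-line record and the variable names are assumptions made up for illustration.

```perl
use strict;
use warnings;

my $breaker = '%';
# Hypothetical three-line record; only lines 1 and 3 carry a comment.
my $record = "one % a\ntwo\nthree % b\n";

# Record-level question: does any line in the record contain a comment?
my $record_has_comment = $record =~ /$breaker/ ? 1 : 0;

# Line-level question: which individual lines contain one?
my @line_has_comment = map { /$breaker/ ? 1 : 0 } split /\n/, $record;

# With /g in list context, a pattern without capture groups returns one
# element per match, which here means one element per comment line.
my @comment_parts = $record =~ /$breaker.*$/mg;

print "record: $record_has_comment\n";
print "lines:  @line_has_comment\n";
print "parts:  ", scalar(@comment_parts), "\n";
```

A true return value from s///mg answers only the first question: it says that at least one match was found somewhere in the record, not that every line matched.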
The following script combines insights from the two answers provided by others and includes extensive explanation.
#!/usr/bin/perl
use strict; use warnings;
my $count_record = 0;
my $breaker = '%';
$/ = ''; # one paragraph at a time
while(<DATA>)
{
    $count_record++;
    my $text = $_;
    my $cmnt;
    s/[\n]*\z/\n/; # guarantee each record ends with exactly one newline==LF==linefeed
    print "RECORD $count_record:";
    print "\n|"; print $_; print "|\n";
    # https://perldoc.perl.org/perlre.html#Modifiers
    # the following regex:
    # ^ /m: ^==start of line, not of record
    # ([^$breaker]*) zero or more characters that are not $breaker
    # ($breaker.*?) non-greedy: the first instance of $breaker, followed by everything after $breaker
    # $ /m: $==end of line, not of record
    # /g: "globally match the pattern repeatedly in the string"
    if ($text =~ s/^([^$breaker]*)($breaker.*?)$/$1/mg)
    {
        $cmnt = $_;
        # In at least one line of this record, the pattern above has matched.
        # But this does not mean every line matches. There may be any number of
        # lines inside the record that do not match /$breaker/; for these lines,
        # in spite of /g, there will be no match, and thus the exclusion of $1 and printing only of $2,
        # in the substitution below, will not take place. Thus, those particular lines must be deleted from $cmnt.
        # Thus:
        $cmnt =~ s/^[^$breaker]*?$/\n/mg; # remove entire line if it does not match /$breaker/
        # recall that /m guarantees that ^ and $ match the start and end of the line, not of the record.
        die "code error: cmnt does not match this record" unless ($cmnt =~ s/^([^$breaker]*)($breaker.*?)$/$2/mg);
        if ( $text =~ /\S/ )
        {
            print "|text|==\n|$text|\n";
        }
        else
        {
            print "NO text found\n";
        }
        print "|cmnt|==\n|$cmnt|\n";
    }
    else
    {
        print "NO comment found\n";
    }
}
__DATA__
one dogs% one comment %d**n lies %statistics
two %two comment
thuh-ree
fower
fi-yiv % (he means 5)
SIX 66 % ¿666==antichrist?
seven % the seventh seal, the seven days
ate
niner
ten
As Douglass said to Lincoln ...
%Darryl Pinckney