Why does a Perl regex with the /mg modifiers match past end-of-line?
This is related to another question, but it focuses on just one point of regex syntax.
According to perlre: Modifiers, the /m regex modifier means
Treat the string being matched against as multiple lines. That is, change "^" and "$" from matching the start of the string's first line and the end of its last line to matching the start and end of each line within the string.
So, with the following code:
#!/usr/bin/perl
use strict; use warnings;
$/ = ''; # one paragraph at a time
while (<DATA>)
{
    print "original:\n";
    print;
    s/^([^B]*)(B.*?)$/>$1|$2</mg;
    print "\n\nafter substitution:\n";
    print;
}
__DATA__
aaaaBaBaBB
bbbbBbadbe
cccc
dddd
eeeeBeeeee
ffff
gggg
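I expected the regex engine to behave as follows.
Line 1: match, because it finds both patterns between the start and the end of the line.
Line 2: ditto.
Line 3: no match. The first regex group (in the first set of parentheses) matches. But when we reach the end of the line, we are still looking for the B that begins the second regex group. Since we specified /m, the end of this line means we have reached $ without satisfying the whole pattern.
Line 4: we start a new line, so we hit a new ^. Again, no match.
Line 5: match. Both regex groups sit between the start and the end of the line, i.e. between ^ and $, exactly as specified.
I would therefore expect to see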
>aaaa|BaBaBB<
>bbbb|Bbadbe<
cccc
dddd
>eeee|Beeeee<
ffff
gggg
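Instead, it seems that on line 3 the engine ignores the end-of-line and searches past it.
It treats lines 3-5 as a single line, which would satisfy the regex only if we were willing to suddenly ignore that $ means end-of-line. This is what we see: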
>aaaa|BaBaBB<
>bbbb|Bbadbe<
>cccc
dddd
eeee|Beeeee<
ffff
gggg
How is this consistent with the /m specification? Where is this behavior documented?
> perl --version
This is perl 5, version 18, subversion 4 (v5.18.4) built for darwin-thread-multi-2level
(with 2 registered patches, see perl -V for more detail)
[^B]* will match as many non-B characters as possible, including newlines.
Replacing it with [^B\n]* will probably do what you want.
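A minimal sketch of that fix, under two assumptions: the question's data is held in a hypothetical variable $text instead of being read from DATA, and the replacement was meant to be >$1|$2<, as the expected output suggests:

```perl
use strict;
use warnings;

# Same seven lines as the question's __DATA__ section, as one string.
# ($text and $fixed are illustrative names, not from the original post.)
my $text = join( "\n",
    qw(aaaaBaBaBB bbbbBbadbe cccc dddd eeeeBeeeee ffff gggg) ) . "\n";

# [^B\n]* cannot cross a newline, so every match stays on one line.
( my $fixed = $text ) =~ s/^([^B\n]*)(B.*?)$/>$1|$2</mg;

print $fixed;
```

With this input, the lines cccc, dddd, ffff and gggg should come out unchanged, and only the three lines that contain a B are rewritten.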
When a regexp can match a string in several different ways, we can use
the principles above to predict which way the regexp will match:
Principle 0: Taken as a whole, any regexp will be matched at the
earliest possible position in the string.
Principle 1: In an alternation a|b|c... , the leftmost alternative
that allows a match for the whole regexp will be the one used.
Principle 2: The maximal matching quantifiers '?' , '*' , '+' and
{n,m} will in general match as much of the string as possible while
still allowing the whole regexp to match.
Principle 3: If there are two or more elements in a regexp, the
leftmost greedy quantifier, if any, will match as much of the string
as possible while still allowing the whole regexp to match. The next
leftmost greedy quantifier, if any, will try to match as much of the
string remaining available to it as possible, while still allowing the
whole regexp to match. And so on, until all the regexp elements are
satisfied.
As we have seen above, Principle 0 overrides the others. The regexp
will be matched as early as possible, with the other principles
determining how the regexp matches at that earliest character
position.
[...]
We can modify principle 3 above to take into account non-greedy
quantifiers:
Principle 3: If there are two or more elements in a regexp, the
leftmost greedy (non-greedy) quantifier, if any, will match as much
(little) of the string as possible while still allowing the whole
regexp to match. The next leftmost greedy (non-greedy) quantifier, if
any, will try to match as much (little) of the string remaining
available to it as possible, while still allowing the whole regexp to
match. And so on, until all the regexp elements are satisfied.
So for this case
my $str = 'cccc
dddd
eeeeBeeeee
ffff
gggg';
$str =~ s/^([^B]*)(B.*?)$/>$1|$2</m;
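we apply Principle 0 and Principle 3, so the match starts at the beginning of $str (position 0). Following Principle 3, we begin with the leftmost element:
^([^B]*)
It will match "as much of the string as possible while still allowing the whole regexp to match", which means it matches from the start of the string up to the first B, newlines included. The engine then considers the next element:
(B.*?)$
Still following Principle 3, it will match "as little of the string as possible while still allowing the whole regexp to match", so it matches from B up to the first newline it finds.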
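The trace above can be checked by looking at the two captures directly. This is only a sketch; the variable names $first, $second, $start and $shown are illustrative:

```perl
use strict;
use warnings;

my $str = "cccc\ndddd\neeeeBeeeee\nffff\ngggg";

my ( $first, $second, $start );
if ( $str =~ /^([^B]*)(B.*?)$/m ) {
    # Save the captures and the match offset before running another regex.
    ( $first, $second, $start ) = ( $1, $2, $-[0] );
}

# Make the newlines inside group 1 visible when printing.
( my $shown = $first ) =~ s/\n/\\n/g;

print "match starts at offset $start\n";
print "group 1: '$shown'\n";
print "group 2: '$second'\n";
```

Group 1 should span the first two lines plus eeee, because [^B] accepts newlines, while group 2 stops at the end of the third line.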
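The Perl documentation on the /m and /s modifiers and on character classes could benefit from connecting the dots and adding more examples, which I will attempt here.
Regardless of the /m and /s modifiers, a character class can match a newline. That is why [^B]* matches \n and, in your case, extends across several newlines. In fact, you can write a character class that explicitly includes ([\n]) or excludes ([^\n]) newlines. Besides the newline escape (\n), there is also a non-newline escape (\N).
The /s modifier changes only the behavior of . (it allows . to match a newline). It does not change the behavior of any other character class.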
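A small sketch of these rules, testing whether a single newline is matched by ., by [^B] and by \N, with and without /s. The labels and the @checks array are illustrative, and \N needs Perl 5.12 or later:

```perl
use strict;
use warnings;

my $nl = "\n";

# Each entry: [ label, 'yes' or 'no' ], for one construct against a lone newline.
my @checks = (
    [ '.     (no /s)', $nl =~ /\A.\z/     ? 'yes' : 'no' ],
    [ '.     (/s)',    $nl =~ /\A.\z/s    ? 'yes' : 'no' ],
    [ '[^B]  (no /s)', $nl =~ /\A[^B]\z/  ? 'yes' : 'no' ],
    [ '[^B]  (/s)',    $nl =~ /\A[^B]\z/s ? 'yes' : 'no' ],
    [ '\N    (no /s)', $nl =~ /\A\N\z/    ? 'yes' : 'no' ],
    [ '\N    (/s)',    $nl =~ /\A\N\z/s   ? 'yes' : 'no' ],
);

printf "%-14s matches newline: %s\n", @$_ for @checks;
```

Only the . rows should differ between the two modifier settings; [^B] accepts the newline either way and \N rejects it either way.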
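Using the /m and /s modifiers on their own can give noticeably different behavior, as the examples further down show. This behavior is documented and therefore predictable, but it is not always intuitive. I usually use these modifiers together (/ms) and find that this makes my code more intuitive and maintainable. That way, I do not have to think about newline-matching behavior every time. In fact, out of habit, I use the /xms modifiers in most regexes in my own code; /x makes the code more readable and maintainable (Conway (2005), pp. 236-241; Vromans (2006)).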
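As an illustration of that habit, here is one way the question's substitution might look with /xms. This is a sketch rather than the original author's code: it assumes the [^B\n] fix, the >$1|$2< replacement, and a shortened input held in a hypothetical $text:

```perl
use strict;
use warnings;

my $text = "cccc\ndddd\neeeeBeeeee\nffff\n";

# /x allows whitespace and comments inside the pattern.
( my $marked = $text ) =~ s{
    ^               # start of a line, because of /m
    ( [^B\n]* )     # group 1: non-B characters, never a newline
    ( B .*? )       # group 2: from the first B, as little as possible
    $               # end of that line, because of /m
}{>$1|$2<}xmsg;

print $marked;
```

Even though /s lets . match a newline here, the non-greedy .*? still stops at the first point where $ can match, so only the eeeeBeeeee line should be rewritten.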
References:
perlrecharclass - Perl Regular Expression Character Classes: Backslash sequences
\N Match a character that isn't a newline.
perlreref - Perl Regular Expressions Reference: CHARACTER CLASSES
\N A non newline (when not followed by '{NAME}';;
not valid in a character class; equivalent to [^\n]; it's
like '.' without /s modifier)
perlre - Perl regular expressions: Modifiers
m
Treat the string being matched against as multiple lines. That is, change "^" and "$" from matching the start of the string's first line and the end of its last line to matching the start and end of each line within the string.
s
Treat the string as single line. That is, change "." to match any character whatsoever, even a newline, which normally it would not match.
Used together, as /ms, they let the "." match any character whatsoever, while still allowing "^" and "$" to match, respectively, just after and just before newlines within the string.
(Note that it does not say that /m or /s changes any character class other than '.', so we can infer from this that the others are left unchanged.)
Using the /xms modifiers:
- Always use the /x flag.
- Always use the /m flag.
- Always use the /s flag.
(Conway (2005), pp. 236-241; Vromans (2006))
Damian Conway (2005) Perl Best Practices: Standards and Styles for Developing Maintainable Code. O'Reilly Media. https://www.amazon.com/Perl-Best-Practices-Developing-Maintainable/dp/0596001738/
Perl Best Practices: Reference Guide: https://www.squirrel.nl/pub/PBP_refguide-1.02.00.pdf
Examples:
use strict;
use warnings;
use feature qw( say );
my @strings = (
    "abcd\n",       # single-line string
    "abcd\nabcd\n", # multi-line string (first string repeated twice)
    "abXd\nabcd\n", # multi-line string, same as above, but missing first 'c'
    "abcd\nabXd\n", # multi-line string, same as above, but missing second 'c'
);
my @regexes = ( '^([^c]*)(c.*?)$' );
foreach my $string ( @strings ) {
    foreach my $regex ( @regexes ) {
        my @matches;
        say "\n###";
        # Escape the sigils so the variable names are printed literally.
        say "# \$string='$string'; \$regex='$regex'";
        @matches = map { "'$_'" } $string =~ /$regex/;
        say "regex_modifiers=''; \@matches=@matches;";
        @matches = map { "'$_'" } $string =~ /$regex/m;
        say "regex_modifiers='m'; \@matches=@matches;";
        @matches = map { "'$_'" } $string =~ /$regex/s;
        say "regex_modifiers='s'; \@matches=@matches;";
        @matches = map { "'$_'" } $string =~ /$regex/ms;
        say "regex_modifiers='ms'; \@matches=@matches;";
    }
}
Output:
###
# $string='abcd
'; $regex='^([^c]*)(c.*?)$'
regex_modifiers=''; @matches='ab' 'cd'; # ok
regex_modifiers='m'; @matches='ab' 'cd'; # /m, /s modifiers do not matter in single-line string
regex_modifiers='s'; @matches='ab' 'cd'; # /m, /s modifiers do not matter in single-line string
regex_modifiers='ms'; @matches='ab' 'cd'; # /m, /s modifiers do not matter in single-line string
###
# $string='abcd
abcd
'; $regex='^([^c]*)(c.*?)$'
regex_modifiers=''; @matches=; # '.' does not match newline, cannot reach end of string
regex_modifiers='m'; @matches='ab' 'cd'; # '$' matches first newline
regex_modifiers='s'; @matches='ab' 'cd
abcd'; # '.' matches newline, so the end of string is reached
# and '$' matches it.
regex_modifiers='ms'; @matches='ab' 'cd'; # non-greedy '.*?' causes '$' to match the first newline
###
# $string='abXd
abcd
'; $regex='^([^c]*)(c.*?)$'
regex_modifiers=''; @matches='abXd
ab' 'cd'; # [^c] matches newline, /m, /s modifiers do not matter
regex_modifiers='m'; @matches='abXd
ab' 'cd'; # [^c] matches newline, /m, /s modifiers do not matter
regex_modifiers='s'; @matches='abXd
ab' 'cd'; # [^c] matches newline, /m, /s modifiers do not matter
regex_modifiers='ms'; @matches='abXd
ab' 'cd'; # [^c] matches newline, /m, /s modifiers do not matter
###
# $string='abcd
abXd
'; $regex='^([^c]*)(c.*?)$'
regex_modifiers=''; @matches=; # '.' does not match newline, cannot reach end of string
regex_modifiers='m'; @matches='ab' 'cd'; # '$' matches first newline
regex_modifiers='s'; @matches='ab' 'cd
abXd'; # '.' matches newline, so the end of string is reached
# and '$' matches it.
regex_modifiers='ms'; @matches='ab' 'cd'; # non-greedy '.*?' causes '$' to match the first newline