Differences between empty and blank lines in regexps
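There are already several good discussions of regular expressions and empty lines on SO. If this question is a duplicate, I'll delete it.
Can anyone explain why this script outputs 5 3 4 5 4 3 instead of 4 3 4 4 4 3? When I run it in the debugger, $blank and $classyblank stay at "4" (which I believe is the correct value) until just before the print statement.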
my ( $blank, $nonblank, $non_nonblank,
     $classyblank, $classyspace, $blanketyblank ) = (0) x 6 ;  # "= 0" alone would initialize only $blank
while (<DATA>) {
    $blank++         if /\p{IsBlank}/ ;      # POSIXly blank - 4?
    $nonblank++      if /^\P{IsBlank}$/ ;    # POSIXly non-blank - 3
    $non_nonblank++  if not /\S/ ;           # perlishly not non-blank - 4
    $classyblank++   if /[[:blank:]]/ ;      # older(?) charclass blankness - 4?
    $classyspace++   if /^[[:space:]]$/ ;    # older(?) charclass whitespace - 4
    $blanketyblank++ if /^$/ ;               # perlishly *really empty* - 3
}
print join " ", $blank, $nonblank, $non_nonblank,
                $classyblank, $classyspace, $blanketyblank , "\n" ;
__DATA__
line above only has a linefeed this one is not blank because: words
this line is followed by a line with white space (you may need to add it)
then another blank line following this one
THE END :-\
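Is this related to the __DATA__ section, or am I misunderstanding POSIX regular expressions?
P.S. As noted in the comments on a timely post elsewhere, the "really empty" test (/^$/) can miss non-emptiness: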
perl -E 'my $string = "\n" . "foo\n\n" ; say "empty" if $string =~ /^$/ ;'
perl -E 'my $string = "\n" . "bar\n\n" ; say "empty" if $string =~ /\A\z/ ;'
perl -E 'my $string = "\n" . "baz\n\n" ; say "empty" if $string =~ /\S/ ;'
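To make the difference between the anchors concrete, here is a minimal sketch comparing /^$/ with /\A\z/ on a few short strings. The helper names `matches_caret_dollar` and `matches_A_z` are made up for this illustration and do not appear in the original post.

```perl
use strict;
use warnings;

# Hypothetical helpers, named for this illustration only.
# /^$/ : without /m, $ matches at the very end OR just before a final newline.
sub matches_caret_dollar { my ($s) = @_; return $s =~ /^$/   ? 1 : 0 }
# /\A\z/ : \z matches only at the very end, so only "" can match.
sub matches_A_z          { my ($s) = @_; return $s =~ /\A\z/ ? 1 : 0 }

for my $s ( "", "\n", " \n", "foo\n" ) {
    ( my $shown = $s ) =~ s/\n/\\n/g;    # make newlines visible in the report
    printf "%-8s /^\$/=%d  /\\A\\z/=%d\n",
        qq{"$shown"}, matches_caret_dollar($s), matches_A_z($s);
}
```

Because `$` tolerates one trailing newline, /^$/ accepts both "" and "\n", while /\A\z/ accepts only the truly empty string. That is why /^$/ works as an "empty line" test on lines read with their newline still attached.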
/\p{IsBlank}/
doesn't check for an empty string. \p
matches a character that has the specified Unicode property.
$ unichars '\p{IsBlank}' | cat
---- U+0009 CHARACTER TABULATION
---- U+0020 SPACE
---- U+00A0 NO-BREAK SPACE
---- U+1680 OGHAM SPACE MARK
---- U+2000 EN QUAD
---- U+2001 EM QUAD
---- U+2002 EN SPACE
---- U+2003 EM SPACE
---- U+2004 THREE-PER-EM SPACE
---- U+2005 FOUR-PER-EM SPACE
---- U+2006 SIX-PER-EM SPACE
---- U+2007 FIGURE SPACE
---- U+2008 PUNCTUATION SPACE
---- U+2009 THIN SPACE
---- U+200A HAIR SPACE
---- U+202F NARROW NO-BREAK SPACE
---- U+205F MEDIUM MATHEMATICAL SPACE
---- U+3000 IDEOGRAPHIC SPACE
It matches " \n"
because SPACE has the IsBlank property.
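The point above can be sketched as follows: an unanchored /\p{IsBlank}/ succeeds on any line that contains a blank anywhere, including ordinary lines of space-separated words, which would account for the counter coming out higher than expected. The helper names `has_blank` and `only_blank` and the sample lines are assumptions for this illustration, not part of the original script.

```perl
use strict;
use warnings;

# Hypothetical helpers, named for this illustration only.
# True if the line contains at least one blank (horizontal whitespace) character.
sub has_blank  { my ($line) = @_; return $line =~ /\p{IsBlank}/ ? 1 : 0 }
# True if the line consists of one or more blanks, plus an optional trailing newline.
sub only_blank { my ($line) = @_; return $line =~ /\A\p{IsBlank}+\n?\z/ ? 1 : 0 }

my @lines = ( "\n", "two words\n", " \n", "\t \n", "word\n" );
for my $line (@lines) {
    ( my $shown = $line ) =~ s/\n/\\n/g;    # make newlines visible
    $shown =~ s/\t/\\t/g;                   # make tabs visible
    printf "%-14s has_blank=%d only_blank=%d\n",
        qq{"$shown"}, has_blank($line), only_blank($line);
}
```

Note that "\n" on its own fails both tests, since a line feed is not a blank character, while "two words\n" passes `has_blank` only because of the space between the words.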
/[[:blank:]]/
doesn't check for an empty string. [...]
matches a character that is a member of the specified class.
$ unichars '[[:blank:]]' | cat
---- U+0009 CHARACTER TABULATION
---- U+0020 SPACE
---- U+00A0 NO-BREAK SPACE
---- U+1680 OGHAM SPACE MARK
---- U+2000 EN QUAD
---- U+2001 EM QUAD
---- U+2002 EN SPACE
---- U+2003 EM SPACE
---- U+2004 THREE-PER-EM SPACE
---- U+2005 FOUR-PER-EM SPACE
---- U+2006 SIX-PER-EM SPACE
---- U+2007 FIGURE SPACE
---- U+2008 PUNCTUATION SPACE
---- U+2009 THIN SPACE
---- U+200A HAIR SPACE
---- U+202F NARROW NO-BREAK SPACE
---- U+205F MEDIUM MATHEMATICAL SPACE
---- U+3000 IDEOGRAPHIC SPACE
It matches " \n"
because SPACE is a member of the [:blank:]
POSIX character class, and is therefore also a member of the [[:blank:]]
character class.
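As a related sketch, the snippet below contrasts [[:blank:]] with [[:space:]] on single characters. It suggests why /^[[:space:]]$/ in the question also counts lines that hold only a line feed: the newline itself is a member of [:space:], though not of [:blank:]. The helper names `is_blank_char` and `is_space_char` are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helpers, named for this illustration only.
# Each tests whether a one-character string belongs to the given POSIX class.
sub is_blank_char { my ($c) = @_; return $c =~ /\A[[:blank:]]\z/ ? 1 : 0 }
sub is_space_char { my ($c) = @_; return $c =~ /\A[[:space:]]\z/ ? 1 : 0 }

my %samples = ( space => " ", tab => "\t", newline => "\n", letter => "a" );
for my $name ( sort keys %samples ) {
    printf "%-8s blank=%d space=%d\n",
        $name, is_blank_char( $samples{$name} ), is_space_char( $samples{$name} );
}
```

The patterns use \A and \z rather than ^ and $ so that the newline sample is tested as a character in its own right and not absorbed by the `$` anchor.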