Regex matching over multiple lines
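I am currently trying to do some basic cleanup on a PDF so that I can convert it to ePub for my e-reader. All I am doing is removing page numbers (easy) and footnotes (hard, so far). Basically, I want an expression that finds the tag pattern at the start of each footnote (" <br>" followed by a newline, a number, and a letter or quotation mark) and selects that pattern and everything after it until it reaches the <hr/> tag at the start of the next page. Here is some sample text: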
The phantoms, for so they then seemed, were flitting on the other side of <br>
the deck, and, with a noiseless celerity, were casting loose the tackles and bands <br>
of the boat which swung there. This boat had always been deemed one of the spare boats <br>
technically called the captain’s, on account of its hanging from the starboard quarter.<br>
The figure that now stood by its bows was tall and swart, with one white tooth <br>
evilly protruding from its steel-like lips. <br>
 <br>
1 "Hardly" had they pulled out from under the ship’s lee, when a <br>
fourth keel, coming from the windward side, pulled round under the stern, <br>
and showed the five strangers <br>
127 <br>
<br>
<hr/>
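Since all the footnotes are formatted this way, I want to select every group of lines that begins with " <br>" (note the space) and ends with the <hr/> tag. This is my first real attempt at using regular expressions, so I have cobbled together a few attempted solutions: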
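1. \s<br>\n\d+\s[a-zA-Z“].* : This correctly selects the <br> and the first line of the footnote, but stops at the line break. \s<br>\n\d+\s[a-zA-Z“].*\n.*\n.*\n.*\n.*\n.* selects the correct number of lines, but it obviously only works for footnotes with exactly three lines of text.
2. \s<br>\n\d+\s[a-zA-Z“]((.*\n)*)<hr\/> starts at the right place for the first footnote, but then ends up selecting the rest of the entire document. My reading of this expression is "start with <br>, then a number followed by a space followed by a letter or quotation mark, then select everything, newlines included, until <hr/> is reached."
3. \s<br>\n\d+\s[a-zA-Z“]((?:.*\r?\n?)*)<hr\/>\n Same idea as (2), with the same result, although I am not familiar enough with regular expressions to fully understand what this one is doing.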
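Basically, my problem is that my expressions either exclude newlines (and ignore the end pattern), or include every newline and return the entire text (apparently still ignoring the end pattern).
How can I return only the text between the patterns, newlines included?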
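Your attempts are very close. In the first one, you probably need to set the flag that lets . match newlines; by default it does not. In your second one, you need to make the quantifiers non-greedy by adding ? after them. Otherwise the greedy .* (and the * around the group) matches as much as it can, so the match only ends at the last <hr/> in the document.
It should be something like this (with multi-line mode enabled, so that ^ matches at the start of each line):
/^ <br>\n\d+\s[a-zA-Z"“](.*?\n)*?<hr\/>/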
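To make the greedy versus non-greedy difference concrete, here is a minimal sketch in Python (chosen only because it is easy to run standalone). The two-page sample text and the variable names are invented for illustration; re.MULTILINE is what lets ^ match at the start of each line.

```python
import re

# Hypothetical sample: two pages, each ending with a footnote block and <hr/>.
text = (
    "Body of page one. <br>\n"
    " <br>\n"
    '1 "First" footnote line one <br>\n'
    "continues here <br>\n"
    "<hr/>\n"
    "Body of page two. <br>\n"
    " <br>\n"
    "2 Second footnote <br>\n"
    "<hr/>\n"
    "Body of page three.\n"
)

# Greedy: (.*\n)* takes as many lines as it can, then backtracks,
# so the match runs to the LAST <hr/> in the text.
greedy = re.compile(r'^ <br>\n\d+\s[a-zA-Z"“](.*\n)*<hr/>', re.MULTILINE)

# Lazy: (.*?\n)*? adds one line at a time and stops at the FIRST <hr/>.
lazy = re.compile(r'^ <br>\n\d+\s[a-zA-Z"“](.*?\n)*?<hr/>', re.MULTILINE)

greedy_match = greedy.search(text).group(0)
lazy_match = lazy.search(text).group(0)

print(greedy_match.count("<hr/>"))  # 2: swallowed both footnotes and page two
print(lazy_match.count("<hr/>"))    # 1: only the first footnote block
```

The lazy version is the one that behaves the way the question wants: each match covers one footnote block and nothing beyond its page break.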
In any case, this is a job best done with Perl, which is where most advanced regular expression features originated.
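use strict;
use diagnostics;

# Sample page: body text, then a footnote block that starts with " <br>"
# on a line of its own and runs up to the <hr/> page break.
our $text =<<EOF;
The figure that now stood by its bows was tall and swart, with one white tooth <br>
evilly protruding from its steel-like lips. <br>
 <br>
1 "Hardly" had they pulled out from under the ship’s lee, when a <br>
fourth keel, coming from the windward side, pulled round under the stern, <br>
and showed the five strangers <br>
127 <br>
<br>
<hr/>
More text.
EOF

# i: case-insensitive, s: . also matches newline, m: ^ matches at each line start.
our $regex = qr{^ <br>\n\d+ +[A-Z"“].*?<hr/>}ism;

# Replace the first footnote block; the parentheses capture it in $1.
$text =~ s/($regex)/<!-- Removed -->/;
print "Removed text:\n[$1]\n\n";
print "New text:\n[$text]\n";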
This prints:
Removed text:
[ <br>
1 "Hardly" had they pulled out from under the ship’s lee, when a <br>
fourth keel, coming from the windward side, pulled round under the stern, <br>
and showed the five strangers <br>
127 <br>
<br>
<hr/>]
New text:
[The figure that now stood by its bows was tall and swart, with one white tooth <br>
evilly protruding from its steel-like lips. <br>
<!-- Removed -->
More text.
]
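The qr operator builds a regular expression so that it can be stored in a variable. The ^ at the start anchors the match at the beginning of a line. The ism at the end stands for i (case-insensitive), s (treat the text as a single string), and m (treat the text as multiple embedded lines). s allows . to match newlines. m allows ^ to match at the start of each line embedded in the string. You can add a g flag at the end of the substitution to make the replacement global: s///g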
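For readers who are not using Perl, a rough Python equivalent of the same flags plus a global replacement might look like the sketch below. The two-page sample text and the names footnote, removed, and cleaned are invented for this example. re.IGNORECASE, re.DOTALL, and re.MULTILINE correspond to i, s, and m, and re.sub replaces every match by default, much like s///g.

```python
import re

# Hypothetical two-page sample; each page ends with a footnote block and <hr/>.
text = (
    "Page one body. <br>\n"
    " <br>\n"
    '1 "Hardly" had they pulled out <br>\n'
    "127 <br>\n"
    "<hr/>\n"
    "Page two body. <br>\n"
    " <br>\n"
    "2 another footnote <br>\n"
    "128 <br>\n"
    "<hr/>\n"
    "Page three body.\n"
)

# Same pattern as the Perl qr{...}ism:
#   IGNORECASE ~ i, DOTALL ~ s (. matches newline), MULTILINE ~ m (^ at each line start).
footnote = re.compile(
    r'^ <br>\n\d+ +[A-Z"“].*?<hr/>',
    re.IGNORECASE | re.DOTALL | re.MULTILINE,
)

# Collect every footnote block that is about to be dropped.
removed = footnote.findall(text)

# sub() replaces all matches, which is the behaviour s///g gives in Perl.
cleaned = footnote.sub("<!-- Removed -->", text)

print(len(removed))  # 2
print(cleaned)
```

Because the pattern consumes everything up to <hr/>, the page-number lines (127, 128) disappear together with the footnotes, which matches the Perl output above.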
The Perl regular expression documentation explains everything:
https://perldoc.perl.org/perlretut
HTH