Perl regex repetition match
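I am seeing strange behavior with the "?" regex quantifier. I am processing a log file in which I search for specific HTTP error responses, e.g. 401. A line may or may not contain the response body, so I want to match both cases.
I have the following code.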
#!/usr/bin/perl
$match = 'response 401';
$line = '2021-04-08 07:15:01 | INFO | [http-nio-8080-exec-11] | rId:123456789 | ip:127.0.0.1 | activationId: abcdefg | user: admin | response 401: headers: Cache-Control: [no-cache, no-store, max-age=0, must-revalidate] / Content-Length: [60] / Content-Type: [application/json;charset=UTF-8] / Date: [Thu, 08 Apr 2021 05:15:01 GMT] / Expires: [0] / Pragma: [no-cache] | body: {"errors":[{"message":"Bad credentials","repeatable":true}]}';
my($tstamp, $level, $thread, $body) = $line =~ m/^(.*?)\s+\|\s+(\w+)\s+\|\s+\[(.*?)\].*?$match.*?(?:body\:\s+({.*}))?/;
if($body) {
print "body: $body\n";
}
This prints nothing. I expected .*?$match.*? to match the smallest possible part of the line and leave enough room for the body pattern. But apparently it does not.
When I change the regex, removing the ? from the body pattern so that the body becomes mandatory, the line matches.
my($tstamp, $level, $thread, $body) = $line =~ m/^(.*?)\s+\|\s+(\w+)\s+\|\s+\[(.*?)\].*?$match.*?(?:body\:\s+({.*}))/;
But this will not match lines without a body.
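What is wrong with the regex? I suspect that the non-greedy .*? before the (?:body...)? pattern does not consume the input, because the optional body group is satisfied without matching anything.
How do I write the correct regex?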
You can make the part containing group 4 optional and assert the end of the string.
^(.*?)\s+\|\s+(\w+)\s+\|\s+\[([^][]*)\].*?(?:\s+\|\s+body:\s+({.*}))?$
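^
Start of string
(.*?)\s+\|
Capture group 1: match as few characters as possible, then match whitespace and a |
\s+(\w+)\s+\|
Match whitespace, capture 1+ word characters in group 2, then match whitespace and a |
\s+\[([^][]*)\]
Match whitespace and capture in group 3 everything between [ and ]
.*?
Match any characters, as few as possible
(?:\s+\|\s+body:\s+({.*}))?
Optionally match a | surrounded by whitespace, then body: and capture in group 4 everything from { to }
$
End of string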
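The effect of the $ anchor can be seen in isolation with a minimal sketch. The shortened log line below is a hypothetical stand-in for the real one, and the braces are escaped as \{ and \} only to avoid brace-related warnings on newer Perl versions; otherwise the patterns mirror the tail of the regexes above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Shortened, hypothetical log line; only the tail matters for this demo.
my $line = 'user: admin | response 401: headers: Pragma: [no-cache] | body: {"errors":[]}';

# Unanchored: the lazy .*? matches zero characters, the optional group
# matches nothing, and the overall match succeeds right after "401".
my ($unanchored) = $line =~ /response 401.*?(?:body:\s+(\{.*\}))?/;

# Anchored: stopping after zero characters no longer satisfies the pattern,
# so .*? has to expand until the optional group matches and the end of the
# string is reached.
my ($anchored) = $line =~ /response 401.*?(?:body:\s+(\{.*\}))?$/;

print "unanchored: ", (defined $unanchored ? $unanchored : "(undef)"), "\n";
print "anchored:   ", (defined $anchored   ? $anchored   : "(undef)"), "\n";
```

The first pattern leaves the capture undefined, which is the behavior described in the question; the second captures {"errors":[]}.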
Example code:
$match = 'response 401';
$line = '2021-04-08 07:15:01 | INFO | [http-nio-8080-exec-11] | rId:123456789 | ip:127.0.0.1 | activationId: abcdefg | user: admin | response 401: headers: Cache-Control: [no-cache, no-store, max-age=0, must-revalidate] / Content-Length: [60] / Content-Type: [application/json;charset=UTF-8] / Date: [Thu, 08 Apr 2021 05:15:01 GMT] / Expires: [0] / Pragma: [no-cache] | body: {"errors":[{"message":"Bad credentials","repeatable":true}]}';
my($tstamp, $level, $thread, $body) = $line =~ m/^(.*?)\s+\|\s+(\w+)\s+\|\s+\[([^][]*)\].*?(?:\s+\|\s+body:\s+({.*}))?$/;
if($body) {
print "body: $body\n";
}
Output
body: {"errors":[{"message":"Bad credentials","repeatable":true}]}
If there is no body, you still get the values of $tstamp, $level, and $thread.
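As a quick check of the no-body case, here is a small sketch that runs the same pattern against a hypothetical, shortened log line without a "| body: ..." part. The line content is an assumption made for illustration, and the braces are escaped only to keep newer Perl versions quiet.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical log line without a "| body: ..." part.
my $line = '2021-04-08 07:15:01 | INFO | [http-nio-8080-exec-11] | rId:123456789 | user: admin | response 401: headers: Pragma: [no-cache]';

my ($tstamp, $level, $thread, $body) = $line =~
    m/^(.*?)\s+\|\s+(\w+)\s+\|\s+\[([^][]*)\].*?(?:\s+\|\s+body:\s+(\{.*\}))?$/;

print "tstamp: $tstamp\n";
print "level:  $level\n";
print "thread: $thread\n";
# Group 4 did not take part in the match, so $body is undef here.
print "body:   ", (defined $body ? $body : "(none)"), "\n";
```

Because an unmatched optional group yields undef rather than an empty string, testing with defined is the safer check when the body may legitimately be absent.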
With your shown samples, please try the following.
^(\d{4}-\d{2}-\d{2}\s*(?:\d{2}:){2}\d{2})\s+\|\s+(\S+)\s+\|\s+\[([^]]*)\].*?(body.*)?$
Explanation: a detailed breakdown of the regex above.
^ ## Match the start of the string.
(\d{4}-\d{2}-\d{2}\s*(?:\d{2}:){2}\d{2}) ## 1st capturing group: the timestamp (date, optional whitespace, time).
\s+\|\s+ ## Match whitespace, a pipe, and whitespace (one or more whitespace characters on each side).
(\S+) ## 2nd capturing group: one or more non-whitespace characters, i.e. the level (INFO/WARN/ERROR etc.).
\s+\|\s+\[ ## Match whitespace, a pipe, whitespace, and the opening [.
([^]]*) ## 3rd capturing group: everything up to the next ].
\].*? ## Match the closing ], then lazily match any following characters.
(body.*)?$ ## 4th capturing group, optional: everything from "body" to the end of the line.
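For reference, the same pattern can be written with the /x modifier so that the breakdown lives next to each piece. This is only a sketch: the shortened log line is a hypothetical sample, and the variable names are borrowed from the question.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Same pattern as above, spread out with the x modifier for readability.
my $re = qr/
    ^(\d{4}-\d{2}-\d{2}\s*(?:\d{2}:){2}\d{2})   # group 1: timestamp
    \s+\|\s+(\S+)                                # group 2: log level
    \s+\|\s+\[([^]]*)\]                          # group 3: thread name
    .*?                                          # lazily skip ahead
    (body.*)?$                                   # group 4: optional body tail
/x;

# Hypothetical, shortened log line.
my $line = '2021-04-08 07:15:01 | INFO | [http-nio-8080-exec-11] | user: admin | response 401: headers: Pragma: [no-cache] | body: {"errors":[]}';

my ($tstamp, $level, $thread, $body) = $line =~ $re;

print "tstamp: $tstamp\n";
print "level:  $level\n";
print "thread: $thread\n";
print "body:   ", (defined $body ? $body : "(none)"), "\n";
```

Note that group 4 here starts at the word body, so the captured value includes the "body: " prefix, unlike the ({.*}) group in the other answer, which captures only the braces and their contents.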