Regex capture group which might not be present

Question

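I am running a regular expression against some log files. The capture groups should capture a few relevant fields. I also want to know whether the log file mentions a successful end of the job, which can be concluded from the presence of the string "Job executed successfully".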

My regex so far: ^Job started at\s'(\d+\s\d+:\d+:\d+:\d+)'\s+orderno\s+-\s+'(\w+)'\s+runno\s+-\s+'(\d+)'[\s\S]+Host1\s'([\w.]+)'\[([\w-]+)\] username '([\w\\]+)' - Host2\s'([\w.]+)'\[([\w-]+)\] username '([\w\\]+)'[\s\S]+(Job executed successfully)?[\s\S]+Job ended at\s'(\d+\s\d+:\d+:\d+:\d+)'\s+Elapsed time\s\[([\d.]+)sec\]\sCPU usage\s\[([\d.]+)sec]

(I am new to regex, so it is far from perfect and could use some hardening.)

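Sample log that ended successfully (here the regex above only works when the question mark after "(Job executed successfully)?" is removed, which I think should not be necessary):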

Job started at '0902 23:56:00:367' orderno - '0tzh0' runno - '00064' Number of transfers - 1

Host1 'Local'[Windows-LOCAL] username 'xxx\xxx' - Host2 'xxx.xxx.xx'[Unix-SFTP] username 'xxx'

Local host is: xxx - Windows 200x [601] Service Pack 1 build 7601 - Intel64 Family 6 Model 37 Stepping 1, GenuineIntel

********** Starting transfer #1 out of 1 *************** Transfer #1 completed successfully

Job executed successfully. exiting.

Job ended at '0902 23:56:07:138' Elapsed time [7sec] CPU usage [0.15sec]

Sample log that ended in failure (here the regex above works fine):

Job started at '0831 15:26:00:365' orderno - '0tuq5' runno - '00030' Number of transfers - 4

Host1 'Local'[Windows-LOCAL] username 'xxx\xxx' - Host2 'xxx.xxx.xx'[Unix-SFTP] username 'xxx'

Local host is: xxx - Windows 200x [601] Service Pack 1 build 7601 - Intel64 Family 6 Model 37 Stepping 1, GenuineIntel

********** Starting transfer #1 out of 4 *************** Unable to connect to SSH server on 'xxx.xxx.xx': SFTP_Connect : psftp_connect failed : ssh_init: Network error: Connection timed out .

Connection to host sftp.onenet.be could not be established

Job ended at '0831 15:26:21:426'

Elapsed time [21sec] CPU usage [0.0sec]

Answer 1

If you are using PCRE, you can use the very handy \Q...\E sequence together with a negative lookahead:

^\QJob started\E
(?:(?!\QJob ended\E).)+?
^\QJob executed successfully\E

See a demo on regex101.com (mind the multiline, verbose and singleline modifiers!).

If not, the whole expression becomes somewhat unreadable:

^Job started(?:(?!Job ended).)+?^Job executed successfully
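For readers not on PCRE, here is a minimal Python sketch of the same idea. Python's `re` module does not support \Q...\E, so `re.escape()` stands in for the quoting. The function name `job_succeeded` and the abbreviated logs are illustrative assumptions, not part of the answer.

```python
import re

# Python's `re` has no \Q...\E, so re.escape() quotes the literal strings instead.
START = re.escape("Job started")
END = re.escape("Job ended")
OK = re.escape("Job executed successfully")

# Tempered dot: consume one character at a time, but never step over "Job ended".
# MULTILINE makes ^ match at every line start; DOTALL lets . match newlines.
SUCCESS_RE = re.compile(
    rf"^{START}(?:(?!{END}).)+?^{OK}",
    re.MULTILINE | re.DOTALL,
)


def job_succeeded(log_text: str) -> bool:
    """Return True if a success marker appears between job start and job end."""
    return SUCCESS_RE.search(log_text) is not None


# Abbreviated versions of the two sample logs from the question.
ok_log = """Job started at '0902 23:56:00:367' orderno - '0tzh0' runno - '00064'
Transfer #1 completed successfully
Job executed successfully. exiting.
Job ended at '0902 23:56:07:138'
"""

failed_log = """Job started at '0831 15:26:00:365' orderno - '0tuq5' runno - '00030'
Connection to host sftp.onenet.be could not be established
Job ended at '0831 15:26:21:426'
"""

print(job_succeeded(ok_log))      # True
print(job_succeeded(failed_log))  # False
```

Note that this only answers the yes/no question; it does not capture the other fields.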

Answer 2

With minimal changes to your regex, you can use this:

^Job started at\s'(\d+\s\d+:\d+:\d+:\d+)'\s+orderno\s+-\s+'(\w+)'\s+runno\s+-\s+'(\d+)'[\s\S]+?Host1\s'([\w.]+)'\[([\w-]+)\] username '([\w\\]+)' - Host2\s'([\w.]+)'\[([\w-]+?)\] username '([\w\\]+)'[\s\S]+?(?:(Job executed successfully)[\s\S]+?)?Job ended at\s'(\d+\s\d+:\d+:\d+:\d+)'\s+Elapsed time\s\[([\d.]+)sec\]\sCPU usage\s\[([\d.]+)sec]
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------^^^-----------------------------------^^

(The ^ marks above indicate the main change: the success string and the [\s\S]+? that follows it are wrapped together in an optional non-capturing group.)

I also converted some of the quantifiers to lazy ones, which should make things a bit faster.

regex101 demo
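As a rough check outside regex101, the pattern can be run against both sample logs from the question in Python. This is only a sketch: the helper `parse_job` and its dictionary keys are made-up names, and the username character class is written as `[\w\\]+` on the assumption that it is meant to match the backslash in 'xxx\xxx'.

```python
import re

# The answer's pattern, split across lines for readability only.
JOB_RE = re.compile(
    r"^Job started at\s'(\d+\s\d+:\d+:\d+:\d+)'"
    r"\s+orderno\s+-\s+'(\w+)'\s+runno\s+-\s+'(\d+)'"
    r"[\s\S]+?Host1\s'([\w.]+)'\[([\w-]+)\] username '([\w\\]+)'"
    r" - Host2\s'([\w.]+)'\[([\w-]+?)\] username '([\w\\]+)'"
    r"[\s\S]+?(?:(Job executed successfully)[\s\S]+?)?"  # optional group 10
    r"Job ended at\s'(\d+\s\d+:\d+:\d+:\d+)'"
    r"\s+Elapsed time\s\[([\d.]+)sec\]\sCPU usage\s\[([\d.]+)sec]",
    re.MULTILINE,
)

SUCCESS_LOG = r"""Job started at '0902 23:56:00:367' orderno - '0tzh0' runno - '00064' Number of transfers - 1
Host1 'Local'[Windows-LOCAL] username 'xxx\xxx' - Host2 'xxx.xxx.xx'[Unix-SFTP] username 'xxx'
Local host is: xxx - Windows 200x [601] Service Pack 1 build 7601 - Intel64 Family 6 Model 37 Stepping 1, GenuineIntel
********** Starting transfer #1 out of 1 *************** Transfer #1 completed successfully
Job executed successfully. exiting.
Job ended at '0902 23:56:07:138' Elapsed time [7sec] CPU usage [0.15sec]
"""

FAILED_LOG = r"""Job started at '0831 15:26:00:365' orderno - '0tuq5' runno - '00030' Number of transfers - 4
Host1 'Local'[Windows-LOCAL] username 'xxx\xxx' - Host2 'xxx.xxx.xx'[Unix-SFTP] username 'xxx'
Local host is: xxx - Windows 200x [601] Service Pack 1 build 7601 - Intel64 Family 6 Model 37 Stepping 1, GenuineIntel
********** Starting transfer #1 out of 4 *************** Unable to connect to SSH server on 'xxx.xxx.xx': SFTP_Connect : psftp_connect failed : ssh_init: Network error: Connection timed out .
Connection to host sftp.onenet.be could not be established
Job ended at '0831 15:26:21:426'
Elapsed time [21sec] CPU usage [0.0sec]
"""


def parse_job(log_text):
    """Return a few captured fields, or None if the log does not match."""
    m = JOB_RE.search(log_text)
    if m is None:
        return None
    return {
        "started": m.group(1),
        "orderno": m.group(2),
        "runno": m.group(3),
        # An optional group that did not participate in the match is None.
        "succeeded": m.group(10) is not None,
        "ended": m.group(11),
        "elapsed_sec": float(m.group(12)),
        "cpu_sec": float(m.group(13)),
    }


print(parse_job(SUCCESS_LOG)["succeeded"])  # True
print(parse_job(FAILED_LOG)["succeeded"])   # False
```

In Python, a group that did not take part in the match is reported as `None`, so testing `m.group(10) is not None` is enough to tell the two cases apart.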

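Because [\s\S]+ is greedy, your current regex first matches everything up to the end of the input and then backtracks (going from right to left). At each step back it tests (Job executed successfully)?[\s\S]+, and since the group is optional it is allowed to match nothing, so the second [\s\S]+ succeeds as soon as Job ended can be found right after it. The engine never has to go back far enough to reach Job executed successfully, so the group stays empty.

With the approach above, we check each character from left to right until we reach the part we need, i.e. Job executed successfully, if it is present.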
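The difference can be reproduced on a much smaller input. The following Python sketch uses a shortened, made-up text and reduced patterns (only the greedy/lazy part of the original regexes) to show which variant fills the group:

```python
import re

# Shortened stand-in for a log; "start" replaces the long header part.
text = "start\nJob executed successfully\nJob ended"

# Greedy, as in the question: the first [\s\S]+ swallows the success marker
# and the optional group is satisfied by matching nothing.
greedy = re.search(
    r"start[\s\S]+(Job executed successfully)?[\s\S]+Job ended", text
)

# Lazy with the optional wrapper, as in the answer: the engine walks left to
# right and tries the group at every position before it tries "Job ended".
lazy = re.search(
    r"start[\s\S]+?(?:(Job executed successfully)[\s\S]+?)?Job ended", text
)

print(greedy.group(1))  # None
print(lazy.group(1))    # Job executed successfully
```

Both patterns match the whole text; only the lazy variant gives the optional group a chance to participate before the marker is consumed.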

regex

logfile

capturing-group