Parsing complicated debug logs with regex/grep/awk/sed
I have debug logs that are GBs in size and full of extraneous data. A single log entry can exceed 1,000,000 lines; some sections are indented, some are not, and there is little consistency beyond the opening timestamp of each entry. Every new entry starts with a timestamp, ^202[0-9]/[0-9]{2}/[0-9]{2} blah blah, so entries are easy to identify, but an entry can be followed by a very large number of lines that belong to it. I have been using python to locate the search strings, then walking up to find the parent entry they belong to, then walking down to the next instance of ^202[0-9]/[0-9]{2}/[0-9]{2} blah blah to find the end of the entry. Unfortunately this is nowhere near performant enough to make it a painless process. I am now trying to get grep to do the same thing with a regex, since grep seems to be in a different world speed-wise. I have also run into python version (2 vs 3) differences across the machines I use, which has been a pain.
This is what I have with grep so far. It works on small test cases but not on large files, so it obviously has some performance problems. How can I fix it? Or maybe there is a good way to do this with awk?
grep -E "(?i)^20[0-9]{2}\/[0-9]{2}\/[0-9]{2}[\s\S]+00:00:00:fc:77:00[\s\S]+?(?=^20[0-9]{2}\/[0-9]{2}\/[0-9]{2}|\Z)"
The key string I am looking for is 00:00:00:fc:77:00
Sample:
2022/01/28 17:58:45.408 {Engine-Worker-08} <radiusItem.request-dump> Request packet dump:
Type=1, Ident=160, Len=54, Auth=7D 12 89 48 19 85 00 00 00 00 00 00 12 0C CC 22
...
hundreds of thousands of lines of nonsense that might have my search string in it, with little to no consistency
...
2022/01/28 17:58:45.408 {Engine-Worker-16} <radiusItem.request-dump> Request packet dump:
...
hundreds of thousands of lines of nonsense that might have my search string in it, with little to no consistency
...
2022/01/28 17:58:46.127 {TcpEngine-3} <tcp-service> Accept failure: Invalid Radius/TLS client 1.1.1.1, connection closed
2022/01/28 17:58:48.604 {Engine-Worker-60} [acct:callAcctBkgFlow] <engine.item.setup> Call method ==> acct:readAcctPropFile
...
hundreds of thousands of lines of nonsense that might have my search string in it, with little to no consistency
...
If any of these entries contains my search string, I want the whole chunk between the timestamps, all several thousand lines of it.
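Before the awk ideas, a note on why the posted grep command cannot work as written. grep -E is POSIX ERE: it has no inline flags like (?i), no lazy quantifiers, no lookaheads like (?=...), and no \Z anchor; those are PCRE features (grep -P, where your grep supports it). More fundamentally, grep matches one line at a time, so no single-line grep pattern can span the newline between a timestamp line and the lines that follow it:

```shell
# A timestamp line matches fine on its own...
printf '2022/01/28 x\nkey\n' | grep -cE '^20[0-9]{2}/'        # prints 1
# ...but a pattern can never cross the newline to reach the next line.
printf '2022/01/28 x\nkey\n' | grep -cE '^20[0-9]{2}/.*key'   # prints 0
# GNU grep's -Pz turns the whole input into one record, which allows
# multi-line matches but pulls the entire multi-GB log into memory.
```

So even a PCRE-capable grep would need -z (whole file as one record) to do this, which runs straight into the memory constraint below.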
Assumptions:
- only timestamp lines start with ^YYYY/MM/DD
- timestamp lines start with a string of the format YYYY/MM/DD HH:MM:SS.sss {<thread_name>}, and that string is unique within the file
- search strings do not contain embedded newlines
- a search string is guaranteed not to be broken across lines in the log file
- a single log entry (OP: can be 1,000,000+ lines long) may be too big to fit in memory
- multiple search patterns are required
Setup:
$ cat search_strings
00:00:00:fc:77:00
and this line of text
$ cat log.txt
2022/01/28 17:58:45.408 {Engine-Worker-08} <radiusItem.request-dump> Request packet dump:
Type=1, Ident=160, Len=54, Auth=7D 12 89 48 19 85 00 00 00 00 00 00 12 0C CC 22
MATCH should match on this: 00:00:00:fc:77:00
...
hundreds of thousands of lines of nonsense that might have my search string in it, with little to no consistency
...
2022/01/28 17:58:45.408 {Engine-Worker-16} <radiusItem.request-dump> Request packet dump:
...
hundreds of thousands of lines of nonsense that might have my search string in it, with little to no consistency
...
2022/01/28 17:58:46.127 {TcpEngine-3} <tcp-service> Accept failure: Invalid Radius/TLS client 1.1.1.1, connection closed
MATCH should match on this: 00:00:00:fc:77:00
2022/01/28 17:58:48.604 {Engine-Worker-60} [acct:callAcctBkgFlow] <engine.item.setup> Call method ==> acct:readAcctPropFile
...
MATCH should also match on this: and this line of text :if we've got the logic right
hundreds of thousands of lines of nonsense that might have my search string in it, with little to no consistency
...
NOTE: the lines in log.txt that should produce matches start with ^MATCH
One awk idea requires two passes through the log file:
awk '
FNR==NR { strings[$0]; next }
FNR==1  { pass++ }
/^20[0-9][0-9]\/[0-1][0-9]\/[0-3][0-9] / { dtt = $1 FS $2 FS $3
                                           print_header=1
                                         }
pass==1 { for (string in strings)
              if (match($0,string)) {        # search for our strings in current line and if found ...
                 dttlist[dtt]                # save current date/time/thread
                 next
              }
        }
pass==2 &&
(dtt in dttlist) { if (print_header) {
                      print "################# matching block:"
                      print_header=0
                   }
                   print
                 }
' search_strings log.txt log.txt
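If the search strings are literal strings rather than regexes, the first (expensive) pass can plausibly be handed to grep -F, which is usually far faster than a per-line, per-string match() loop in awk; awk is then only used to map the matching line numbers back to their blocks. A sketch (the file name hits.tmp is made up for this example):

```shell
# Pass 0: grep -nF emits "LINENO:line" for every line containing any of the
# literal search strings; keep just the line numbers.
grep -nFf search_strings log.txt | cut -d: -f1 > hits.tmp

# Then two awk passes over the log: pass 1 records which block header owns
# each hit line, pass 2 prints every block whose header was recorded.
awk '
FILENAME=="hits.tmp" { hit[$1]; next }     # line numbers that contain a hit
FNR==1  { pass++ }
/^20[0-9][0-9]\/[0-1][0-9]\/[0-3][0-9] / { dtt = $1 FS $2 FS $3; print_header=1 }
pass==1 && (FNR in hit)     { dttlist[dtt] }            # save owning block
pass==2 && (dtt in dttlist) { if (print_header) {
                                 print "################# matching block:"
                                 print_header=0
                              }
                              print
                            }
' hits.tmp log.txt log.txt
rm -f hits.tmp
```

The FILENAME guard (rather than the usual FNR==NR) keeps the script correct even when grep finds nothing and hits.tmp is empty.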
Assuming memory usage is not an issue, another awk idea requires a single pass through the log file:
awk '
function print_lines(   i) {
    for (i=0; i<=lineno; i++)
        if (i in lines)                # also covers a match on the timestamp line itself (lineno==0)
            print lines[i]
    delete lines
    lineno=0
}
FNR==NR { strings[$0]; next }
/^20[0-9][0-9]\/[0-1][0-9]\/[0-3][0-9] / { dtt = $1 FS $2 FS $3 }
dtt != prev_dtt { delete lines
                  lineno=0
                  lines[0]="################# matching block:"
                  printme=0            # disable printing of lines to stdout;
                                       # instead they will be saved to the lines[] array
                  prev_dtt=dtt
                }
! printme { for (string in strings)
                if (match($0,string)) {
                   print_lines()       # flush any lines in the lines[] array and ...
                   printme=1           # set flag to print new lines to stdout
                   break
                }
            if (! printme) {           # if not printing lines to stdout then ...
               lines[++lineno]=$0      # save the current line in the lines[] array
            }
          }
printme                                # printme==1 => print current line to stdout
' search_strings log.txt
Both of these produce:
################# matching block:
2022/01/28 17:58:45.408 {Engine-Worker-08} <radiusItem.request-dump> Request packet dump:
Type=1, Ident=160, Len=54, Auth=7D 12 89 48 19 85 00 00 00 00 00 00 12 0C CC 22
MATCH should match on this: 00:00:00:fc:77:00
...
hundreds of thousands of lines of nonsense that might have my search string in it, with little to no consistency
...
################# matching block:
2022/01/28 17:58:46.127 {TcpEngine-3} <tcp-service> Accept failure: Invalid Radius/TLS client 1.1.1.1, connection closed
MATCH should match on this: 00:00:00:fc:77:00
################# matching block:
2022/01/28 17:58:48.604 {Engine-Worker-60} [acct:callAcctBkgFlow] <engine.item.setup> Call method ==> acct:readAcctPropFile
...
MATCH should also match on this: and this line of text :if we've got the logic right
hundreds of thousands of lines of nonsense that might have my search string in it, with little to no consistency
...
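A compact third sketch, for awks that support a regex RS (gawk, mawk, busybox awk): make the timestamp the record separator so that each record is one whole entry. RT is GNU awk only (empty elsewhere, which merely drops the date from the reconstructed header); and since one entire entry is buffered as a single record, this conflicts with the "entry may not fit in memory" assumption above. Shown here with a single fixed search string:

```shell
awk '
BEGIN { RS = "\n20[0-9][0-9]/[0-1][0-9]/[0-3][0-9] " }   # entry boundary
{
  rec = hdr $0            # re-attach the date that the RS match chopped off
  hdr = substr(RT, 2)     # RT starts with \n; save the rest for next record
  if (index(rec, "00:00:00:fc:77:00"))
      print "################# matching block:\n" rec
}
' log.txt
```

Where entries are known to be modestly sized, this avoids both the second pass and the per-block bookkeeping of the versions above.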