Parsing complicated debug logs with regex/grep/awk/sed

I have debug logs gigabytes in size containing tons of extraneous data. A single log entry can exceed 1,000,000 lines, some parts are indented, some aren't, and there is little to no consistency apart from the opening timestamp at the start of each entry. Every new entry begins with a timestamp, ^202[0-9]/[0-9]{2}/[0-9]{2} blah blah, so it is easy to identify, but many lines after it can belong to that entry. I have been using python to locate a text string, then walk up to find the parent entry it belongs to, then walk down to the end of that entry where the next instance of ^202[0-9]/[0-9]{2}/[0-9]{2} blah blah begins; unfortunately this is nowhere near performant enough to make the process painless. I am now trying to get grep to do the same thing with a regex, since grep seems to be in a different league speed-wise. I have also run into python version differences (2 vs 3) on the machines I am using, which has been a pain.

Here is what I have with grep so far. It works on small test cases but not on large files, so it clearly has some performance problems. How can I work around that? Or is there perhaps a good way to do this with awk?

grep -E "(?i)^20[0-9]{2}\/[0-9]{2}\/[0-9]{2}[\s\S]+00:00:00:fc:77:00[\s\S]+?(?=^20[0-9]{2}\/[0-9]{2}\/[0-9]{2}|\Z)"

The key string I am looking for is 00:00:00:fc:77:00
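
For what it's worth, the constructs in that pattern (the inline (?i), the lookahead (?=...), and \Z) are PCRE features, so GNU grep needs -P rather than -E, and matching across line boundaries needs -z so that the whole file is read as one record. A rough sketch along those lines, with the quantifiers tempered so one match cannot swallow several blocks; it still pulls the entire file into memory and may be slow or hit PCRE's backtracking limits on multi-GB logs:

grep -Pzo '(?m)^20[0-9]{2}/[0-9]{2}/[0-9]{2} (?:(?!^20[0-9]{2}/)[\s\S])*?00:00:00:fc:77:00(?:(?!^20[0-9]{2}/)[\s\S])*(?=^20[0-9]{2}/|\z)' log.txt | tr '\0' '\n'

With -z each match is printed NUL-terminated, hence the tr at the end to turn the terminators back into newlines.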

Sample

2022/01/28 17:58:45.408 {Engine-Worker-08} <radiusItem.request-dump> Request packet dump:
  Type=1, Ident=160, Len=54, Auth=7D 12 89 48 19 85 00 00 00 00 00 00 12 0C CC 22 
... 
hundreds of thousands of lines of nonsense that might have my search string in it, with little to no consistency 
...
2022/01/28 17:58:45.408 {Engine-Worker-16} <radiusItem.request-dump> Request packet dump:
... 
hundreds of thousands of lines of nonsense that might have my search string in it, with little to no consistency 
...
2022/01/28 17:58:46.127 {TcpEngine-3} <tcp-service> Accept failure: Invalid Radius/TLS client 1.1.1.1, connection closed
2022/01/28 17:58:48.604 {Engine-Worker-60} [acct:callAcctBkgFlow] <engine.item.setup> Call method ==> acct:readAcctPropFile
... 
hundreds of thousands of lines of nonsense that might have my search string in it, with little to no consistency 
...

If any of those lines contain my search string, I want the entire chunk of content between the timestamps, all several thousand lines of it.

Assumptions:

  • only the timestamp lines start with ^YYYY/MM/DD
  • timestamp lines start with a string of the format YYYY/MM/DD HH:MM:SS.sss {<thread_name>}, and that string is unique within the file
  • a search string does not contain an embedded newline
  • a search string is guaranteed not to be broken across lines in the log file
  • a single log entry (OP: can be more than 1,000,000 lines long) may be too large to fit in memory
  • there are multiple search patterns to look for

Setup:

$ cat search_strings
00:00:00:fc:77:00
and this line of text

$ cat log.txt
2022/01/28 17:58:45.408 {Engine-Worker-08} <radiusItem.request-dump> Request packet dump:
  Type=1, Ident=160, Len=54, Auth=7D 12 89 48 19 85 00 00 00 00 00 00 12 0C CC 22
MATCH  should match on this: 00:00:00:fc:77:00
...
hundreds of thousands of lines of nonsense that might have my search string in it, with little to no consistency
...
2022/01/28 17:58:45.408 {Engine-Worker-16} <radiusItem.request-dump> Request packet dump:
...
hundreds of thousands of lines of nonsense that might have my search string in it, with little to no consistency
...
2022/01/28 17:58:46.127 {TcpEngine-3} <tcp-service> Accept failure: Invalid Radius/TLS client 1.1.1.1, connection closed
MATCH  should match on this: 00:00:00:fc:77:00
2022/01/28 17:58:48.604 {Engine-Worker-60} [acct:callAcctBkgFlow] <engine.item.setup> Call method ==> acct:readAcctPropFile
...
MATCH  should also match on this: and this line of text :if we've got the logic right
hundreds of thousands of lines of nonsense that might have my search string in it, with little to no consistency
...

NOTE: the log.txt lines we expect to match on start with ^MATCH

One awk idea that requires two passes through the log file (the log is listed twice on the command line; the first pass only records the date/time/thread of each block that contains a search string, the second pass prints those blocks):

awk '
FNR==NR        { strings[$0]; next }            # 1st file (search_strings): save the search strings

FNR==1         { pass++ }

/^20[0-9][0-9]\/[0-1][0-9]\/[0-3][0-9] / { dtt = $1 FS $2 FS $3      # date, time and {thread} uniquely identify this block
                                           print_header=1 
                                         }

pass==1        { for (string in strings)
                     if (match($0,string)) {           # search for our strings in current line and if found ...
                        dttlist[dtt]                   # save current date/time/thread
                        next
                     }
               }

pass==2 && 
(dtt in dttlist) { if (print_header) {
                      print "################# matching block:"
                      print_header=0
                   }
                   print
                 }
' search_strings log.txt log.txt
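
For reference, one way to run the two-pass version against the sample setup above; extract_blocks.awk is a hypothetical file name holding just the script body (everything between the single quotes):

$ awk -f extract_blocks.awk search_strings log.txt log.txt > matching_blocks.txt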

Assuming memory usage is not an issue, another awk idea that requires a single pass through the log file (the lines of the current block are buffered in an array until a search string is found in that block; once found the buffer is flushed and the rest of the block is streamed straight to stdout):

awk '

function print_lines() {

    if (lineno > 0)
       for (i=0; i<=lineno; i++)
           print lines[i]

    delete lines
    lineno=0
}

FNR==NR         { strings[$0]; next }           # 1st file (search_strings): save the search strings

/^20[0-9][0-9]\/[0-1][0-9]\/[0-3][0-9] / { dtt = $1 FS $2 FS $3 }    # date, time and {thread} identify the current block

dtt != prev_dtt { delete lines
                  lineno=0
                  lines[0]="################# matching block:"

                  printme=0                     # disable printing of lines to stdout;
                                                # instead they will be saved to the lines[] array
                  prev_dtt=dtt
                }

! printme      { for (string in strings)
                     if (match($0,string)) { 
                        print_lines()          # flush any lines in the lines[] array and ...
                        printme=1              # set flag to print new lines to stdout
                        break
                     }
                 if (! printme)  {             # if not printing lines to stdout then ...
                    lines[++lineno]=$0         # save the current line in the lines[] array
                 }
               }
printme                                        # printme==1 => print current line to stdout
' search_strings log.txt

Both of these generate:

################# matching block:
2022/01/28 17:58:45.408 {Engine-Worker-08} <radiusItem.request-dump> Request packet dump:
  Type=1, Ident=160, Len=54, Auth=7D 12 89 48 19 85 00 00 00 00 00 00 12 0C CC 22
MATCH  should match on this: 00:00:00:fc:77:00
...
hundreds of thousands of lines of nonsense that might have my search string in it, with little to no consistency
...
################# matching block:
2022/01/28 17:58:46.127 {TcpEngine-3} <tcp-service> Accept failure: Invalid Radius/TLS client 1.1.1.1, connection closed
MATCH  should match on this: 00:00:00:fc:77:00
################# matching block:
2022/01/28 17:58:48.604 {Engine-Worker-60} [acct:callAcctBkgFlow] <engine.item.setup> Call method ==> acct:readAcctPropFile
...
MATCH  should also match on this: and this line of text :if we've got the logic right
hundreds of thousands of lines of nonsense that might have my search string in it, with little to no consistency
...
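
One gap versus the OP's grep attempt: the (?i) there implies case-insensitive matching, while match() in both scripts above is case-sensitive. A minimal tweak, assuming GNU awk, is to set IGNORECASE; a portable alternative is to lower-case both sides before testing:

BEGIN   { IGNORECASE = 1 }                     # GNU awk only: makes match() and other regexp operations case-insensitive

# portable alternative: store the search strings lower-cased ...
FNR==NR { strings[tolower($0)]; next }
# ... and lower-case the current line before testing it:
#                      if (match(tolower($0), string)) { ...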