Parsing complicated debug logs with regex/grep/awk/sed

I have debug logs gigabytes in size containing tons of extraneous data. A single log entry can exceed 1,000,000 lines, some parts are indented, some aren't, and there is little to no consistency apart from the opening timestamp at the start of each entry. Every new entry begins with a timestamp, ^202[0-9]/[0-9]{2}/[0-9]{2} blah blah, so it is easy to identify, but many lines after it can belong to that entry. I have been using python to locate a text string, then walk up to find the parent entry it belongs to, then walk down to the end of that entry where the next instance of ^202[0-9]/[0-9]{2}/[0-9]{2} blah blah begins; unfortunately this is nowhere near performant enough to make the process painless. I am now trying to get grep to do the same thing with a regex, since grep seems to be in a different league speed-wise. I have also run into python version differences (2 vs 3) on the machines I am using, which has been a pain.

Here is what I have with grep so far. It works on small test cases but not on large files, so it clearly has some performance problems. How can I work around that? Or is there perhaps a good way to do this with awk?

grep -E "(?i)^20[0-9]{2}\/[0-9]{2}\/[0-9]{2}[\s\S]+00:00:00:fc:77:00[\s\S]+?(?=^20[0-9]{2}\/[0-9]{2}\/[0-9]{2}|\Z)"

The key string I am looking for is 00:00:00:fc:77:00
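
For what it's worth, the constructs in that pattern (the inline (?i), the lookahead (?=...), and \Z) are PCRE features, so GNU grep needs -P rather than -E, and matching across line boundaries needs -z so that the whole file is read as one record. A rough sketch along those lines, with the quantifiers tempered so one match cannot swallow several blocks; it still pulls the entire file into memory and may be slow or hit PCRE's backtracking limits on multi-GB logs:

grep -Pzo '(?m)^20[0-9]{2}/[0-9]{2}/[0-9]{2} (?:(?!^20[0-9]{2}/)[\s\S])*?00:00:00:fc:77:00(?:(?!^20[0-9]{2}/)[\s\S])*(?=^20[0-9]{2}/|\z)' log.txt | tr '\0' '\n'

With -z each match is printed NUL-terminated, hence the tr at the end to turn the terminators back into newlines.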

Sample

2022/01/28 17:58:45.408 {Engine-Worker-08} <radiusItem.request-dump> Request packet dump:
  Type=1, Ident=160, Len=54, Auth=7D 12 89 48 19 85 00 00 00 00 00 00 12 0C CC 22 
... 
hundreds of thousands of lines of nonsense that might have my search string in it, with little to no consistency 
...
2022/01/28 17:58:45.408 {Engine-Worker-16} <radiusItem.request-dump> Request packet dump:
... 
hundreds of thousands of lines of nonsense that might have my search string in it, with little to no consistency 
...
2022/01/28 17:58:46.127 {TcpEngine-3} <tcp-service> Accept failure: Invalid Radius/TLS client 1.1.1.1, connection closed
2022/01/28 17:58:48.604 {Engine-Worker-60} [acct:callAcctBkgFlow] <engine.item.setup> Call method ==> acct:readAcctPropFile
... 
hundreds of thousands of lines of nonsense that might have my search string in it, with little to no consistency 
...

If any of those lines contain my search string, I want the entire chunk of content between the timestamps, all several thousand lines of it.

Assumptions:

  • only the timestamp lines start with ^YYYY/MM/DD
  • timestamp lines start with a string of the format YYYY/MM/DD HH:MM:SS.sss {<thread_name>}, and that string is unique within the file
  • a search string does not contain an embedded newline
  • a search string is guaranteed not to be broken across lines in the log file
  • a single log entry (OP: can be more than 1,000,000 lines long) may be too large to fit in memory
  • there are multiple search patterns to look for

Setup:

$ cat search_strings
00:00:00:fc:77:00
and this line of text

$ cat log.txt
2022/01/28 17:58:45.408 {Engine-Worker-08} <radiusItem.request-dump> Request packet dump:
  Type=1, Ident=160, Len=54, Auth=7D 12 89 48 19 85 00 00 00 00 00 00 12 0C CC 22
MATCH  should match on this: 00:00:00:fc:77:00
...
hundreds of thousands of lines of nonsense that might have my search string in it, with little to no consistency
...
2022/01/28 17:58:45.408 {Engine-Worker-16} <radiusItem.request-dump> Request packet dump:
...
hundreds of thousands of lines of nonsense that might have my search string in it, with little to no consistency
...
2022/01/28 17:58:46.127 {TcpEngine-3} <tcp-service> Accept failure: Invalid Radius/TLS client 1.1.1.1, connection closed
MATCH  should match on this: 00:00:00:fc:77:00
2022/01/28 17:58:48.604 {Engine-Worker-60} [acct:callAcctBkgFlow] <engine.item.setup> Call method ==> acct:readAcctPropFile
...
MATCH  should also match on this: and this line of text :if we've got the logic right
hundreds of thousands of lines of nonsense that might have my search string in it, with little to no consistency
...

NOTE: the log.txt lines we expect to match on start with ^MATCH

One awk idea that requires two passes through the log file (the log is listed twice on the command line; the first pass only records the date/time/thread of each block that contains a search string, the second pass prints those blocks):

awk '
FNR==NR        { strings[$0]; next }            # 1st file (search_strings): save the search strings

FNR==1         { pass++ }

/^20[0-9][0-9]\/[0-1][0-9]\/[0-3][0-9] / { dtt = $1 FS $2 FS $3      # date, time and {thread} uniquely identify this block
                                           print_header=1 
                                         }

pass==1        { for (string in strings)
                     if (match($0,string)) {           # search for our strings in current line and if found ...
                        dttlist[dtt]                   # save current date/time/thread
                        next
                     }
               }

pass==2 && 
(dtt in dttlist) { if (print_header) {
                      print "################# matching block:"
                      print_header=0
                   }
                   print
                 }
' search_strings log.txt log.txt
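
For reference, one way to run the two-pass version against the sample setup above; extract_blocks.awk is a hypothetical file name holding just the script body (everything between the single quotes):

$ awk -f extract_blocks.awk search_strings log.txt log.txt > matching_blocks.txt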

Assuming memory usage is not an issue, another awk idea that requires a single pass through the log file (the lines of the current block are buffered in an array until a search string is found in that block; once found the buffer is flushed and the rest of the block is streamed straight to stdout):

awk '

function print_lines() {

    if (lineno > 0)
       for (i=0; i<=lineno; i++)
           print lines[i]

    delete lines
    lineno=0
}

FNR==NR         { strings[$0]; next }           # 1st file (search_strings): save the search strings

/^20[0-9][0-9]\/[0-1][0-9]\/[0-3][0-9] / { dtt = $1 FS $2 FS $3 }    # date, time and {thread} identify the current block

dtt != prev_dtt { delete lines
                  lineno=0
                  lines[0]="################# matching block:"

                  printme=0                     # disable printing of lines to stdout;
                                                # instead they will be saved to the lines[] array
                  prev_dtt=dtt
                }

! printme      { for (string in strings)
                     if (match($0,string)) { 
                        print_lines()          # flush any lines in the lines[] array and ...
                        printme=1              # set flag to print new lines to stdout
                        break
                     }
                 if (! printme)  {             # if not printing lines to stdout then ...
                    lines[++lineno]=$0         # save the current line in the lines[] array
                 }
               }
printme                                        # printme==1 => print current line to stdout
' search_strings log.txt

Both of these generate:

################# matching block:
2022/01/28 17:58:45.408 {Engine-Worker-08} <radiusItem.request-dump> Request packet dump:
  Type=1, Ident=160, Len=54, Auth=7D 12 89 48 19 85 00 00 00 00 00 00 12 0C CC 22
MATCH  should match on this: 00:00:00:fc:77:00
...
hundreds of thousands of lines of nonsense that might have my search string in it, with little to no consistency
...
################# matching block:
2022/01/28 17:58:46.127 {TcpEngine-3} <tcp-service> Accept failure: Invalid Radius/TLS client 1.1.1.1, connection closed
MATCH  should match on this: 00:00:00:fc:77:00
################# matching block:
2022/01/28 17:58:48.604 {Engine-Worker-60} [acct:callAcctBkgFlow] <engine.item.setup> Call method ==> acct:readAcctPropFile
...
MATCH  should also match on this: and this line of text :if we've got the logic right
hundreds of thousands of lines of nonsense that might have my search string in it, with little to no consistency
...
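
One gap versus the OP's grep attempt: the (?i) there implies case-insensitive matching, while match() in both scripts above is case-sensitive. A minimal tweak, assuming GNU awk, is to set IGNORECASE; a portable alternative is to lower-case both sides before testing:

BEGIN   { IGNORECASE = 1 }                     # GNU awk only: makes match() and other regexp operations case-insensitive

# portable alternative: store the search strings lower-cased ...
FNR==NR { strings[tolower($0)]; next }
# ... and lower-case the current line before testing it:
#                      if (match(tolower($0), string)) { ...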