How to use awk to read data between regular time intervals

My log file is in the following format:

[30/Jan/2015:10:10:30 +0000] 12.30.30.204 xff=- reqId=[-] status_check len=- GET /api/getstatus HTTP/1.1 mi=- ec=- 200 425
[30/Jan/2015:10:11:00 +0000] 12.30.30.204 xff=- reqId=[-] status_check len=- GET /api/getstatus HTTP/1.1 mi=- ec=- 200 261
[30/Jan/2015:10:11:29 +0000] 12.30.30.204 xff=- reqId=[-] status_check len=- GET /api/getstatus HTTP/1.1 mi=- ec=- 200 232
[30/Jan/2015:10:12:00 +0000] 12.30.30.204 xff=- reqId=[-] status_check len=- GET /api/getstatus HTTP/1.1 mi=- ec=- 200 315
[30/Jan/2015:10:12:29 +0000] 12.30.30.204 xff=- reqId=[-] status_check len=- GET /api/getstatus HTTP/1.1 mi=- ec=- 200 221
[30/Jan/2015:10:12:57 +0000] 12.30.30.182 xff=- reqId=[-] status_check len=- GET /api/getstatus HTTP/1.1 mi=- ec=- 200 218

Every line in this log file has a timestamp in the first field and a response time in the last field. Is there a way in awk to compute the average response time over specific time intervals? For example, calculate the average response time for every five minutes, based on the timestamps in the log file.

Or is there a better alternative to awk for this? Please suggest one.

Update:

I tried the following approach, but it is static and only gives the average for a single time interval:

$ grep "30/Jan/2015:10:1[0-4]" mylog.log | awk '{resp+=$NF;cnt++;}END{print "Avg:"int(resp/cnt)}'

But I need to do this for every 5-minute interval across the whole file. Even if I loop over the command, how can I pass the dates to it dynamically? The log file is different each time, and so are the dates in it.

Hmm. GNU date doesn't like your date format, so I suppose we'll have to parse it ourselves. I'm thinking along these lines (this requires gawk for mktime):

# returns the seconds since epoch that stamp represents. This will be
# the first field in the line, with [] and everything. It's rather
# rudimentary:
function parse_timestamp(stamp) {
  # Split stamp into tokens delimited by [, ], /, : or space
  split(stamp, c, "[][/: ]")
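  # For the sample log this yields c[2]="30", c[3]="Jan", c[4]="2015",
  # c[5]="10", c[6]="10", c[7]="30"; any extra tokens (such as the
  # "+0000" offset) are simply ignored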

  # reassemble (using the lookup table for the months from below) in a
  # format that mktime understands (then call mktime).
  return mktime(c[4] " " mnums[c[3]] " " c[2] " " c[5] " " c[6] " " c[7])
}

BEGIN {
  # parse_timestamp needs this lookup table.
  split("Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec", mnames)
  for(i = 1; i <= length(mnames); ++i) {
    mnums[mnames[i]] = i
  }
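  # now mnums["Jan"] == 1, mnums["Feb"] == 2, ..., mnums["Dec"] == 12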

  # time is a parameter supplied by you.
  start = parse_timestamp(time)
  end   = start + 300

  if(start == -1) {
    print "Warning: Could not parse timestamp \"" time "\""
  }
}

{
  # in each line: parse the timestamp (the first field of the record)
  curtime = parse_timestamp($1)
}

# if it lies in the interval you want, sum up the last field and increase
# the counter
curtime >= start && curtime < end {
  sum += $NF
  ++count
}

END {
  # and in the end, print the average.
  print "Avg: " (count == 0 ? "undef" : sum / count)
}

Put this into a file, say average.awk, then call

awk -v time='[30/Jan/2015:10:11:20 +0000]' -f average.awk foo.log
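
With the sample log above (and foo.log replaced by the actual file name), the four entries from 10:11:29 through 10:12:57 fall into the requested five-minute window, while the first two lines lie before 10:11:20, so this should print

Avg: 246.5

Since the time parameter and the log lines go through the same parser, the ignored +0000 offset does not affect which lines are selected.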

If you can be certain that the log file is sorted in ascending order of timestamps (which is probably the case), you can make this more efficient by replacing

curtime >= start && curtime < end {
  sum += $NF
  ++count
}

with

curtime >= end {
  exit
}

curtime >= start {
  sum += $NF
  ++count
}

This stops scanning as soon as the first log entry past the range you're looking for is found.

Addendum: Since the OP clarified that he wants summaries for all five-minute intervals in a sorted log file, the amended script is

#!/usr/bin/awk -f

function parse_timestamp(stamp) {
  split(stamp, c, "[][/: ]")
  return mktime(c[4] " " mnums[c[3]] " " c[2] " " c[5] " " c[6] " " c[7])
}

BEGIN {
  split("Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec", mnames)
  for(i = 1; i <= length(mnames); ++i) {
    mnums[mnames[i]] = i
  }
}

{
  # parse the timestamp of the current line
  curtime = parse_timestamp($1)
}

NR == 1 {
  # pull the start time from the first line
  start = curtime
  end   = start + 300
}

{
  # print the result and reset the counters whenever the current entry
  # is past the end of the interval; the while loop steps over gaps of
  # more than five minutes (empty intervals are reported as "undef")
  while(curtime >= end) {
    print "Avg: " (count == 0 ? "undef" : sum / count)
    sum   = 0
    count = 0
    end  += 300
  }
}

{
  sum += $NF
  ++count
}

END {
  # print once more at the very end for the last, unfinished interval.
  print "Avg: " (count == 0 ? "undef" : sum / count)
}
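
Call it on the sorted log file as before, for example

awk -f average.awk mylog.log

(use gawk explicitly if plain awk on your system is not GNU awk). All six sample lines fall into the first five-minute interval, which starts at 10:10:30, so for the sample log this should print a single line along the lines of

Avg: 278.667

If you also want to see which interval each average covers, a possible variation (a sketch using gawk's strftime; like the parser above, it ignores the timezone offset) is to label each line with the interval's start time. Only the reporting rules change; the rest of average.awk stays the same:

{
  # end - 300 is the start of the interval that is being closed
  while(curtime >= end) {
    print strftime("%d/%b/%Y:%H:%M:%S", end - 300) " Avg: " (count == 0 ? "undef" : sum / count)
    sum   = 0
    count = 0
    end  += 300
  }
}

END {
  # the last, unfinished interval started at end - 300 as well
  print strftime("%d/%b/%Y:%H:%M:%S", end - 300) " Avg: " (count == 0 ? "undef" : sum / count)
}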