awk equivalents for tidyverse concepts (melt and spread)

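I have some text logs that I need to parse and format into CSV. I have a working R script, but it gets slow as the file size increases, and from what I have read this looks like a good candidate for a speed-up with awk (or another command-line tool?).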

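I haven't done much with awk, and the problem I'm having is translating the way I think about the processing in R into the way an awk script is written.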

Example truncated input data (Scrap.log):

; these are comment lines
; *******************************************************************************
; \C:\Users\Computer\Folder\Folder\Scrap.log

!!G 99999 % % % % % % % % CURRENT XYZ ABC STATE1 STATE2 
_START Header1 Header2 Header3 Header4 Header5 Header6 Header7 
10 12.23 1.91 6.63 1.68 50.03 0.50 13.97
11 11.32 1.94 6.64 1.94 50.12 0.58 15.10
12 12.96 2.15 6.57 2.12 55.60 0.62 16.24
13 11.43 2.18 6.60 2.36 50.89 0.68 17.39
14 14.91 2.32 6.64 2.59 56.09 0.73 18.41
15 13.16 2.38 6.53 2.85 51.62 0.81 19.30
16 15.02 2.50 6.67 3.05 56.22 0.85 20.12

!!G 99999 % % % % % % % % CURRENT XYZ ABC STATE1 STATE2 
_START Header8 Header9 Header10 Header11 Header12 Header13 Header14
10 22.03 24.41 15.01 51.44 44.28 16.57 11.52
11 21.05 24.62 15.62 51.23 45.42 16.47 11.98
12 20.11 24.64 16.38 52.16 46.59 16.54 12.42
13 24.13 24.93 17.23 52.34 47.72 16.51 12.88
14 27.17 24.95 18.06 52.79 48.72 16.45 13.30
15 22.87 25.04 19.27 53.01 49.50 16.47 13.63
16 23.08 25.22 20.12 53.75 50.64 16.55 14.03

Expected output (truncated):

HH1,HH2,HH3,HH4,HH5,HH6,HH7,HH8,HH9,HH10,HH11,HH12,HH13,HH14,START,HeaderName,Value
99999,CURRENT,XYZ,ABC,STATE1,STATE2,%,%,%,%,%,%,%,%,10,Header1,12.23
99999,CURRENT,XYZ,ABC,STATE1,STATE2,%,%,%,%,%,%,%,%,10,Header2,1.91
99999,CURRENT,XYZ,ABC,STATE1,STATE2,%,%,%,%,%,%,%,%,10,Header3,6.63
99999,CURRENT,XYZ,ABC,STATE1,STATE2,%,%,%,%,%,%,%,%,10,Header4,1.68
99999,CURRENT,XYZ,ABC,STATE1,STATE2,%,%,%,%,%,%,%,%,10,Header5,50.03
99999,CURRENT,XYZ,ABC,STATE1,STATE2,%,%,%,%,%,%,%,%,10,Header6,0.5
99999,CURRENT,XYZ,ABC,STATE1,STATE2,%,%,%,%,%,%,%,%,10,Header7,13.97
99999,CURRENT,XYZ,ABC,STATE1,STATE2,%,%,%,%,%,%,%,%,11,Header1,11.32
99999,CURRENT,XYZ,ABC,STATE1,STATE2,%,%,%,%,%,%,%,%,11,Header2,1.94
99999,CURRENT,XYZ,ABC,STATE1,STATE2,%,%,%,%,%,%,%,%,11,Header3,6.64
99999,CURRENT,XYZ,ABC,STATE1,STATE2,%,%,%,%,%,%,%,%,11,Header4,1.94
99999,CURRENT,XYZ,ABC,STATE1,STATE2,%,%,%,%,%,%,%,%,11,Header5,50.12
99999,CURRENT,XYZ,ABC,STATE1,STATE2,%,%,%,%,%,%,%,%,11,Header6,0.58
99999,CURRENT,XYZ,ABC,STATE1,STATE2,%,%,%,%,%,%,%,%,11,Header7,15.1
99999,CURRENT,XYZ,ABC,STATE1,STATE2,%,%,%,%,%,%,%,%,12,Header1,12.96
99999,CURRENT,XYZ,ABC,STATE1,STATE2,%,%,%,%,%,%,%,%,12,Header2,2.15
99999,CURRENT,XYZ,ABC,STATE1,STATE2,%,%,%,%,%,%,%,%,12,Header3,6.57
99999,CURRENT,XYZ,ABC,STATE1,STATE2,%,%,%,%,%,%,%,%,12,Header4,2.12
99999,CURRENT,XYZ,ABC,STATE1,STATE2,%,%,%,%,%,%,%,%,12,Header5,55.6
99999,CURRENT,XYZ,ABC,STATE1,STATE2,%,%,%,%,%,%,%,%,12,Header6,0.62
99999,CURRENT,XYZ,ABC,STATE1,STATE2,%,%,%,%,%,%,%,%,12,Header7,16.24
99999,CURRENT,XYZ,ABC,STATE1,STATE2,%,%,%,%,%,%,%,%,13,Header1,11.43
99999,CURRENT,XYZ,ABC,STATE1,STATE2,%,%,%,%,%,%,%,%,13,Header2,2.18
...

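General steps in my R script:

  1. Add a header row with new names at the top of the file
  2. Spread the top row (starting with !!G) to every row
  3. Melt the header columns (_START) from wide to long format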
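For readers less familiar with the tidyverse terms, step 3 (melt, i.e. wide to long) amounts to emitting one output row per value column. Below is a minimal awk sketch on a made-up two-row table; the `id`, `a`, `b` column names, the `variable`/`value` output names and the `melt` shell function are illustrative only and do not come from the log.

```shell
# melt: hypothetical helper that turns a whitespace-separated wide table
# (first line = column names, first column = row key) into long CSV rows.
melt() {
  awk '
    NR == 1 {
      # header line: remember each column name under its field index
      for (i = 2; i <= NF; i++) name[i] = $i
      print $1 ",variable,value"
      next
    }
    {
      # data line: emit one output row per value column
      for (i = 2; i <= NF; i++) print $1 "," name[i] "," $i
    }
  '
}

printf 'id a b\n10 1.5 2.5\n11 3.5 4.5\n' | melt
```

Each input row with two value columns becomes two output rows of the form key, column name, value. The same idea carries over to the log once the `_START` line plays the role of the header line.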

What I have working in awk so far:

  1. How to grab and print the header line

awk '/_START/ {header = $0; print header}' Scrap.log

  2. How to write a single line with the new header values

awk ' BEGIN{ ORS=" "; for (counter = 1; counter <= 14; counter++) print "HH",counter;}'

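  3. I understand that each block is separated by a newline and starts with !!G, so I could write a match on that. I'm not sure whether a split-apply-combine way of thinking works well in awk?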

awk '/!!G/,/\n/ {print}' Scrap.log

Alternatively, I tried setting the RS/FS parameters, like:

awk ' BEGIN{RS="\n";FS=" ";}/^!!G/{header=$0;print header}/[0-9]/{print }END{}' Scrap.log
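A note on the two attempts above: with the default RS, awk strips the newline from every record, so the range end `/\n/` can never match. One way to get block-at-a-time processing is awk's paragraph mode: setting `RS = ""` makes each blank-line-separated block one record, and `FS = "\n"` makes each line of the block one field. The sketch below only counts what it finds in each block; the `blocks` shell function and the shortened sample input are made up for illustration, and it assumes the separating lines are truly empty (no stray spaces).

```shell
# blocks: hypothetical helper; reads the log in paragraph mode so that each
# blank-line-separated block is one record and each line of it is one field.
blocks() {
  awk '
    BEGIN { RS = ""; FS = "\n" }
    $1 ~ /^!!G/ {
      # $1 is the !!G line, $2 the _START line, $3..$NF the data lines
      n = split($2, head, " ")      # head[1] is "_START"
      printf "block %d: %d value columns, %d data rows\n", ++b, n - 1, NF - 2
    }
  '
}

blocks <<'EOF'
; comment paragraph, skipped because it does not start with !!G

!!G 99999 % CURRENT
_START Header1 Header2
10 12.23 1.91
11 11.32 1.94

!!G 99999 % CURRENT
_START Header3 Header4
10 22.03 24.41
11 21.05 24.62
EOF
```

Inside the action, `split($k, cols, " ")` on any data line gives that line's columns, which is roughly the "apply" step of split-apply-combine. Line-at-a-time processing with saved state is usually simpler for this kind of log, though.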

Then I get stuck on iterating over the rows and fields to do the melt step, and on combining the captured groups correctly.

How do I combine all these pieces to get the CSV format?

I think the following:

awk '
BEGIN{
    # output the header line
    print "HH1,HH2,HH3,HH4,HH5,HH6,HH7,HH8,HH9,HH10,HH11,HH12,HH13,HH14,START,HeaderName,Value"
}
# ignore comment lines
/^;/{next}

/!!G/{
    valcnt = 1
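    # save and shuffle the values: the ID ($2) first, then the five
    # trailing state fields ($11-$15), then the eight % fields ($3-$10)
    val[valcnt++] = $2
    val[valcnt++] = $11
    val[valcnt++] = $12
    val[valcnt++] = $13
    val[valcnt++] = $14
    val[valcnt++] = $15
    val[valcnt++] = $3
    val[valcnt++] = $4
    val[valcnt++] = $5
    val[valcnt++] = $6
    val[valcnt++] = $7
    val[valcnt++] = $8
    val[valcnt++] = $9
    val[valcnt++] = $10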
    next
}
/_START /{
    # these are headers - save them to head, to be reused later
    for (i = 2; i <= NF; ++i) {
        # head is indexed by field number, so head[i] lines up with $i in the data rows
        head[i] = $i
    }
    next
}

# this function is not strictly needed, but it makes the code easier for me to think about
function output(firstval, header, value, \
        cur, i) {
    cur = valcnt
    val[cur++] = firstval
    val[cur++] = header
    val[cur++] = value
    # output val as csv
    for (i = 1; i < cur; ++i) {
        printf "%s%s", val[i], i != cur - 1 ? "," : "\n"
    }
}

/[0-9]+/{
    for (i = 2; i <= NF; ++i) {
        # add these 3 to all the other values and output them
        # ie. add first column, the header from header and the value
        output($1, head[i], $i)
    }
}

' Scrap.log

should output what you want. Tested on repl.
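One possible difference from the expected output in the question: awk prints fields as the text found in the log, so `0.50` stays `0.50`, whereas the expected CSV shows `0.5` (R drops trailing zeros). If that matters, adding 0 forces a numeric conversion, and awk then formats the number with its default `%.6g`. The sketch below shows only that conversion; the `normalize` shell function is illustrative. Because `%.6g` keeps just six significant digits, this assumes values as short as those in the sample.

```shell
# normalize: hypothetical helper that prints every field as a number,
# which drops trailing zeros under the default OFMT of "%.6g".
normalize() {
  awk '{ for (i = 1; i <= NF; i++) print $i + 0 }'
}

printf '0.50 15.10 55.60 12.23\n' | normalize
```

In the script above, this would mean passing `$i + 0` instead of `$i` as the last argument to `output`.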