How can I do this in Awk? I have several files, each with two columns and identical values in the first column. How do I average the values in the second column, row by row, across the files?

I'm an inexperienced Awk user, but I know Awk is an effective choice for processing many files. I'd be grateful if someone could point me in the right direction.

I have a directory named parent. Inside it are more directories named 1, 2, 3, 4, and so on. Each of those contains a directory named angles, and inside angles there is a file named angle_A_B_C.dat, like this:

parent
  1
     angles
       angle_A_B_C.dat
  2
     angles
       angle_A_B_C.dat
  3
     angles
       angle_A_B_C.dat
  4
     angles
       angle_A_B_C.dat
  ...

The angle_A_B_C.dat files all have the same number of rows (91) and identical first columns; only the values in the second column differ. Here is an example of one angle_A_B_C.dat file:

# Deg[°]         Angle[A ,B ,C ] 
     1.000        0.0000000000
     3.000        0.0000000000
     5.000        0.0000000000
     7.000        0.0000000000
     9.000        0.0000000000
    11.000        0.0000000000
    13.000        0.0000000000
    15.000        0.0000000000
    17.000        0.0000000000
    19.000        0.0000000000
    21.000        0.0000000000
    23.000        0.0000000000
    25.000        0.0000000000
    27.000        0.0000000000
    29.000        0.0000000000
    31.000        0.0000000000
    33.000        0.0000000000
    35.000        0.0000000000
    37.000        0.0000000000
    39.000        0.0000000000
    41.000        0.0000000000
    43.000        0.0000000000
    45.000        0.0000000000
    47.000        0.0000000000
    49.000        0.0000000000
    51.000        0.0000000000
    53.000        0.0000000000
    55.000        0.0000000000
    57.000        0.0000000000
    59.000        0.0000000000
    61.000        0.0000000000
    63.000        0.0000000000
    65.000        0.0000000000
    67.000        1.0309278351
    69.000        1.0309278351
    71.000        2.0618556701
    73.000        1.0309278351
    75.000        2.0618556701
    77.000        0.0000000000
    79.000        0.0000000000
    81.000        4.1237113402
    83.000        2.0618556701
    85.000        4.1237113402
    87.000        2.0618556701
    89.000        2.0618556701
    91.000        5.1546391753
    93.000        3.0927835052
    95.000        1.0309278351
    97.000        3.0927835052
    99.000        1.0309278351
   101.000        2.0618556701
   103.000        9.2783505155
   105.000        7.2164948454
   107.000        4.1237113402
   109.000        5.1546391753
   111.000        5.1546391753
   113.000        3.0927835052
   115.000        2.0618556701
   117.000        9.2783505155
   119.000        0.0000000000
   121.000        3.0927835052
   123.000        3.0927835052
   125.000        2.0618556701
   127.000        0.0000000000
   129.000        1.0309278351
   131.000        1.0309278351
   133.000        2.0618556701
   135.000        1.0309278351
   137.000        0.0000000000
   139.000        1.0309278351
   141.000        0.0000000000
   143.000        0.0000000000
   145.000        1.0309278351
   147.000        0.0000000000
   149.000        0.0000000000
   151.000        1.0309278351
   153.000        0.0000000000
   155.000        0.0000000000
   157.000        1.0309278351
   159.000        0.0000000000
   161.000        0.0000000000
   163.000        0.0000000000
   165.000        0.0000000000
   167.000        0.0000000000
   169.000        0.0000000000
   171.000        0.0000000000
   173.000        0.0000000000
   175.000        0.0000000000
   177.000        0.0000000000
   179.000        0.0000000000

I want to generate a file named anglesSummary.txt whose first column is the same as the first column in the example above (and in every angle_A_B_C.dat file), and where each row's second column is the average of that same row's value across all the files.

I vaguely remember how to average an entire column across files in different directories, but I can't work out how to process just one row at a time. Is this possible?

Here is where I am now; the question marks show where I think I'm stuck.

cd parent
find . -name angle_A_B_C.dat -exec grep "Angle[A ,B ,C ]" {} + > anglesSummary.txt
my_output="$(awk '{ total += ??? } END { print total/NR }' anglesSummary.txt)"
echo "Average: $my_output" >> anglesSummary.txt

Update (in response to markp-fuso's comment)

What I want (see the comment on the row where the column 1 value is 15.000):

# Deg[°]         Angle[A ,B ,C ] 
     1.000        0.0000000000
     3.000        0.0000000000
     5.000        0.0000000000
     7.000        0.0000000000
     9.000        0.0000000000
    11.000        0.0000000000
    13.000        0.0000000000
    15.000        1.2222220000 # <--Each row in column 2 is the average of the value in the corresponding row, column 2 in all files. So this particular value (1.222222) is the average of the values in all files where the column 1 value is 15.000.
    17.000        0.0000000000
    19.000        0.0000000000
    21.000        0.0000000000
    23.000        0.0000000000
    25.000        0.0000000000
    27.000        0.0000000000
    29.000        0.0000000000
    31.000        0.0000000000
    33.000        0.0000000000
    35.000        0.0000000000
    ... (truncated)

What my code currently gives me is the average of column 2 within each individual angle_A_B_C.dat file.

If anything is still unclear, just say so and I'll rewrite it. Thank you.

Sample input:

$ head */*/angle*
==> 1/angles/angle_A_B_C.dat <==
# Deg[°]         Angle[A ,B ,C ]
     1.000        0.3450000000
     3.000        0.4560000000
     5.000        0.7890000000
     7.000        10.0000000000
     9.000        20.0000000000
    11.000        30.0000000000
    13.000        40.0000000000

==> 2/angles/angle_A_B_C.dat <==
# Deg[°]         Angle[A ,B ,C ]
     1.000        7.3450000000
     3.000        8.4560000000
     5.000        9.7890000000
     7.000        17.0000000000
     9.000        27.0000000000
    11.000        37.0000000000
    13.000        47.0000000000

==> 3/angles/angle_A_B_C.dat <==
# Deg[°]         Angle[A ,B ,C ]
     1.000        0.9876000000
     3.000        0.5432000000
     5.000        0.2344560000
     7.000        3.0000000000
     9.000        4.0000000000
    11.000        5.0000000000
    13.000        6.0000000000

One GNU awk idea:

find . -name angle_A_B_C.dat -type f -exec awk '
NR==1   { printf "%s\t%s\n","# Deg[°]", "Angle[A ,B ,C ]" }   # 1st record of 1st file => print header
FNR==1  { filecount++; next }                                 # 1st record of each new file => increment file counter; skip to next input line
NF==2   { sums[$1]+=$2 }                                      # sum up angles, use 1st column as array index
END     { if (filecount>0) {                                  # eliminate "divide by zero" error if no files found
              PROCINFO["sorted_in"]="@ind_num_asc"            # sort array by numeric index in ascending order
              for (i in sums)                                 # loop through array indices, printing index and average
                  printf "%.3f\t%.10f\n", i, sums[i]/filecount
          }
        }
' {} +

Notes:

  • GNU awk is required for PROCINFO["sorted_in"], which lets the output be generated in ascending # Deg[°] order (otherwise the output can be piped to sort to ensure the desired ordering; see the sketch below)
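
For awks without PROCINFO["sorted_in"], here is a minimal sketch of that sort fallback (same accumulation logic as above; the header is printed separately so sort doesn't move it):

printf '%s\t%s\n' '# Deg[°]' 'Angle[A ,B ,C ]' > anglesSummary.txt
find . -name angle_A_B_C.dat -type f -exec awk '
FNR==1  { filecount++; next }                 # skip each file's header line
NF==2   { sums[$1]+=$2 }                      # accumulate totals keyed on column 1
END     { if (filecount>0)
              for (i in sums)
                  printf "%.3f\t%.10f\n", i, sums[i]/filecount
        }
' {} + | sort -n >> anglesSummary.txt         # numeric sort restores the Deg order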

Assuming the input rows are already sorted:

find . -name angle_A_B_C.dat -type f -exec awk '
NR==1   { printf "%s\t%s\n","# Deg[°]", "Angle[A ,B ,C ]" }
FNR==1  { filecount++; next }
NF==2   { col1[FNR]=$1;  sums[FNR]+=$2 }
END     { if (filecount>0)
              for (i=2;i<=FNR;i++)
                  printf "%.3f\t%.10f\n", col1[i], sums[i]/filecount
        }
' {} +

Notes:

  • Should run in all versions of awk (i.e., GNU awk is not required)
  • Based on a comment from jhnc (if jhnc wants to post a separate answer, I can delete this part of this answer)

Both of these generate:

# Deg[°]         Angle[A ,B ,C ]
1.000   2.8925333333
3.000   3.1517333333
5.000   3.6041520000
7.000   10.0000000000
9.000   17.0000000000
11.000  24.0000000000
13.000  31.0000000000

Notes:

  • The output format can be adjusted to the OP's liking by modifying the printf format strings

Tested against a synthetic 212 MB concatenation of the input files, representing a bit over 76,000 individual files; the full report completed in 2.23 seconds.

This solution aims to minimize rounding error by storing intermediate values as unsigned integers no larger than 2^53 rather than as double-precision floats, at the cost of more expensive string operations that prevent unwanted pre-conversion to floating point.
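
A readable sketch of that scaled-integer idea, under the assumption (true of the samples shown) that every value has exactly 10 fractional digits; the variable names here are illustrative, not the answer's obfuscated ones:

awk '
NF==2 {
    val   = $2
    scale = 10000000000                     # 10^10: one unit per fractional digit
    # scale up the integer part and append the fraction digits as an integer,
    # so the running sum stays an exact integer (safe below 2^53)
    sums[$1]  += scale * int(val) + substr(val, index(val, ".") + 1)
    count[$1]++
}
END {
    for (k in sums)                         # ordering handled separately (see below)
        printf "%.3f\t%.10f\n", k, sums[k] / count[k] / scale
}' */angles/angle_A_B_C.dat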

It also uses a brute-force approach to get around the limitation of certain awks that lack built-in array sorting. On the upside, the rows in the input files can be in any jumbled order, and it still works fine.
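
A simplified sketch of that brute-force ordering trick: rather than sorting the array, enumerate candidate keys in ascending numeric order and keep only those that occurred in the input (the 1-180 bound is an assumption taken from the sample data; the actual answer below derives its bound differently):

awk '
NF==2 { sums[$1] += $2; count[$1]++ }
END {
    for (deg = 1; deg <= 180; deg++) {      # walk candidate angles in order
        key = sprintf("%.3f", deg)          # rebuild the column-1 string form
        if (key in sums)                    # keep only keys actually seen
            printf "%s\t%.10f\n", key, sums[key] / count[key]
    }
}' */angles/angle_A_B_C.dat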

pvE0 <  test.txt \
\
| mawk2 '
  BEGIN {____=\
         _^=_=(_+=++_+_)^--_+--_;_____=!(_=!_)
   } {
      if (/Deg/) { ___[$_]++ } else {
           ___[$_____]=$_
           __[$_____]+=____*int(______=$NF)+\
                         substr(______,index(______,".")+_____) 
   } } END { 
     for(_ in ___) { if(index(_,"Deg")) { 
         ______=___[_]
         print _
            break } }
      _________=match(________=sprintf("%.*s",\
                      index(____,!____),____),/$/) 
     for(_=_~_;_-________<=(_________+________)*\
                         (_________+________);_++) {
       if((_______=sprintf("%.*f",_________,_)) in __) { 
           _____=___[_______]
                 sub("[ \t]*[[:digit:]]+[.][[:digit:]]+$",        
          sprintf("%c%*.*f",--________,++________+_________,
                    ________,__[_______]/______/____),_____) 
        print _____ } } }' 



  in0:  212MiB 0:00:02 [97.1MiB/s] [97.1MiB/s] 
  [=============================>] 100%   
     
 # Deg[°]         Angle[A ,B ,C ] 
     1.000       7.5148221018
     3.000       7.4967176419
     5.000       7.5160005498
     7.000       7.4793862628
     9.000       7.5123479596
    11.000       7.4791082935
    13.000       7.4858962001
    15.000       7.4941294148
    17.000       7.5150168021
    19.000       7.5067556155
    21.000       7.5146136198
    23.000       7.4792701433
    25.000       7.4801382861
    27.000       7.5026906476
    29.000       7.4802267331
    31.000       7.5216754387
    33.000       7.4892379481
    35.000       7.4905661773
    37.000       7.4759338641
    39.000       7.5130521094
    41.000       7.4923359448
    43.000       7.4680275394
    45.000       7.5131741424
    47.000       7.5022641880
    49.000       7.4865545672
    51.000       7.5280509182
    53.000       7.4982720538
    55.000       7.5082048446
    57.000       7.5034726853
    59.000       7.4978429619
    61.000       7.5055566807
    63.000       7.5108651984
    65.000       7.5211276535
    67.000       7.4875763176
    69.000       7.4993074644
    71.000       7.5124084003
    73.000       7.5321662989
    75.000       7.4859560680
    77.000       7.4700932217
    79.000       7.5121024268
    81.000       7.5180572994
    83.000       7.4938736294
    85.000       7.5073566749
    87.000       7.4917927829
    89.000       7.5142626391
    91.000       7.5223228551
    93.000       7.5168014947
    95.000       7.4757822101
    97.000       7.5141328593
    99.000       7.4863544344
   101.000       7.5036731671
   103.000       7.5200733708
   105.000       7.4964541138
   107.000       7.5050440318
   109.000       7.4890049434
   111.000       7.5045965882
   113.000       7.5119613957
   115.000       7.5050971735
   117.000       7.4983417123
   119.000       7.4867090870
   121.000       7.5047947039
   123.000       7.4837043078
   125.000       7.4995212486
   127.000       7.5111280706
   129.000       7.5092771858
   131.000       7.4977679060
   133.000       7.5278372066
   135.000       7.4794945181
   137.000       7.5152681775
   139.000       7.4954245649
   141.000       7.5099441844
   143.000       7.4945221883
   145.000       7.4860083947
   147.000       7.4848234307
   149.000       7.4932545468
   151.000       7.4937942058
   153.000       7.4657789265
   155.000       7.4947049961
   157.000       7.5113607827
   159.000       7.4978364461
   161.000       7.5031970850
   163.000       7.5017955073
   165.000       7.5187543102
   167.000       7.5064268609
   169.000       7.4985988429
   171.000       7.5438396243
   173.000       7.4917706435
   175.000       7.4589904950
   177.000       7.5072644989
   179.000       7.5176241959