How can I do this in Awk? I have several files with two columns and identical values in the first column. How do I average the values in the second column, row by row?
I am an inexperienced Awk user, but I know Awk is an efficient choice for processing many files. I would appreciate it if someone could point me in the right direction.
I have a directory named parent. Inside it are more directories named 1, 2, 3, 4, .... In each of those directories is a directory named angles, and inside each angles directory is a file named angle_A_B_C.dat, like this:
parent
    1
        angles
            angle_A_B_C.dat
    2
        angles
            angle_A_B_C.dat
    3
        angles
            angle_A_B_C.dat
    4
        angles
            angle_A_B_C.dat
    ...
The angle_A_B_C.dat files all have the same number of rows (91) and identical first columns; only the values in the second column differ. Here is an example of one angle_A_B_C.dat file:
# Deg[°] Angle[A ,B ,C ]
1.000 0.0000000000
3.000 0.0000000000
5.000 0.0000000000
7.000 0.0000000000
9.000 0.0000000000
11.000 0.0000000000
13.000 0.0000000000
15.000 0.0000000000
17.000 0.0000000000
19.000 0.0000000000
21.000 0.0000000000
23.000 0.0000000000
25.000 0.0000000000
27.000 0.0000000000
29.000 0.0000000000
31.000 0.0000000000
33.000 0.0000000000
35.000 0.0000000000
37.000 0.0000000000
39.000 0.0000000000
41.000 0.0000000000
43.000 0.0000000000
45.000 0.0000000000
47.000 0.0000000000
49.000 0.0000000000
51.000 0.0000000000
53.000 0.0000000000
55.000 0.0000000000
57.000 0.0000000000
59.000 0.0000000000
61.000 0.0000000000
63.000 0.0000000000
65.000 0.0000000000
67.000 1.0309278351
69.000 1.0309278351
71.000 2.0618556701
73.000 1.0309278351
75.000 2.0618556701
77.000 0.0000000000
79.000 0.0000000000
81.000 4.1237113402
83.000 2.0618556701
85.000 4.1237113402
87.000 2.0618556701
89.000 2.0618556701
91.000 5.1546391753
93.000 3.0927835052
95.000 1.0309278351
97.000 3.0927835052
99.000 1.0309278351
101.000 2.0618556701
103.000 9.2783505155
105.000 7.2164948454
107.000 4.1237113402
109.000 5.1546391753
111.000 5.1546391753
113.000 3.0927835052
115.000 2.0618556701
117.000 9.2783505155
119.000 0.0000000000
121.000 3.0927835052
123.000 3.0927835052
125.000 2.0618556701
127.000 0.0000000000
129.000 1.0309278351
131.000 1.0309278351
133.000 2.0618556701
135.000 1.0309278351
137.000 0.0000000000
139.000 1.0309278351
141.000 0.0000000000
143.000 0.0000000000
145.000 1.0309278351
147.000 0.0000000000
149.000 0.0000000000
151.000 1.0309278351
153.000 0.0000000000
155.000 0.0000000000
157.000 1.0309278351
159.000 0.0000000000
161.000 0.0000000000
163.000 0.0000000000
165.000 0.0000000000
167.000 0.0000000000
169.000 0.0000000000
171.000 0.0000000000
173.000 0.0000000000
175.000 0.0000000000
177.000 0.0000000000
179.000 0.0000000000
I want to generate a file named anglesSummary.txt whose first column is identical to the first column in the example above (and in every angle_A_B_C.dat file), and where the second column of each row is the average of that same row across all the files.
I vaguely remember how to average an entire column across different files in different directories, but I cannot figure out how to do it one row at a time. Is this possible?
Here is where I am now; the question marks show where I think I am stuck.
cd parent
find . -name angle_A_B_C.dat -exec grep "Angle[A ,B ,C ]" {} + > anglesSummary.txt
my_output="$(awk '{ total += ??? } END { print total/NR }' anglesSummary.txt)"
echo "Average: $my_output" >> anglesSummary.txt
Update (in response to markp-fuso's comment)
What I want (see the comment on the row whose column 1 value is 15.000):
# Deg[°] Angle[A ,B ,C ]
1.000 0.0000000000
3.000 0.0000000000
5.000 0.0000000000
7.000 0.0000000000
9.000 0.0000000000
11.000 0.0000000000
13.000 0.0000000000
15.000 1.2222220000 # <--Each row in column 2 is the average of the value in the corresponding row, column 2 in all files. So this particular value (1.222222) is the average of the values in all files where the column 1 value is 15.000.
17.000 0.0000000000
19.000 0.0000000000
21.000 0.0000000000
23.000 0.0000000000
25.000 0.0000000000
27.000 0.0000000000
29.000 0.0000000000
31.000 0.0000000000
33.000 0.0000000000
35.000 0.0000000000
... (truncated)
What I currently get from my code is the average of the whole of column 2 within each individual angle_A_B_C.dat file.
If anything is still unclear, please say so and I will rewrite it. Thank you.
Sample input:
$ head */*/angle*
==> 1/angles/angle_A_B_C.dat <==
# Deg[°] Angle[A ,B ,C ]
1.000 0.3450000000
3.000 0.4560000000
5.000 0.7890000000
7.000 10.0000000000
9.000 20.0000000000
11.000 30.0000000000
13.000 40.0000000000
==> 2/angles/angle_A_B_C.dat <==
# Deg[°] Angle[A ,B ,C ]
1.000 7.3450000000
3.000 8.4560000000
5.000 9.7890000000
7.000 17.0000000000
9.000 27.0000000000
11.000 37.0000000000
13.000 47.0000000000
==> 3/angles/angle_A_B_C.dat <==
# Deg[°] Angle[A ,B ,C ]
1.000 0.9876000000
3.000 0.5432000000
5.000 0.2344560000
7.000 3.0000000000
9.000 4.0000000000
11.000 5.0000000000
13.000 6.0000000000
One GNU awk idea:
find . -name angle_A_B_C.dat -type f -exec awk '
NR==1 { printf "%s\t%s\n","# Deg[°]", "Angle[A ,B ,C ]" } # 1st record of 1st file => print header
FNR==1 { filecount++; next } # 1st record of each new file => increment file counter; skip to next input line
NF==2 { sums[$1]+=$2 }                                       # sum up angles, use 1st column as array index
END { if (filecount>0) { # eliminate "divide by zero" error if no files found
PROCINFO["sorted_in"]="@ind_num_asc" # sort array by numeric index in ascending order
for (i in sums) # loop through array indices, printing index and average
printf "%.3f\t%.10f\n", i, sums[i]/filecount
}
}
' {} +
Notes:
- GNU awk is required for PROCINFO["sorted_in"], which lets the output be generated in ascending # Deg[°] order (otherwise the output can be piped to sort to ensure the desired ordering; see the sketch below)
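For non-GNU awks, a minimal sketch of that pipe-to-sort variant, assuming a POSIX shell; the header is printed outside the pipeline so that sort -n only ever sees data rows:

{
  printf '%s\t%s\n' '# Deg[°]' 'Angle[A ,B ,C ]'
  find . -name angle_A_B_C.dat -type f -exec awk '
      FNR==1 { filecount++; next }        # skip each file header; count files
      NF==2  { sums[$1]+=$2 }             # index by column 1; ordering is left to sort
      END    { for (i in sums) printf "%.3f\t%.10f\n", i, sums[i]/filecount }
  ' {} + | sort -n
} > anglesSummary.txt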
Assuming the input rows are already sorted:
find . -name angle_A_B_C.dat -type f -exec awk '
NR==1 { printf "%s\t%s\n","# Deg[°]", "Angle[A ,B ,C ]" }
FNR==1 { filecount++; next }
NF==2 { col1[FNR]=$1; sums[FNR]+=$2 }
END { if (filecount>0)
for (i=2;i<=FNR;i++)
printf "%.3f\t%.10f\n", col1[i], sums[i]/filecount
}
' {} +
Notes:
- should run in all versions of awk (i.e. GNU awk is not required)
- based on jhnc's comment (if jhnc wants to post a separate answer, I can delete this part of this answer)
Both of these generate:
# Deg[°] Angle[A ,B ,C ]
1.000 2.8925333333
3.000 3.1517333333
5.000 3.6041520000
7.000 10.0000000000
9.000 17.0000000000
11.000 24.0000000000
13.000 31.0000000000
Notes:
- the output format can be adjusted to the OP's liking by modifying the printf format strings; see the sketch below
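As a minimal sketch of that, assuming the second (portable) script: the report is redirected into the anglesSummary.txt file the question asks for, and the format string is tweaked to print 4 decimal places instead of 10:

cd parent
find . -name angle_A_B_C.dat -type f -exec awk '
    NR==1  { printf "%s\t%s\n", "# Deg[°]", "Angle[A ,B ,C ]" }
    FNR==1 { filecount++; next }
    NF==2  { col1[FNR]=$1; sums[FNR]+=$2 }
    END    { if (filecount>0)
                 for (i=2; i<=FNR; i++)
                     printf "%.3f\t%.4f\n", col1[i], sums[i]/filecount    # 4 decimal places
           }
' {} + > anglesSummary.txt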
Tested against a 212 MB synthetic version of the concatenated input files, representing a bit over 76,000 individual files, with the whole report completing in 2.23 seconds.
This solution aims to minimize rounding error by storing intermediate values as unsigned integers no larger than 2^53 rather than as double-precision floats, using string operations, the most expensive part, to prevent unwanted pre-conversion to floating point.
It also takes a brute-force approach to work around the lack of built-in sorting in some awk variants. On the plus side, the rows in the input files can be in any jumbled order, which is fine. (A de-obfuscated sketch of the scaled-integer idea appears after the sample output below.)
pvE0 < test.txt \
\
| mawk2 '
BEGIN {____=\
_^=_=(_+=++_+_)^--_+--_;_____=!(_=!_)
} {
if (/Deg/) { ___[$_]++ } else {
___[$_____]=$_
__[$_____]+=____*int(______=$NF)+\
substr(______,index(______,".")+_____)
} } END {
for(_ in ___) { if(index(_,"Deg")) {
______=___[_]
print _
break } }
_________=match(________=sprintf("%.*s",\
index(____,!____),____),/$/)
for(_=_~_;_-________<=(_________+________)*\
(_________+________);_++) {
if((_______=sprintf("%.*f",_________,_)) in __) {
_____=___[_______]
sub("[ \t]*[[:digit:]]+[.][[:digit:]]+$",
sprintf("%c%*.*f",--________,++________+_________,
________,__[_______]/______/____),_____)
print _____ } } }'
in0: 212MiB 0:00:02 [97.1MiB/s] [97.1MiB/s]
[=============================>] 100%
# Deg[°] Angle[A ,B ,C ]
1.000 7.5148221018
3.000 7.4967176419
5.000 7.5160005498
7.000 7.4793862628
9.000 7.5123479596
11.000 7.4791082935
13.000 7.4858962001
15.000 7.4941294148
17.000 7.5150168021
19.000 7.5067556155
21.000 7.5146136198
23.000 7.4792701433
25.000 7.4801382861
27.000 7.5026906476
29.000 7.4802267331
31.000 7.5216754387
33.000 7.4892379481
35.000 7.4905661773
37.000 7.4759338641
39.000 7.5130521094
41.000 7.4923359448
43.000 7.4680275394
45.000 7.5131741424
47.000 7.5022641880
49.000 7.4865545672
51.000 7.5280509182
53.000 7.4982720538
55.000 7.5082048446
57.000 7.5034726853
59.000 7.4978429619
61.000 7.5055566807
63.000 7.5108651984
65.000 7.5211276535
67.000 7.4875763176
69.000 7.4993074644
71.000 7.5124084003
73.000 7.5321662989
75.000 7.4859560680
77.000 7.4700932217
79.000 7.5121024268
81.000 7.5180572994
83.000 7.4938736294
85.000 7.5073566749
87.000 7.4917927829
89.000 7.5142626391
91.000 7.5223228551
93.000 7.5168014947
95.000 7.4757822101
97.000 7.5141328593
99.000 7.4863544344
101.000 7.5036731671
103.000 7.5200733708
105.000 7.4964541138
107.000 7.5050440318
109.000 7.4890049434
111.000 7.5045965882
113.000 7.5119613957
115.000 7.5050971735
117.000 7.4983417123
119.000 7.4867090870
121.000 7.5047947039
123.000 7.4837043078
125.000 7.4995212486
127.000 7.5111280706
129.000 7.5092771858
131.000 7.4977679060
133.000 7.5278372066
135.000 7.4794945181
137.000 7.5152681775
139.000 7.4954245649
141.000 7.5099441844
143.000 7.4945221883
145.000 7.4860083947
147.000 7.4848234307
149.000 7.4932545468
151.000 7.4937942058
153.000 7.4657789265
155.000 7.4947049961
157.000 7.5113607827
159.000 7.4978364461
161.000 7.5031970850
163.000 7.5017955073
165.000 7.5187543102
167.000 7.5064268609
169.000 7.4985988429
171.000 7.5438396243
173.000 7.4917706435
175.000 7.4589904950
177.000 7.5072644989
179.000 7.5176241959
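For reference, a minimal de-obfuscated sketch of the scaled-integer idea used above (this is not the author's code: the scale factor 10^10 is an assumption matching the 10 decimal places in the data, and values are assumed nonnegative):

# Split each value at the decimal point and accumulate it as an exact
# integer count of 1e-10 units; sums stay exact while they remain < 2^53.
NF==2 {
    v = $2
    scaled = int(v) * 10000000000 + substr(v, index(v, ".") + 1)
    sum[$1]  += scaled
    count[$1]++
}
END {
    for (k in sum)                   # unordered; pipe through sort -n if needed
        printf "%.3f\t%.10f\n", k, sum[k] / count[k] / 10000000000
}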