awk - count number of occurrences for a field in a line containing another specific field

Question

I have the following data structure:

apples    yellow
apples    yellow
apples    green
apples    green
apples    green

grapes    yellow
grapes    yellow
grapes    yellow
grapes    green

lemons    yellow
lemons    green
lemons    green

Important: I don't know in advance that my list contains apples, grapes and lemons. If I need to count how many times $2 is yellow, and then print $1 next to its yellow count, I can do this with GNU AWK:

awk '$2=="yellow" {yellowfruit[$1]++} END {for (fruit in yellowfruit) print fruit,yellowfruit[fruit]}'

...and get the expected result:

grapes 3
lemons 1
apples 2

How can I add another column that counts the green occurrences for each fruit type? I can't do for (fruit in yellowfruit,greenfruit) or, as in bash, for (fruit in yellowfruit greenfruit).

Answer 1

You can be more generic and handle any number of unknown color/fruit pairs like this:

awk '{if(NF==2){fruit[$2][$1]++}} END{for(color in fruit){for(type in fruit[color]){print color " " type " " fruit[color][type]}}}'

This gives the following output:

yellow lemons 1
yellow apples 2
yellow grapes 3
green lemons 2
green apples 3
green grapes 1
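
The one-liner above depends on arrays of arrays, which are a gawk extension (gawk 4.0 and later). As a minimal sketch for other awk implementations, the same long-format counts can be kept under a composite key, which POSIX awk joins with SUBSEP. The names count and part and the inline sample data are my own choices for illustration, and the output is piped through sort only to make the order predictable:

```shell
# Sample data from the question, blank separator lines included.
data=$(cat <<'EOF'
apples    yellow
apples    yellow
apples    green
apples    green
apples    green

grapes    yellow
grapes    yellow
grapes    yellow
grapes    green

lemons    yellow
lemons    green
lemons    green
EOF
)

# count[color, fruit] is a single array; awk joins the two subscripts with SUBSEP.
result=$(printf '%s\n' "$data" | awk '
NF == 2 { count[$2, $1]++ }          # skip blank lines, count each color/fruit pair
END {
    for (key in count) {
        split(key, part, SUBSEP)     # part[1] = color, part[2] = fruit
        print part[1], part[2], count[key]
    }
}' | sort)

printf '%s\n' "$result"
```

The for (key in count) loop visits keys in an unspecified order, which is why the listing in the answer is not sorted either.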

If you want something more matrix-like, you can add an extra array to keep track of the available colors and use printf instead of print:

awk '{ if(NF==2){fruit[$1][$2]++; colors[$2]=$2}} END{printf("type");for(color in colors){printf("\t%s",colors[color])};printf("\n"); for(type in fruit){printf("%s",type);for(color in fruit[type]){ printf("\t%d",fruit[type][color]) }printf("\n")}}'

This gives:

type    yellow  green
lemons  1       2
apples  2       3
grapes  3       1
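
One caveat with the matrix version: the header loops over colors while each row loops over fruit[type], so a fruit that lacks a color would get a short row, and the two loops are not guaranteed to visit the colors in the same order. A hedged sketch of one way around that is to remember fruits and colors in first-seen order and print a cell for every known color. It uses POSIX composite keys instead of gawk's nested arrays, and the names types, colors, seen_type and seen_color are assumptions of mine, not part of the answer:

```shell
# Sample data from the question, blank separator lines included.
data=$(cat <<'EOF'
apples    yellow
apples    yellow
apples    green
apples    green
apples    green

grapes    yellow
grapes    yellow
grapes    yellow
grapes    green

lemons    yellow
lemons    green
lemons    green
EOF
)

result=$(printf '%s\n' "$data" | awk '
NF == 2 {
    count[$1, $2]++
    # Record each fruit and each color once, in order of first appearance.
    if (!($1 in seen_type))  { seen_type[$1] = 1;  types[++nt]  = $1 }
    if (!($2 in seen_color)) { seen_color[$2] = 1; colors[++nc] = $2 }
}
END {
    printf "type"
    for (j = 1; j <= nc; j++) printf "\t%s", colors[j]
    printf "\n"
    for (i = 1; i <= nt; i++) {
        printf "%s", types[i]
        # A missing fruit/color pair is an empty string, which %d prints as 0.
        for (j = 1; j <= nc; j++) printf "\t%d", count[types[i], colors[j]]
        printf "\n"
    }
}')

printf '%s\n' "$result"
```

Because both loops walk the same numbered colors list, every row has one tab-separated cell per header column.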

That is a bit messy; if you don't care about the header, you can simplify it:

awk '{if(NF==2){fruit[$1][$2]++;}} END{for(type in fruit){printf("%s",type);for(color in fruit[type]){printf("\t%d",fruit[type][color]) }printf("\n")}}'

which gives:

lemons  1       2
apples  2       3
grapes  3       1

Answer 2

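I found my answer a while ago but never got around to posting it here. It needs only one for loop, and the conditional statements are clearer.

awk '{
all[$1]++
if ($2=="yellow") yellowfruit[$1]++
else if ($2=="green") greenfruit[$1]++} END {for (fruit in all) print fruit,yellowfruit[fruit],greenfruit[fruit]}'

Result: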

grapes 3 1
lemons 1 2
apples 2 3
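
Two details are worth guarding against in this approach: blank lines in the input end up in all[] under an empty key, and a fruit with no rows of a given color prints an empty column. A minimal sketch that adds an NF == 2 guard and forces numeric output with + 0 could look like this. The limes row is a hypothetical addition of mine, there only to show the zero, and sort is used just to make the order predictable:

```shell
# Sample data from the question plus one hypothetical row (limes) with no yellow entry.
data=$(cat <<'EOF'
apples    yellow
apples    yellow
apples    green
apples    green
apples    green

grapes    yellow
grapes    yellow
grapes    yellow
grapes    green

lemons    yellow
lemons    green
lemons    green

limes     green
EOF
)

result=$(printf '%s\n' "$data" | awk '
NF == 2 {                                   # ignore blank or malformed lines
    all[$1]++
    if ($2 == "yellow")     yellow[$1]++
    else if ($2 == "green") green[$1]++
}
END {
    # Adding 0 turns an unset counter into 0 instead of an empty string.
    for (fruit in all) print fruit, yellow[fruit] + 0, green[fruit] + 0
}' | sort)

printf '%s\n' "$result"
```

This still hardcodes the two colors, so it fits the question as asked, while the approach in Answer 1 is the one to reach for when the set of colors is unknown.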
