How to sum up all other columns based on column 1?

Question

I have a sample CSV file like the one below (the real file has many more columns, numbered up to e.g. Sample100, and a few rows):

Genus,Sample1,Sample2,Sample3
Unclassified,0,1,44
Unclassified,0,0,392
Unclassified,0,0,0
Woeseia,0,0,76

I would like a summarized CSV file like the one below, in which all rows with the same entry in column 1 are summed up:

Genus,Sample1,Sample2,Sample3
Unclassified,0,1,436
Woeseia,0,0,76

I tried the following awk code, but it didn't work:

awk  -F "," 'function SP()  {n=split ($0, T); ID=$1}
         function PR()  {printf "%s", ID; for (i=2; i<=n; i++) printf "\t%s", T[i]; printf "\n"}

         NR==1          {SP();next}
         $1 != ID       {PR(); SP(); next}
                        {for (i=2; i<=NF; i++) T[i]+=$i}
         END            {PR()}
        ' Filename.csv

I also know I could do something like the following, but that is impractical when there are hundreds of columns. Any help would be appreciated.

awk -F "," ' NR==1 {print; next} NF {a[$1]+=$2; b[$1]+=$3; c[$1]+=$4; d[$1]+=$5; e[$1]+=$6; f[$1]++} END {for(i in a)print i, a[i], b[i], c[i], d[i], e[i], f[i]} ' Filename.csv

Answer 1

With your shown samples, please try the following awk program. You don't need to create that many arrays; one or two are enough here.

awk '
BEGIN { FS=OFS="," }
FNR==1{
  print
  next
}
{
  for(i=2;i<=NF;i++){
    arr1[$1]
    arr2[$1,i]+=$i
  }
}
END{
  for(i in arr1){
    printf("%s,",i)
    for(j=2;j<=NF;j++){
      printf("%s%s",arr2[i,j],j==NF?ORS:OFS)
    }
  }
}
'  Input_file

The output will be as follows:

Genus,Sample1,Sample2,Sample3
Unclassified,0,1,436
Woeseia,0,0,76

Explanation: a detailed, line-by-line explanation of the above code.

awk '                         ##Starting awk program from here.
BEGIN { FS=OFS="," }          ##In BEGIN section setting FS and OFS as comma here.
FNR==1{                       ##Checking if this is first line then do following.
  print                       ##Printing current line.
  next                        ##next will skip further statements from here.
}
{
  for(i=2;i<=NF;i++){         ##Running for loop from 2nd field to till NF here.
    arr1[$1]                  ##Creating arr1 array with index of 1st field.
    arr2[$1,i]+=$i            ##Adding the current field value to arr2, indexed by the 1st field and the current field number.
  }
}
END{                          ##Starting END block for this program from here.
  for(i in arr1){             ##Traversing through arr1 all elements here one by one.
    printf("%s,",i)           ##Printing its current index here.
    for(j=2;j<=NF;j++){       ##Running for loop from 2nd field to till NF here.
      printf("%s%s",arr2[i,j],j==NF?ORS:OFS) ##Printing value of arr2 with index of i and j, printing new line if its last field.
    }
  }
}
'  Input_file                 ##Mentioning Input_file here.
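
A note on ordering: awk does not specify the order in which `for (i in arr1)` visits the keys, so the groups may come out in a different order than they appear in the input. The following is a minimal sketch, not part of the original answer, of a variant that records the first-seen order of each key. The names `seen`, `order`, `sum` and `nf`, the temp file, and the reordered sample data are all assumptions made for illustration.

```shell
# Hypothetical sample file (the question's rows, reordered so that the
# first-seen order is visible), written to a temp path.
input=$(mktemp)
cat > "$input" <<'EOF'
Genus,Sample1,Sample2,Sample3
Woeseia,0,0,76
Unclassified,0,1,44
Unclassified,0,0,392
Unclassified,0,0,0
EOF

result=$(awk '
BEGIN { FS = OFS = "," }
FNR == 1 { print; nf = NF; next }      # print the header, remember the column count
NF {                                   # skip empty lines
  if (!($1 in seen)) {                 # first time this key appears:
    seen[$1]                           #   mark it as seen
    order[++n] = $1                    #   and record its position
  }
  for (i = 2; i <= NF; i++)
    sum[$1, i] += $i                   # accumulate per key and per column
}
END {
  for (k = 1; k <= n; k++) {           # walk the keys in first-seen order
    line = order[k]
    for (j = 2; j <= nf; j++)
      line = line OFS sum[order[k], j]
    print line
  }
}
' "$input")

printf '%s\n' "$result"
rm -f "$input"
```

Saving the column count from the header in `nf` also avoids relying on the value of `NF` inside the `END` block.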

Answer 2

Here's another awk:

awk -v FS=',' -v OFS=',' '
    NR == 1 {
        print
        next
    }
    {
        ids[$1]
        for (i = 2; i <= NF; i++)
            sums[i "," $1] += $i
    }
    END {
        for (id in ids) {
            out = id
            for (i = 2; i <= NF; i++)
                out = out OFS sums[i "," id]
            print out
        }
    }
' Filename.csv

Genus,Sample1,Sample2,Sample3
Unclassified,0,1,436
Woeseia,0,0,76
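
If the rows that share a first column are always adjacent, as in the sample, the sums can also be computed in a single streaming pass that keeps only the current group in memory. This is a sketch under that assumption (it will emit a key twice if its rows are not contiguous); the function name `emit`, the variables `id`, `tot`, `n` and `have`, and the temp file are hypothetical and not taken from the answers.

```shell
# Hypothetical copy of the question's sample, written to a temp path.
input=$(mktemp)
cat > "$input" <<'EOF'
Genus,Sample1,Sample2,Sample3
Unclassified,0,1,44
Unclassified,0,0,392
Unclassified,0,0,0
Woeseia,0,0,76
EOF

result=$(awk '
function emit(    i, line) {           # print the totals of the finished group
  line = id
  for (i = 2; i <= n; i++)
    line = line OFS tot[i]
  print line
}
BEGIN { FS = OFS = "," }
NR == 1 { print; next }                # pass the header through
!NF { next }                           # skip empty lines
have && $1 != id {                     # the key changed: flush the previous group
  emit()
  split("", tot)                       # portable way to clear an array
}
{
  id = $1; n = NF; have = 1
  for (i = 2; i <= NF; i++)
    tot[i] += $i                       # running totals for the current group
}
END { if (have) emit() }               # flush the last group
' "$input")

printf '%s\n' "$result"
rm -f "$input"
```

Because groups are printed as soon as they end, the output keeps the input order.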

You can also use a CSV-aware program that provides data analysis tools.
Here's an example with Miller, which is available as a stand-alone executable:

IFS='' read -r csv_header < Filename.csv

mlr --csv \
    stats1 -a sum -g "${csv_header%%,*}" -f "${csv_header#*,}" \
    then rename -r '(.*)_sum,\1' \
    Filename.csv

Genus,Sample1,Sample2,Sample3
Unclassified,0,1,436
Woeseia,0,0,76
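
The two parameter expansions in the Miller command split the header line into the grouping field and the list of fields to sum. A small sketch of what they evaluate to for the sample header (the variable names `group_field` and `value_fields` are introduced here only for illustration):

```shell
# Header line of the sample file, hard-coded so the sketch is self-contained.
csv_header='Genus,Sample1,Sample2,Sample3'

# %%,* removes the longest suffix starting at a comma: only the first field remains.
group_field=${csv_header%%,*}

# #*, removes the shortest prefix ending in a comma: everything after the first field remains.
value_fields=${csv_header#*,}

printf '%s\n' "$group_field" "$value_fields"
```

For the sample this prints `Genus` and then `Sample1,Sample2,Sample3`, which are the values passed to `-g` and `-f`.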
