Grouping related rows of data into a single column in Linux
I have a CSV file that is generated automatically every day, with output similar to the following example:
"N","3.5",3,"Bob","10/29/17"
"Y","4.5",5,"Bob","10/11/18"
"Y","5",6,"Bob","10/28/18"
"Y","3",1,"Jim",
"N","4",2,"Jim","09/29/17"
"N","2.5",4,"Joe","01/26/18"
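I need to transform the text so that it is grouped by person (the fourth column), with all of that person's records on a single row, repeating the contents of columns 1, 2, 3, 5 in the same order. Some cells may be missing data, but they must be kept in the sequence so that the columns line up. So the output I need would look like this: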
"Bob","N","3.5",3,"10/29/17","Y","4.5",5,"10/11/18","Y","5",6,"10/28/18"
"Jim","Y","3",1,,"N","4",2,"09/29/17"
"Joe","N","2.5",4,"01/26/18"
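I'm open to using sed, awk, or pretty much any standard Linux command to get this done. I've been trying awk, and although I've come close, I can't figure out how to finish it.
Here is the command that came closest. It lists the header and the names, but none of the other data: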
awk -F"," 'NR==1; NR>1 {a[$4]=a[$4] ? i : ""} END {for (i in a) {print i}}' test2.csv
You need a bit more code:
$ awk 'BEGIN {FS=OFS=","}
       {k=$4; $4=$5; NF--; a[k]=(k in a?a[k] FS $0:$0)}
       END {for(k in a) print k,a[k]}' file
"Bob","N","3.5",3,"10/29/17" ,"Y","4.5",5,"10/11/18" ,"Y","5",6,"10/28/18"
"Jim","Y","3",1, ,"N","4",2,"09/29/17"
"Joe","N","2.5",4,"01/26/18"
Note that the NF-- trick may not work in every awk implementation.
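For awk implementations where decrementing NF does not rebuild the record, one possible workaround is to assemble each record from its fields explicitly. The sketch below also records the order in which names first appear, since the iteration order of `for (k in a)` is unspecified in awk. The function name `group_rows` and the array names are illustrative choices, not part of the original answer, and the sketch assumes no field contains an embedded comma.

```shell
# group_rows: hypothetical helper; reads comma-separated rows on stdin
# and prints one line per distinct value of field 4.
group_rows() {
  awk 'BEGIN { FS = OFS = "," }
  {
    key = $4
    # Build the record from fields 1, 2, 3, 5 instead of relying on NF--.
    rec = $1 OFS $2 OFS $3 OFS $5
    if (key in rows) {
      rows[key] = rows[key] OFS rec
    } else {
      rows[key] = rec
      order[++n] = key   # remember the order in which keys first appear
    }
  }
  END { for (i = 1; i <= n; i++) print order[i], rows[order[i]] }'
}
```

It could be called as `group_rows < test2.csv`. An empty fifth field is kept as an empty cell, so the columns stay aligned.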
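Could you please try the following as well. It reads Input_file twice and produces output in the same order in which the 4th column appears in Input_file.
awk '
BEGIN{
  FS=OFS=","
}
FNR==NR{
  a[$4]=a[$4]?a[$4] OFS $1 OFS $2 OFS $3 OFS $5:$4 OFS $1 OFS $2 OFS $3 OFS $5
  next
}
a[$4]{
  print a[$4]
  delete a[$4]
}
' Input_file Input_file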
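If there is any chance that a CSV value contains a comma, a "CSV-aware" tool is recommended for a solution that is reliable yet straightforward.
One approach is to use one of the many readily available csv2tsv command-line tools. A variety of elegant solutions then become possible. For example, the CSV can be piped through csv2tsv, awk, and tsv2csv.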
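As a rough illustration of the awk stage in such a pipeline, the sketch below groups tab-separated rows by the fourth field. The function name `group_tsv` is hypothetical, and the sketch assumes the chosen csv2tsv tool has already removed the CSV quoting and emits one record per line.

```shell
# group_tsv: hypothetical helper; reads tab-separated rows on stdin
# and prints one tab-separated line per distinct value of field 4.
group_tsv() {
  awk 'BEGIN { FS = OFS = "\t" }
  {
    key = $4
    rec = $1 OFS $2 OFS $3 OFS $5   # fields 1, 2, 3, 5 in that order
    if (key in rows) {
      rows[key] = rows[key] OFS rec
    } else {
      rows[key] = rec
      order[++n] = key              # keep first-appearance order
    }
  }
  END { for (i = 1; i <= n; i++) print order[i], rows[order[i]] }'
}
```

The full pipeline might then look like `csv2tsv < input.csv | group_tsv | tsv2csv`, assuming the installed tsv2csv reads standard input; the exact invocation depends on which tools are in use.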
Here is another solution, using csv2tsv and jq:
csv2tsv < input.csv | jq -Rrn '
[inputs | split("\t")]
| group_by(.[3])[]
| sort_by(.[2])
| [.[0][3]] + ( map( del(.[3])) | add)
| @csv
'
This produces:
"Bob","N","3.5","3","10/29/17 ","Y","4.5","5","10/11/18 ","Y","5","6","10/28/18 "
"Jim","Y","3","1"," ","N","4","2","09/29/17 "
"Joe","N","2.5","4","01/26/18"
Trimming the excess spaces is left as an exercise :-)
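One way to approach that exercise: judging from the outputs shown in this thread, the stray spaces seem to come from trailing whitespace at the end of each input line. If that inference holds, stripping it before any other processing should be enough. The function name `trim_trailing` is a made-up helper for illustration.

```shell
# trim_trailing: hypothetical helper; removes whitespace at the end of
# every line read from stdin (spaces, tabs, and carriage returns).
trim_trailing() {
  sed 's/[[:space:]]*$//'
}
```

It would sit at the front of a pipeline, for example `trim_trailing < input.csv | csv2tsv | ...`. Whitespace inside quoted fields is left untouched.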