Find the n most frequently occurring values in a column

I have a 7 GB CSV file, and I need to find the top n values that occur most often in column 5. My file has the following format:

"instance id","src IP","destination ip","63812","389"
"instance id","src IP","destination ip","389","65311"
"instance id","src IP","destination ip","63812","389"
"instance id","src IP","destination ip","88","49194"
"instance id","src IP","destination ip","12489","49194"
"instance id","src IP","destination ip","63812","389"
"instance id","src IP","destination ip","63812","389"

Using the command awk -F , ' !count[$5]++ {save[$5] = $0; next} count[$5] > 0 { print save[$5] " is " count[$5] " times"} ' example.csv I was able to get the following output:

"instance id","src IP","destination ip","63812","389" is 1 times
"instance id","src IP","destination ip","389","65311" is 1 times
"instance id","src IP","destination ip","63812","389" is 2 times
"instance id","src IP","destination ip","88","49194" is 1 times
"instance id","src IP","destination ip","12489","49194" is 2 times
"instance id","src IP","destination ip","63812","389" is 3 times
"instance id","src IP","destination ip","63812","389" is 4 times

But I can't work out how to get the top 50 rows whose column 5 entry is repeated most often.

Suppose n=2; then my output should look like this:

"instance id","src IP","destination ip","63812","389" is 4 times
"instance id","src IP","destination ip","12489","49194" is 2 times

If I were you, I would change the output to:

1 times : "instance id","src IP","destination ip","63812","389" is 1 times
1 times : "instance id","src IP","destination ip","389","65311" is 1 times
2 times : "instance id","src IP","destination ip","63812","389" is 2 times
1 times : "instance id","src IP","destination ip","88","49194" is 1 times
2 times : "instance id","src IP","destination ip","12489","49194" is 2 times
3 times : "instance id","src IP","destination ip","63812","389" is 3 times
4 times : "instance id","src IP","destination ip","63812","389" is 4 times

Then append | sort -n to your command.

(I have been trying to use sort -kx -n, but the combination of double quotes, commas and spaces confuses sort's -k switch.)
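
A minimal sketch of that idea, assuming only POSIX awk, sort and head, and using the sample rows from the question saved as example.csv. It deviates from the output above in one respect: instead of printing a running count on every line, it prints one line per distinct column 5 value with its final count, so that head returns n distinct values. The `result` variable is only there for illustration.

```shell
# Recreate the sample rows from the question in example.csv.
cat > example.csv <<'EOF'
"instance id","src IP","destination ip","63812","389"
"instance id","src IP","destination ip","389","65311"
"instance id","src IP","destination ip","63812","389"
"instance id","src IP","destination ip","88","49194"
"instance id","src IP","destination ip","12489","49194"
"instance id","src IP","destination ip","63812","389"
"instance id","src IP","destination ip","63812","389"
EOF

# Count column 5 in one pass and remember the last row seen for each value.
# The arrays hold one entry per distinct column-5 value, not one per row.
# The count goes first as a bare number, so "sort -rn" can order on it
# (largest first) without parsing quotes or commas; head keeps n = 2 lines.
result=$(
    awk -F, '
        { count[$5]++; last[$5] = $0 }
        END { for (v in count) print count[v] " times : " last[v] }
    ' example.csv |
    sort -rn |
    head -n 2
)
printf '%s\n' "$result"
```

Splitting on a bare comma is an assumption that holds for this sample, where no quoted field contains a comma.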

This approach uses GNU awk extensions:

Return the n most frequent values in column p:

awk 'BEGIN{n=50; p=5; PROCINFO["sorted_in"]="@val_num_desc"} {a[$p]++}
     END { for(i in a) { if (!n--) { break }; print i } }' file

Return the last record for each of the n most frequent values in column p:

awk 'BEGIN{n=50; p=5; PROCINFO["sorted_in"]="@val_num_desc"} {a[$p]++;b[$p]=$0}
     END { for(i in a) { if (!n--) { break }; print b[i] } }' file

Applying the above to the OP's expected output:

awk 'BEGIN{n=50; p=5; FS=","; PROCINFO["sorted_in"]="@val_num_desc"}
     {a[$p]++;b[$p]=$0}
     END { for(i in a) { if (!n--) { break }; print b[i],"is",a[i],"times" } }' file

Using any sort + awk + head:

$ sort -t, -k5,5 file |
    awk '
        BEGIN { FS=OFS="," }
        $5 != p5 { if (NR>1) print p0, cnt; p5=$5; p0=$0; cnt=0 }
        { cnt++ }
        END { print p0, cnt }' |
    sort -t, -k6,6rn |
    head -2
"instance id","src IP","destination ip","63812","389",4
"instance id","src IP","destination ip","12489","49194",2

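When only the values and their counts are needed, not a full row for each, a shorter pipeline built from cut, sort and uniq -c is a common alternative. The sketch below assumes the sample rows from the question saved as example.csv and that no quoted field contains a comma; the `counts` variable is purely illustrative.

```shell
# Recreate the sample rows from the question in example.csv.
cat > example.csv <<'EOF'
"instance id","src IP","destination ip","63812","389"
"instance id","src IP","destination ip","389","65311"
"instance id","src IP","destination ip","63812","389"
"instance id","src IP","destination ip","88","49194"
"instance id","src IP","destination ip","12489","49194"
"instance id","src IP","destination ip","63812","389"
"instance id","src IP","destination ip","63812","389"
EOF

# Extract column 5, group identical values together, count each group,
# order by count (largest first), keep n = 2 lines, then reformat
# "count value" into "value is count times".
counts=$(
    cut -d, -f5 example.csv |
    LC_ALL=C sort |
    uniq -c |
    sort -rn |
    head -n 2 |
    awk '{ print $2 " is " $1 " times" }'
)
printf '%s\n' "$counts"
```

The first sort still has to process every line of the input, but only the single extracted column rather than whole rows.
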
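If you don't need to use awk, the command-line tool GoCSV has a number of sub-commands that let you get the top n unique values of a column, sorted and truncated by the count of those values.

GoCSV requires a header, so for your input the first step is to add some default column names (which can be removed later):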

gocsv cap -default-name 'Col' input.csv

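With a header in place, you can pipe that output into a series of commands that will:

  • Keep only the unique values in column 5, while adding a count of how often each value occurs
  • Sort by count in descending order (the new 6th column added by uniq)
  • Then keep only 50 rows (the "top 50 counts")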
... \
| gocsv uniq -c 5 -count \
| gocsv sort -c 6 -reverse \
| gocsv head -n 50

Running all of that, I get:

Col 1,Col 2,Col 3,Col 4,Col 5,Count
instance id,src IP,destination ip,63812,389,4
instance id,src IP,destination ip,88,49194,2
instance id,src IP,destination ip,389,65311,1

To remove the header, just pipe the output into gocsv behead.

Pre-built binaries are available for a number of platforms/OS-es.