Create an awk file to filter out unduplicated lines of a dataset

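I have the following dataset and I would like to implement an iteration, inside an awk file, that checks it line by line (in awk or with a for loop), and then run it like this: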

gawk -f file.awk dataset.csv

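This should let me obtain a file in which no record is duplicated and the floats in the last column are rounded to two decimals. Below I attach a sample of my dataset; as you can see, there should be only one record per country.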

Country,Other names,ISO 3166-1 alpha-3 CODE,Population,Continent,Total Cases,Total Deaths,Tot Cases//1M pop,Tot Deaths/1M pop,Death percentage
Afghanistan,Afghanistan,AFG,40462186,Asia,177827,7671,4395,190,4.313743132
Albania,Albania,ALB,2872296,Europe,273870,3492,95349,1216,1.275057509
Algeria,Algeria,DZA,45236699,Africa,265691,6874,5873,152,2.587215976
Andorra,Andorra,AND,77481,Europe,40024,153,516565,1975,0.382270638
Angola,Angola,AGO,34654212,Africa,99194,1900,2862,55,1.915438434
Anguilla,Anguilla,AIA,15237,Latin America and the Caribbean,2700,9,177200,591,0.333333333
Antigua and Barbuda,Antigua and Barbuda,ATG,99348,Latin America and the Caribbean,7493,135,75422,1359,1.801681569
Argentina,Argentina,ARG,45921761,Latin America and the Caribbean,9041124,128065,196881,2789,1.416472111
Armenia,Armenia,ARM,2972939,Asia,422574,8617,142140,2898,2.039169471

Since I am not very experienced, I don't mind if the code is long, as that helps me get familiar with the steps.


awk '{a[$0]++}END{for (i in a)if (a[i]>1)print i;}' file

I found this command, which helps achieve something like this, but it is a shell command rather than an awk script.

Finally, as a guide, here is the desired output for the posted sample, given that my sample contains no duplicates:

Country,Other names,ISO 3166-1 alpha-3 CODE,Population,Continent,Total Cases,Total Deaths,Tot Cases//1M pop,Tot Deaths/1M pop,Death percentage
Afghanistan,Afghanistan,AFG,40462186,Asia,177827,7671,4395,190,4.31
Albania,Albania,ALB,2872296,Europe,273870,3492,95349,1216,1.27
Algeria,Algeria,DZA,45236699,Africa,265691,6874,5873,152,2.58
Andorra,Andorra,AND,77481,Europe,40024,153,516565,1975,0.38
Angola,Angola,AGO,34654212,Africa,99194,1900,2862,55,1.91
Anguilla,Anguilla,AIA,15237,Latin America and the Caribbean,2700,9,177200,591,0.33
Antigua and Barbuda,Antigua and Barbuda,ATG,99348,Latin America and the Caribbean,7493,135,75422,1359,1.80
Argentina,Argentina,ARG,45921761,Latin America and the Caribbean,9041124,128065,196881,2789,1.41
Armenia,Armenia,ARM,2972939,Asia,422574,8617,142140,2898,2.03

Thanks in advance for your help.

Your original code:

awk '{a[$0]++}END{for (i in a)if (a[i]>1)print i;}' file

The test is reversed: a[i]>1 should be a[i]==1 to print only the unique lines.
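To see what the comparison changes, here is a small sketch on a made-up four-line input (the letters and the shell variable names are placeholders, not part of the dataset):

```sh
# Hypothetical input: "b" occurs twice, "a" and "c" occur once.
input=$(printf 'a\nb\nb\nc\n')

# a[i] > 1 selects the lines that occur more than once.
repeated=$(printf '%s\n' "$input" |
    awk '{ a[$0]++ } END { for (i in a) if (a[i] > 1) print i }')

# a[i] == 1 selects the lines that occur exactly once.
# for (i in a) visits keys in no particular order, hence the sort.
unique=$(printf '%s\n' "$input" |
    awk '{ a[$0]++ } END { for (i in a) if (a[i] == 1) print i }' | sort)

echo "repeated: $repeated"
echo "unique:"
echo "$unique"
```

With a[i]==1 a line that occurs twice is dropped entirely, not reduced to a single copy.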

Some ways to cut n down to two decimal places are (the substr form truncates, while sprintf rounds):

n = substr(n,1,match(n,/[.]/)+2)

n = sprintf("%0.2f",n)
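The two expressions do not always give the same result, since one cuts the string and the other formats the number. A quick check on one value from the sample data (the shell variable names are only for illustration):

```sh
# Albania's death percentage from the sample dataset.
n=1.275057509

# substr keeps the characters up to two places after the dot (truncation).
truncated=$(awk -v n="$n" 'BEGIN { print substr(n, 1, match(n, /[.]/) + 2) }')

# sprintf formats the number with two decimals, rounding on the third.
rounded=$(awk -v n="$n" 'BEGIN { print sprintf("%0.2f", n) }')

echo "truncated=$truncated rounded=$rounded"
# prints: truncated=1.27 rounded=1.28
```

The desired output in the question shows 1.27 for this record, which matches the substr form. Note also that when n contains no decimal point, match returns 0 and the substr form keeps only the first two characters, so a guard may be needed for integer values.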

So your script could be:

BEGIN { FS=OFS="," } # delimit columns by comma
                     # csv must not have embedded commas

NR==1 {print; next} # print header

{ $10 = sprintf("%0.2f", $10) } # format column 10 to two decimals (sprintf rounds)
                                # rewrites $0 so uses OFS

{ a[$0]++ } # using $0 means entire line must be unique

END { for (i in a) if (a[i]==1) print i } # print unique lines

Given your comment about data cleaning, a two-pass approach may be better: use your original code to alert you to bad input, then truncate in a separate pass.
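A minimal sketch of such a two-pass run follows. It assumes a shortened, hypothetical three-column sample with one repeated record, and it uses $NF (the last column) in place of column 10 so that it works on the shortened file; the variable names are illustrative only.

```sh
# Hypothetical sample: the Albania record is repeated to simulate dirty input.
data=$(mktemp)
cat > "$data" <<'EOF'
Country,Code,Death percentage
Afghanistan,AFG,4.313743132
Albania,ALB,1.275057509
Albania,ALB,1.275057509
EOF

# Pass 1: report every data line that occurs more than once, with its count.
dupes=$(awk '
    NR > 1 { count[$0]++ }
    END { for (line in count) if (count[line] > 1) print count[line] "x " line }
' "$data")

# Pass 2: truncate the last column to two decimals, keeping the input order.
truncated=$(awk '
    BEGIN { FS = OFS = "," }
    NR == 1 { print; next }                                 # header unchanged
    match($NF, /[.]/) { $NF = substr($NF, 1, RSTART + 2) }  # skip values with no dot
    { print }
' "$data")

echo "$dupes"
echo "$truncated"
rm -f "$data"
```

Pass 2 deliberately does not remove anything; the idea is that the lines reported by pass 1 are inspected and fixed in the source file before the second pass is run on it.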

Note that if a single column is altered, you get input that looks valid. These lines are different:

Afghanistan,Afghanistan,AFG,40462186,Asia,177827,7671,4395,190,4.313743132
Afghanistan,Afghanistan,AFG,40462106,Asia,177827,7671,4395,190,4.313743132

I think you would want to detect this, so your sanity check needs to be more sophisticated.
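One possible form of such a check, sketched under the assumption that the ISO code in column 3 identifies a country, is to group records by that key and report every key that has more than one record, whether or not the lines are identical. The sample below reuses the two Afghanistan lines above plus one Albania line and omits the header.

```sh
data=$(mktemp)
cat > "$data" <<'EOF'
Afghanistan,Afghanistan,AFG,40462186,Asia,177827,7671,4395,190,4.313743132
Afghanistan,Afghanistan,AFG,40462106,Asia,177827,7671,4395,190,4.313743132
Albania,Albania,ALB,2872296,Europe,273870,3492,95349,1216,1.275057509
EOF

conflicts=$(awk -F, '
    # Collect all records that share the same key (column 3, an assumed key).
    {
        count[$3]++
        records[$3] = (count[$3] == 1) ? $0 : (records[$3] "\n" $0)
    }
    # Report each key seen more than once, followed by its records.
    END {
        for (key in count)
            if (count[key] > 1)
                print key ": " count[key] " records\n" records[key]
    }
' "$data")

echo "$conflicts"
rm -f "$data"
```

Here the AFG key is reported with both of its lines, so the differing population values can be compared by eye. On a file with a header line, the header's column 3 text acts as a key that occurs once and is therefore not reported.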