Create an awk file to filter out duplicated lines of a dataset
I have the following dataset and I want to write an awk file that iterates over it line by line (with awk itself or a for loop), executed like this:
gawk -f file.awk dataset.csv
The goal is to end up with a file in which no record is duplicated and the floating-point number in the last column is rounded to two decimal places. Below is a sample of my dataset; as you can see, there should be only one record per country.
Country,Other names,ISO 3166-1 alpha-3 CODE,Population,Continent,Total Cases,Total Deaths,Tot Cases//1M pop,Tot Deaths/1M pop,Death percentage
Afghanistan,Afghanistan,AFG,40462186,Asia,177827,7671,4395,190,4.313743132
Albania,Albania,ALB,2872296,Europe,273870,3492,95349,1216,1.275057509
Algeria,Algeria,DZA,45236699,Africa,265691,6874,5873,152,2.587215976
Andorra,Andorra,AND,77481,Europe,40024,153,516565,1975,0.382270638
Angola,Angola,AGO,34654212,Africa,99194,1900,2862,55,1.915438434
Anguilla,Anguilla,AIA,15237,Latin America and the Caribbean,2700,9,177200,591,0.333333333
Antigua and Barbuda,Antigua and Barbuda,ATG,99348,Latin America and the Caribbean,7493,135,75422,1359,1.801681569
Argentina,Argentina,ARG,45921761,Latin America and the Caribbean,9041124,128065,196881,2789,1.416472111
Armenia,Armenia,ARM,2972939,Asia,422574,8617,142140,2898,2.039169471
I am not very experienced, so I don't mind if the code is long, as long as I can follow its steps.
awk '{a[$0]++}END{for (i in a)if (a[i]>1)print i;}' file
I found this command, which could help with something like this, but it is a shell command rather than an awk script.
Finally, as a guide, here is the desired output for the posted sample (my sample contains no duplicates):
Country,Other names,ISO 3166-1 alpha-3 CODE,Population,Continent,Total Cases,Total Deaths,Tot Cases//1M pop,Tot Deaths/1M pop,Death percentage
Afghanistan,Afghanistan,AFG,40462186,Asia,177827,7671,4395,190,4.31
Albania,Albania,ALB,2872296,Europe,273870,3492,95349,1216,1.27
Algeria,Algeria,DZA,45236699,Africa,265691,6874,5873,152,2.58
Andorra,Andorra,AND,77481,Europe,40024,153,516565,1975,0.38
Angola,Angola,AGO,34654212,Africa,99194,1900,2862,55,1.91
Anguilla,Anguilla,AIA,15237,Latin America and the Caribbean,2700,9,177200,591,0.33
Antigua and Barbuda,Antigua and Barbuda,ATG,99348,Latin America and the Caribbean,7493,135,75422,1359,1.80
Argentina,Argentina,ARG,45921761,Latin America and the Caribbean,9041124,128065,196881,2789,1.41
Armenia,Armenia,ARM,2972939,Asia,422574,8617,142140,2898,2.03
Thanks in advance for your help.
Your original code:
awk '{a[$0]++}END{for (i in a)if (a[i]>1)print i;}' file
The test is inverted: a[i]>1 should be a[i]==1 so that only the unique lines are printed.
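For reference, a minimal sketch of your one-liner with the test flipped (it still treats the header like any other line and does not preserve input order; the script further below handles the header):
awk '{a[$0]++} END{for (i in a) if (a[i]==1) print i}' dataset.csv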
Some ways to reduce n to two decimal places are:
n = substr(n,1,match(n,/[.]/)+2)
n = sprintf("%0.2f",n)
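Note that the two behave differently: substr truncates, while sprintf rounds. With the posted sample, only truncation reproduces the expected output (e.g. 1.275057509 becomes 1.27, whereas sprintf gives 1.28). To compare them side by side, a small throwaway script (a hypothetical demo.awk, applied to your dataset) could be:
BEGIN { FS = "," }
NR > 1 {
    n = $10
    print substr(n, 1, match(n, /[.]/) + 2), sprintf("%0.2f", n)   # truncated vs rounded
}
Running gawk -f demo.awk dataset.csv prints pairs such as "4.31 4.31" and "1.27 1.28".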
So your script could be:
BEGIN { FS=OFS="," } # delimit columns by comma
# csv must not have embedded commas
NR==1 {print; next} # print header
{ $10 = sprintf("%0.2f", $10) } # round column 10 to two decimals
# assigning to $10 rebuilds $0 using OFS
{ a[$0]++ } # using $0 means the entire line must be unique
END { for (i in a) if (a[i]==1) print i } # print unique lines
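One caveat: in awk the iteration order of "for (i in a)" is unspecified, so the unique lines may come out in a different order than the input. If you want to keep the input order, one possible variant (a sketch, not tested against your full dataset) remembers the first occurrence of each line:
BEGIN { FS=OFS="," }              # delimit columns by comma
NR==1 { print; next }             # print header
{ $10 = sprintf("%0.2f", $10) }   # round column 10 to two decimals
!($0 in a) { order[++n] = $0 }    # remember first occurrence, in input order
{ a[$0]++ }
END { for (j=1; j<=n; j++) if (a[order[j]]==1) print order[j] }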
Given your comment about data cleaning, a two-pass approach may serve you better: use your original code to alert you to input errors, then truncate in a separate pass.
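For example, the first pass could do nothing but report duplicated lines, something like (a sketch; dataset.csv is the file name from your example):
gawk '{ if (a[$0]++) print "duplicate at line " FNR ": " $0 }' dataset.csv
and the second pass would then be the rounding/filtering script above.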
Note that a change in a single column still produces apparently valid input. These two lines differ:
Afghanistan,Afghanistan,AFG,40462186,Asia,177827,7671,4395,190,4.313743132
Afghanistan,Afghanistan,AFG,40462106,Asia,177827,7671,4395,190,4.313743132
I imagine you would want to detect that, so your sanity check will need to be more sophisticated.
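For example, you could key the check on the country code instead of the whole line and warn whenever the same code shows up with different data. A sketch, assuming column 3 (the ISO 3166-1 alpha-3 code) identifies a country:
BEGIN { FS = "," }
NR == 1 { next }                      # skip header
($3 in seen) && seen[$3] != $0 {      # same country code, different data
    print "conflicting records for " $3 ":"
    print seen[$3]
    print $0
}
{ seen[$3] = $0 }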