CSV file manipulation in Unix
I have a CSV file like this:
"ID","NAME","TIME"
"858","abc","21:38:52"
"874","ghi","18:20:33"
"858","abc","19:38:52"
"978","def","21:38:52"
"874","ghi","13:20:33"
"319","ghi","13:24:50"
"319","ghi","22:29:16"
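Some records are identical except for the time (the third column), and I only want the most recent one. I need a command that identifies the duplicate records and removes those with the older timestamp, so that my output file looks like this: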
"ID","NAME","TIME"
"858","abc","21:38:52"
"978","def","21:38:52"
"874","ghi","18:20:33"
"319","ghi","22:29:16"
Could you please try the following awk command.
awk -F"," '!b[$1,$2]++{c[++count]=$1 OFS $2} {a[$1,$2]=$0} END{for(i=1;i<=count;i++){print a[c[i]]}}' SUBSEP=" " Input_file
Adding a non-one-liner form of the solution too.
awk -F"," '
!b[$1,$2]++ { c[++count]=$1 OFS $2 }
{ a[$1,$2]=$0 }
END{
  for(i=1;i<=count;i++){ print a[c[i]] }
}
' SUBSEP=" " Input_file
Explanation:
awk -F"," '
!b[$1,$2]++ { c[++count]=$1 OFS $2 }   ##If this $1,$2 pair has not been seen before (its counter in array b is still 0), store the key $1 OFS $2 in array c at index count, which is incremented each time this block runs.
{ a[$1,$2]=$0 }                        ##Store the current line in array a, indexed by $1,$2 (first and second field); a later line with the same key overwrites the earlier one.
END{                                   ##Start the END block of awk here.
  for(i=1;i<=count;i++){ print a[c[i]] }   ##Loop i from 1 to count and print the value of array a whose index is c[i].
}
' SUBSEP=" " Input_file                ##Set SUBSEP to a space so array indexes match $1 OFS $2, and name the Input_file here.
$ tac file | awk -F, '!seen[$1]++' | tac
"ID","NAME","TIME"
"858","abc","21:38:52"
"978","def","21:38:52"
"874","ghi","13:20:33"
"319","ghi","22:29:16"
Several of the answers so far implicitly rely on the timestamps in the file already being ordered/sorted. The following makes no such assumption:
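#! /usr/bin/awk -f
BEGIN { FS = OFS = SUBSEP = "," }
FNR == 1 {
    split("", tt)    # Time Text
    split("", ts)    # Numeric Timestamp
    print            # pass the header line through
    next
}
{
    t = $3
    gsub(/[":]/, "", t)    # "21:38:52" becomes 213852
    # Skip this record if a time at least as recent is already stored for the key.
    if ((($1, $2) in ts) && (ts[$1, $2] >= t))
        next
    ts[$1, $2] = t
    tt[$1, $2] = $3
}
END {
    # Note: for (x in tt) visits the keys in an unspecified order.
    for (x in tt)
        print x, tt[x]
}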
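To illustrate the point about not relying on sorted input, here is a minimal sketch that keeps the latest time per ID and NAME pair while printing the records in first-seen order. It is a variant for illustration, not any answerer's original code: the temporary file and the names `input`, `result`, `latest`, `order` and `n` are assumptions, and it relies on the times being zero-padded HH:MM:SS so that plain string comparison orders them correctly.

```shell
#!/bin/sh
# Sample data copied from the question, written to a temporary file (an assumption).
input=$(mktemp)
cat > "$input" <<'EOF'
"ID","NAME","TIME"
"858","abc","21:38:52"
"874","ghi","18:20:33"
"858","abc","19:38:52"
"978","def","21:38:52"
"874","ghi","13:20:33"
"319","ghi","13:24:50"
"319","ghi","22:29:16"
EOF

result=$(awk -F, '
FNR == 1 { print; next }                       # pass the header through unchanged
{
    key = $1 FS $2                             # group on ID and NAME
    if (!(key in latest)) order[++n] = key     # remember first-seen order of keys
    # Zero-padded, quoted HH:MM:SS fields compare correctly as strings.
    if (!(key in latest) || $3 > latest[key]) latest[key] = $3
}
END { for (i = 1; i <= n; i++) print order[i] FS latest[order[i]] }
' "$input")

printf '%s\n' "$result"
rm -f "$input"
```

The rows come out in the order each ID and NAME pair first appears in the input, which may differ from the row order shown in the question's desired output.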