CSV file manipulation in Unix

I have a CSV file like this:

"ID","NAME","TIME"
"858","abc","21:38:52"
"874","ghi","18:20:33"
"858","abc","19:38:52"
"978","def","21:38:52"
"874","ghi","13:20:33"
"319","ghi","13:24:50"
"319","ghi","22:29:16"

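Some records are identical except for the time (the third column), and I only want the most recent one. I need a command that identifies duplicate records and removes the ones with the older timestamps, so that my output file looks like this: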

"ID","NAME","TIME"
"858","abc","21:38:52"
"978","def","21:38:52"
"874","ghi","18:20:33"
"319","ghi","22:29:16"

Could you please try the following awk:

awk -F"," '!b[$1,$2]++{c[++count]=$1 OFS $2} {a[$1,$2]=$0} END{for(i=1;i<=count;i++){print a[c[i]]}}' SUBSEP=" " Input_file

Adding a non-one-liner form of the solution now as well.

awk -F"," '
!b[$1,$2]++           {  c[++count]=$1 OFS $2   }
                      {  a[$1,$2]=$0            }
END{
 for(i=1;i<=count;i++){  print a[c[i]]          }
}
' SUBSEP=" "  Input_file

Explanation:

awk -F"," '
!b[$1,$2]++           {  c[++count]=$1 OFS $2   } ##If this $1,$2 pair has not been seen before (its count in array b is still 0), store "$1 OFS $2" in array c at index count, which is incremented each time this block runs.
                      {  a[$1,$2]=$0            } ##For every line, store the whole line in array a under the index $1,$2 (first and second fields), so a later line overwrites an earlier one.
END{                                              ##Starting the END block of awk here.
 for(i=1;i<=count;i++){  print a[c[i]]          } ##Loop i from 1 to count and print the element of array a whose index is c[i], i.e. one line per pair, in first-seen order.
}
' SUBSEP=" " Input_file                           ##Setting SUBSEP to a space so that the index $1,$2 equals $1 OFS $2, and naming Input_file here.
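
The role of SUBSEP in the explanation above can be checked in isolation. The following is only a minimal sketch: the values "x" and "y" and the variable names default_hit and space_hit are made up for illustration. awk joins the subscripts of a["x","y"] with SUBSEP, so the key built with OFS only matches once SUBSEP is also a space.

```shell
#!/bin/sh
# Default SUBSEP is "\034", so a["x","y"] is NOT stored under "x y".
default_hit=$(awk 'BEGIN {
    a["x","y"] = 1
    k = "x" OFS "y"                 # OFS defaults to a single space
    print ((k in a) ? "hit" : "miss")
}')

# With SUBSEP set to a space, a["x","y"] is stored under "x y".
space_hit=$(awk 'BEGIN {
    SUBSEP = " "
    a["x","y"] = 1
    k = "x" OFS "y"
    print ((k in a) ? "hit" : "miss")
}')

printf 'default SUBSEP: %s\nSUBSEP=" ":     %s\n' "$default_hit" "$space_hit"
```

The first lookup reports a miss and the second a hit, which is why the command passes SUBSEP=" " before the file name.
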
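
Another approach reverses the file so that the last line for each ID is seen first, keeps the first line seen per ID, and reverses the result again:

$ tac file | awk -F, '!seen[$1]++' | tac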
"ID","NAME","TIME"
"858","abc","21:38:52"
"978","def","21:38:52"
"874","ghi","13:20:33"
"319","ghi","22:29:16"

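To see which record such a "last line wins" pipeline keeps, the sample data can be fed through it directly. This is a sketch under assumptions: sample.csv is a hypothetical file name, and the awk-based tac fallback is only there for systems that lack GNU tac.

```shell
#!/bin/sh
# sample.csv is a hypothetical file name; the rows are the sample from the question.
cat > sample.csv <<'EOF'
"ID","NAME","TIME"
"858","abc","21:38:52"
"874","ghi","18:20:33"
"858","abc","19:38:52"
"978","def","21:38:52"
"874","ghi","13:20:33"
"319","ghi","13:24:50"
"319","ghi","22:29:16"
EOF

# Portability assumption: if tac is missing, reverse the lines with awk instead.
command -v tac >/dev/null 2>&1 ||
    tac() { awk '{ line[NR] = $0 } END { for (i = NR; i >= 1; i--) print line[i] }' "$@"; }

# Reverse, keep the first line seen per ID, reverse back: the last line in the file wins.
kept=$(tac sample.csv | awk -F, '!seen[$1]++' | tac)

# Show which record survived for ID 874.
kept_874=$(printf '%s\n' "$kept" | grep '^"874"')
printf '%s\n' "$kept_874"
```

For ID 874 this prints the 13:20:33 record, because that line comes last in the file, even though 18:20:33 is the later time. The result is only the newest record when the input is already ordered by time.
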
Several of the answers so far implicitly rely on the timestamps already being ordered/sorted in the file. The following makes no such assumption:

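#! /usr/bin/awk -f
BEGIN { FS = OFS = SUBSEP = "," }
FNR == 1 {
    split("", tt) # Time Text
    split("", ts) # Numeric Timestamp
    print
    next
}
{
    t = $3
    gsub(/[":]/, "", t)   # "21:38:52" becomes 213852
    if ((($1, $2) in ts) && (ts[$1, $2] >= t))
        next              # an equal or newer record is already stored
    ts[$1, $2] = t
    tt[$1, $2] = $3
}
END {
    for (x in tt)         # note: the iteration order is unspecified
        print x, tt[x]
}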
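
The order in which awk's for (x in tt) visits keys is unspecified, so the records may come out in any order. If the order of first appearance matters, a variant along the following lines might help. It is only a sketch: sample.csv, best, order and n are illustrative names, and it compares the quoted HH:MM:SS field as a plain string, which assumes the times are always zero-padded.

```shell
#!/bin/sh
# sample.csv is a hypothetical file name; the rows are the sample from the question.
cat > sample.csv <<'EOF'
"ID","NAME","TIME"
"858","abc","21:38:52"
"874","ghi","18:20:33"
"858","abc","19:38:52"
"978","def","21:38:52"
"874","ghi","13:20:33"
"319","ghi","13:24:50"
"319","ghi","22:29:16"
EOF

latest=$(awk -F, '
NR == 1 { print; next }               # pass the header through unchanged
{
    key = $1 FS $2                    # ID and NAME together identify a record
    if (!(key in best)) {
        order[++n] = key              # remember the order of first appearance
        best[key] = $3
    } else if ($3 > best[key]) {
        best[key] = $3                # string comparison; works for zero-padded times
    }
}
END { for (i = 1; i <= n; i++) print order[i] FS best[order[i]] }
' sample.csv)

printf '%s\n' "$latest"
```

On the sample data this keeps 21:38:52 for 858, 18:20:33 for 874, 21:38:52 for 978 and 22:29:16 for 319, printed in the order each ID first appears in the file.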