Is there a way to extract all the duplicate records based on a particular column?

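I am trying to extract all (and only) the duplicate values from a pipe-delimited file.

My data file has 800,000 rows and multiple columns, and I am particularly interested in column 3. So I need to get the duplicate values of column 3 and extract all the rows containing them from that file.

I was able to achieve this as follows:

cat Report.txt | awk -F'|' '{print $3}' | sort | uniq -d >dup.txt

Then I loop over the result like this: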

while read dup
do
   grep "$dup" Report.txt >>only_dup.txt
done <dup.txt

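I also tried the awk approach:

while read dup
do
awk -v a=$dup '$3 == a { print $0 }' Report.txt>>only_dup.txt
done <dup.txt

However, because my file has a large number of records, this takes a very long time to complete. So I am looking for a simple and fast alternative.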

For example, I have data like this:

1|learning|Unix|Business|Requirements
2|learning|Unix|Business|Team
3|learning|Linux|Business|Requirements
4|learning|Unix|Business|Team
5|learning|Linux|Business|Requirements
6|learning|Unix|Business|Team
7|learning|Windows|Business|Requirements
8|learning|Mac|Business|Requirements

My expected output, which excludes the unique records:

1|learning|Unix|Business|Requirements
2|learning|Unix|Business|Team
4|learning|Unix|Business|Team
6|learning|Unix|Business|Team
3|learning|Linux|Business|Requirements
5|learning|Linux|Business|Requirements

This might be what you want:

$ awk -F'|' 'NR==FNR{cnt[$3]++; next} cnt[$3]>1' file file
1|learning|Unix|Business|Requirements
2|learning|Unix|Business|Team
3|learning|Linux|Business|Requirements
4|learning|Unix|Business|Team
5|learning|Linux|Business|Requirements
6|learning|Unix|Business|Team
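
To try the two-pass idiom in isolation, here is a minimal sketch that recreates the sample data from the question and runs the same command on it. The file name sample.txt and the variable dups are placeholders, not part of the original answer. The file is listed twice so awk reads it twice: while NR==FNR (the first read) it only counts column-3 values, and on the second read it prints the lines whose value was counted more than once.

```shell
# Recreate the sample data from the question (sample.txt is a placeholder name).
cat > sample.txt <<'EOF'
1|learning|Unix|Business|Requirements
2|learning|Unix|Business|Team
3|learning|Linux|Business|Requirements
4|learning|Unix|Business|Team
5|learning|Linux|Business|Requirements
6|learning|Unix|Business|Team
7|learning|Windows|Business|Requirements
8|learning|Mac|Business|Requirements
EOF

# First read (NR==FNR): count each column-3 value and skip to the next line.
# Second read: print only lines whose column-3 value occurred more than once.
dups=$(awk -F'|' 'NR==FNR { cnt[$3]++; next } cnt[$3] > 1' sample.txt sample.txt)
printf '%s\n' "$dups"
```

Because the file is read twice, this form needs a regular file; it would not work on input arriving through a pipe.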

Or, if the file is too large for all the keys ($3 values) to fit in memory (which shouldn't be a problem for just the unique $3 values from 800,000 lines):

$ cat tst.awk
BEGIN { FS="|" }
{ currKey = $3 }
currKey == prevKey {
    if ( !prevPrinted++ ) {
        print prevRec
    }
    print
    next
}
{
    prevKey = currKey
    prevRec = $0
    prevPrinted = 0
}

$ sort -t'|' -k3,3 file | awk -f tst.awk
3|learning|Linux|Business|Requirements
5|learning|Linux|Business|Requirements
1|learning|Unix|Business|Requirements
2|learning|Unix|Business|Team
4|learning|Unix|Business|Team
6|learning|Unix|Business|Team

EDIT2: Following Ed sir's suggestion, fine-tuned my suggestion with more meaningful (IMO) array names.

awk '
match($0,/[^\|]*\|/){
  val=substr($0,RSTART+RLENGTH)
  if(!unique_check_count[val]++){
    numbered_indexed_array[++count]=val
  }
  actual_valued_array[val]=(actual_valued_array[val]?actual_valued_array[val] ORS:"")$0
  line_count_array[val]++
}
END{
  for(i=1;i<=count;i++){
    if(line_count_array[numbered_indexed_array[i]]>1){
      print actual_valued_array[numbered_indexed_array[i]]
    }
  }
}
'  Input_file

Edit by Ed Morton: FWIW here is how I would have named the variables in the above code:

awk '
match($0,/[^\|]*\|/) {
  key = substr($0,RSTART+RLENGTH)
  if ( !numRecs[key]++ ) {
    keys[++numKeys] = key
  }
  key2recs[key] = (key in key2recs ? key2recs[key] ORS : "") $0
}
END {
  for ( keyNr=1; keyNr<=numKeys; keyNr++ ) {
    key = keys[keyNr]
    if ( numRecs[key]>1 ) {
      print key2recs[key]
    }
  }
}
' Input_file


EDIT: Since the OP changed the Input_file to be |-delimited, I changed the code slightly as follows so that it handles the new Input_file (thanks to Ed Morton sir for pointing this out).

awk '
match($0,/[^\|]*\|/){
  val=substr($0,RSTART+RLENGTH)
  if(!a[val]++){
    b[++count]=val
  }
  c[val]=(c[val]?c[val] ORS:"")$0
  d[val]++
}
END{
  for(i=1;i<=count;i++){
    if(d[b[i]]>1){
      print c[b[i]]
    }
  }
}
'   Input_file


Could you please try the following; it gives the output in the same order in which the lines occur in Input_file.

awk '
match($0,/[^ ]* /){
  val=substr($0,RSTART+RLENGTH)
  if(!a[val]++){
    b[++count]=val
  }
  c[val]=(c[val]?c[val] ORS:"")$0
  d[val]++
}
END{
  for(i=1;i<=count;i++){
    if(d[b[i]]>1){
      print c[b[i]]
    }
  }
}
'  Input_file

The output will be as follows.

2 learning Unix Business Team
4 learning Unix Business Team
6 learning Unix Business Team
3 learning Linux Business Requirements
5 learning Linux Business Requirements

Explanation of the above code:

awk '                                 ##Start the awk program.
match($0,/[^ ]* /){                   ##Match the regex up to and including the first space.
  val=substr($0,RSTART+RLENGTH)       ##Set val to the rest of the line after that match.
  if(!a[val]++){                      ##If val has not been seen before (a[val] is 0), enter the block; a[val] is then incremented.
    b[++count]=val                    ##Store val in array b at the next index, count, to remember first-seen order.
  }                                   ##Close the if block.
  c[val]=(c[val]?c[val] ORS:"")$0     ##Append the current line to c[val], separated by ORS from any lines already stored.
  d[val]++                            ##Increment the number of lines seen for val.
}                                     ##Close the match block.
END{                                  ##Start the END block of the program.
  for(i=1;i<=count;i++){              ##Loop from i=1 to count.
    if(d[b[i]]>1){                    ##If val b[i] occurred more than once, enter the block.
      print c[b[i]]                   ##Print all lines stored for b[i].
    }
  }
}
'  Input_file                         ##Name of the input file.
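
Note that the programs above use everything after the first field as the key. If the key should instead be column 3 of the pipe-delimited file, as the question asks, the same "remember keys in first-seen order, print groups at the end" idea can be sketched as below. This is an illustrative variant, not the answerer's code; sample.txt, grouped, and the array names are placeholders.

```shell
# Recreate the sample data from the question (sample.txt is a placeholder name).
cat > sample.txt <<'EOF'
1|learning|Unix|Business|Requirements
2|learning|Unix|Business|Team
3|learning|Linux|Business|Requirements
4|learning|Unix|Business|Team
5|learning|Linux|Business|Requirements
6|learning|Unix|Business|Team
7|learning|Windows|Business|Requirements
8|learning|Mac|Business|Requirements
EOF

grouped=$(awk -F'|' '
{
  key = $3                                    # key on column 3 only
  if ( !numRecs[key]++ ) {                    # first time this key is seen
    keys[++numKeys] = key                     # remember first-seen order
  }
  # Append the current line to the lines already stored for this key.
  recs[key] = (numRecs[key] > 1 ? recs[key] ORS : "") $0
}
END {
  for ( i = 1; i <= numKeys; i++ ) {
    if ( numRecs[keys[i]] > 1 ) {             # keep only duplicated keys
      print recs[keys[i]]
    }
  }
}' sample.txt)
printf '%s\n' "$grouped"
```

On the sample data this groups the Unix lines (1, 2, 4, 6) before the Linux lines (3, 5), which is the order shown in the question's expected output. It holds every input line in memory, so it trades memory for a single read of the file.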

Another one in awk:

$ awk -F\| '{                  # set delimiter
    n=$1                       # store number
    sub(/^[^|]*/,"",$0)        # remove number from string
    if($0 in a) {              # if $0 in a
        if(a[$0]==1)           # if $0 seen the second time
            print b[$0] $0     # print first instance
        print n $0             # also print current
    }
    a[$0]++                    # increase match count for $0
    b[$0]=n                    # number stored to b and only needed once
}' file

Output for the sample data:

2|learning|Unix|Business|Team
4|learning|Unix|Business|Team
3|learning|Linux|Business|Requirements
5|learning|Linux|Business|Requirements
6|learning|Unix|Business|Team

Also, would this work:

$ sort -k 2 file | uniq -D -f 1

-k2,5 or something. No, because the delimiter changed from space to pipe.

Two steps of improvement.
First step:
After

awk -F'|' '{print $3}' Report.txt | sort | uniq -d >dup.txt
# or
cut -d "|" -f3 < Report.txt | sort | uniq -d >dup.txt

you can use

grep -f <(sed 's/.*/^.*|.*|&|.*|/' dup.txt) Report.txt
# or without process substitution
sed 's/.*/^.*|.*|&|.*|/' dup.txt > dup.sed
grep -f dup.sed Report.txt

Second step:
Use the awk given in the other, better answers.
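
One caveat about the pattern file from the first step: in ^.*|.*|&|.*| each .* can itself match pipes, so a key may also match when it appears in a later column. A possible tightening, sketched below, uses [^|]* so that the key is anchored to column 3. It assumes that the keys contain no regex metacharacters and that every line has at least four columns; sample.txt, dup_keys.txt, dup.pat, and only_dup are placeholder names.

```shell
# Recreate the sample data from the question (sample.txt is a placeholder name).
cat > sample.txt <<'EOF'
1|learning|Unix|Business|Requirements
2|learning|Unix|Business|Team
3|learning|Linux|Business|Requirements
4|learning|Unix|Business|Team
5|learning|Linux|Business|Requirements
6|learning|Unix|Business|Team
7|learning|Windows|Business|Requirements
8|learning|Mac|Business|Requirements
EOF

# Column-3 values that occur more than once.
cut -d'|' -f3 sample.txt | sort | uniq -d > dup_keys.txt

# Turn each key into a basic regex that matches it only as the third field:
# two pipe-free fields, then the key, then the next delimiter.
sed 's/.*/^[^|]*|[^|]*|&|/' dup_keys.txt > dup.pat

# A single grep pass over the file instead of one grep per key.
only_dup=$(grep -f dup.pat sample.txt)
printf '%s\n' "$only_dup"
```

This still reads the file once for the keys and once for the matches, but it replaces the per-key loop from the question with one grep invocation.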