Is there a way to extract all the duplicate records based on a particular column?
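I am trying to extract all (and only) the duplicate values from a pipe-delimited file.
My data file has 800,000 rows and multiple columns, and I am particularly interested in column 3. So I need to get the duplicate values of column 3 and extract all the rows containing those values from the file.
I am able to achieve this as shown below:
cat Report.txt | awk -F'|' '{print $3}' | sort | uniq -d >dup.txt
Then I loop over the result as shown below:
while read dup
do
    grep "$dup" Report.txt >>only_dup.txt
done <dup.txt
I also tried an awk approach:
while read dup
do
    awk -v a=$dup '$3 == a { print $0 }' Report.txt>>only_dup.txt
done <dup.txt
However, because my file has a large number of records, it takes ages to complete. So I am looking for a simple and quick alternative.
For example, I have data like this: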
1|learning|Unix|Business|Requirements
2|learning|Unix|Business|Team
3|learning|Linux|Business|Requirements
4|learning|Unix|Business|Team
5|learning|Linux|Business|Requirements
6|learning|Unix|Business|Team
7|learning|Windows|Business|Requirements
8|learning|Mac|Business|Requirements
My expected output, which excludes the unique records:
1|learning|Unix|Business|Requirements
2|learning|Unix|Business|Team
4|learning|Unix|Business|Team
6|learning|Unix|Business|Team
3|learning|Linux|Business|Requirements
5|learning|Linux|Business|Requirements
This might be what you want:
$ awk -F'|' 'NR==FNR{cnt[$3]++; next} cnt[$3]>1' file file
1|learning|Unix|Business|Requirements
2|learning|Unix|Business|Team
3|learning|Linux|Business|Requirements
4|learning|Unix|Business|Team
5|learning|Linux|Business|Requirements
6|learning|Unix|Business|Team
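To make the two-pass idiom concrete, here is a small self-contained sketch. It assumes the question's sample rows are saved in a scratch file; the names `sample.txt` and `two_pass.out` are placeholders, not from the original post. awk is given the same file twice: `NR==FNR` is only true while the first copy is being read, so that pass just counts each `$3`, and the second pass prints every row whose count is above 1.

```shell
# Scratch copy of the sample data from the question (sample.txt is a placeholder name).
cat > sample.txt <<'EOF'
1|learning|Unix|Business|Requirements
2|learning|Unix|Business|Team
3|learning|Linux|Business|Requirements
4|learning|Unix|Business|Team
5|learning|Linux|Business|Requirements
6|learning|Unix|Business|Team
7|learning|Windows|Business|Requirements
8|learning|Mac|Business|Requirements
EOF

# Pass 1 (NR==FNR): count every column-3 value and skip to the next record.
# Pass 2: the bare condition cnt[$3] > 1 prints rows whose value was seen more than once.
awk -F'|' 'NR==FNR { cnt[$3]++; next } cnt[$3] > 1' sample.txt sample.txt > two_pass.out
cat two_pass.out
```

Only the counts per distinct `$3` value are held in memory, and the rows come out in their original file order rather than grouped by key.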
Or, if the file is too large for all the keys ($3 values) to fit in memory (which should not be a problem with just the unique $3 values from 800,000 lines):
$ cat tst.awk
BEGIN { FS="|" }
{ currKey = $3 }
currKey == prevKey {
    if ( !prevPrinted++ ) {
        print prevRec
    }
    print
    next
}
{
    prevKey = currKey
    prevRec = $0
    prevPrinted = 0
}
$ sort -t'|' -k3,3 file | awk -f tst.awk
3|learning|Linux|Business|Requirements
5|learning|Linux|Business|Requirements
1|learning|Unix|Business|Requirements
2|learning|Unix|Business|Team
4|learning|Unix|Business|Team
6|learning|Unix|Business|Team
EDIT2: As per Ed's suggestion, I fine-tuned my suggestion with more meaningful (IMO) array names.
awk '
match($0,/[^\|]*\|/){
  val=substr($0,RSTART+RLENGTH)
  if(!unique_check_count[val]++){
    numbered_indexed_array[++count]=val
  }
  actual_valued_array[val]=(actual_valued_array[val]?actual_valued_array[val] ORS:"")$0
  line_count_array[val]++
}
END{
  for(i=1;i<=count;i++){
    if(line_count_array[numbered_indexed_array[i]]>1){
      print actual_valued_array[numbered_indexed_array[i]]
    }
  }
}
' Input_file
Edit by Ed Morton: FWIW here is how I would have named the variables in the above code:
awk '
match($0,/[^\|]*\|/) {
    key = substr($0,RSTART+RLENGTH)
    if ( !numRecs[key]++ ) {
        keys[++numKeys] = key
    }
    key2recs[key] = (key in key2recs ? key2recs[key] ORS : "") $0
}
END {
    for ( keyNr=1; keyNr<=numKeys; keyNr++ ) {
        key = keys[keyNr]
        if ( numRecs[key]>1 ) {
            print key2recs[key]
        }
    }
}
' Input_file
EDIT: Since the OP changed the Input_file to be | delimited, I changed the code slightly as follows to handle the new Input_file (thanks to Ed Morton for pointing this out).
awk '
match($0,/[^\|]*\|/){
  val=substr($0,RSTART+RLENGTH)
  if(!a[val]++){
    b[++count]=val
  }
  c[val]=(c[val]?c[val] ORS:"")$0
  d[val]++
}
END{
  for(i=1;i<=count;i++){
    if(d[b[i]]>1){
      print c[b[i]]
    }
  }
}
' Input_file
Could you please try the following; it gives the output in the same order in which the lines occur in Input_file.
awk '
match($0,/[^ ]* /){
  val=substr($0,RSTART+RLENGTH)
  if(!a[val]++){
    b[++count]=val
  }
  c[val]=(c[val]?c[val] ORS:"")$0
  d[val]++
}
END{
  for(i=1;i<=count;i++){
    if(d[b[i]]>1){
      print c[b[i]]
    }
  }
}
' Input_file
The output will be as follows.
2 learning Unix Business Team
4 learning Unix Business Team
6 learning Unix Business Team
3 learning Linux Business Requirements
5 learning Linux Business Requirements
Explanation of the above code:
awk '                                ##Start the awk program.
match($0,/[^ ]* /){                  ##Match everything up to and including the first space.
  val=substr($0,RSTART+RLENGTH)      ##Set val to the rest of the line after that match.
  if(!a[val]++){                     ##If val has not been seen before (a[val] is still 0), then
    b[++count]=val                   ##store val in array b at the next index, recording first-seen order.
  }                                  ##Close the if block.
  c[val]=(c[val]?c[val] ORS:"")$0    ##Append the current line to c[val], separated by ORS from earlier lines.
  d[val]++                           ##Count how many lines share this val.
}                                    ##Close the match block.
END{                                 ##Start the END block.
  for(i=1;i<=count;i++){             ##Loop over the stored values in first-seen order.
    if(d[b[i]]>1){                   ##If this value occurred more than once, then
      print c[b[i]]                  ##print all the lines stored for it.
    }
  }
}
' Input_file                         ##Name of the input file.
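Note that the code above treats everything after the first field as the key, which is why row 1 is missing from its output. If the key should be column 3 only, as the question asks, the same bookkeeping can be adapted. The following is a minimal sketch under that assumption, not the answerer's code; the array names `recs`, `count` and `keys` and the file names `sample.txt` and `grouped.out` are placeholders.

```shell
# Scratch copy of the sample data from the question (sample.txt is a placeholder name).
cat > sample.txt <<'EOF'
1|learning|Unix|Business|Requirements
2|learning|Unix|Business|Team
3|learning|Linux|Business|Requirements
4|learning|Unix|Business|Team
5|learning|Linux|Business|Requirements
6|learning|Unix|Business|Team
7|learning|Windows|Business|Requirements
8|learning|Mac|Business|Requirements
EOF

awk -F'|' '
{
    key = $3
    if ( count[key]++ ) {
        recs[key] = recs[key] ORS $0   # later occurrence: append to the stored group
    } else {
        keys[++numKeys] = key          # first occurrence: remember first-seen order
        recs[key] = $0
    }
}
END {
    for ( i = 1; i <= numKeys; i++ ) {
        if ( count[keys[i]] > 1 ) {    # only keys that occurred more than once
            print recs[keys[i]]
        }
    }
}' sample.txt > grouped.out
cat grouped.out
```

On the sample data this groups the Unix rows (1, 2, 4, 6) ahead of the Linux rows (3, 5), the same order as the expected output in the question. The trade-off is that every row is held in memory until the END block runs.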
Another in awk:
$ awk -F\| '{                  # set delimiter
    n=$1                       # store number
    sub(/^[^|]*/,"",$0)        # remove number from string
    if($0 in a) {              # if $0 in a
        if(a[$0]==1)           # if $0 seen the second time
            print b[$0] $0     # print first instance
        print n $0             # also print current
    }
    a[$0]++                    # increase match count for $0
    b[$0]=n                    # number stored to b and only needed once
}' file
Output for the sample data:
2|learning|Unix|Business|Team
4|learning|Unix|Business|Team
3|learning|Linux|Business|Requirements
5|learning|Linux|Business|Requirements
6|learning|Unix|Business|Team
Also, would this work:
$ sort -k 2 file | uniq -D -f 1
or -k2,5 or something. No, it would not, since the delimiter changed from space to pipe.
Two steps of improvement.
First step:
After
awk -F'|' '{print $3}' Report.txt | sort | uniq -d >dup.txt
# or
cut -d "|" -f3 < Report.txt | sort | uniq -d >dup.txt
you can use
grep -f <(sed 's/.*/^.*|.*|&|.*|/' dup.txt) Report.txt
# or without process substitution
sed 's/.*/^.*|.*|&|.*|/' dup.txt > dup.sed
grep -f dup.sed Report.txt
Second step:
Use awk as given in the other, better answers.
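One caveat with the grep patterns in the first step: `.*` can also match `|`, so they only pin the value to column 3 when every row has exactly five fields. A hedged variation uses `[^|]*` so that each placeholder stays within one field. This sketch assumes every row has at least four fields and that the duplicated values contain no regex metacharacters; `sample.txt`, `dup.pat` and `grep_dups.out` are placeholder names.

```shell
# Scratch copy of the sample data from the question (sample.txt is a placeholder name).
cat > sample.txt <<'EOF'
1|learning|Unix|Business|Requirements
2|learning|Unix|Business|Team
3|learning|Linux|Business|Requirements
4|learning|Unix|Business|Team
5|learning|Linux|Business|Requirements
6|learning|Unix|Business|Team
7|learning|Windows|Business|Requirements
8|learning|Mac|Business|Requirements
EOF

# Column-3 values that occur more than once.
cut -d '|' -f3 < sample.txt | sort | uniq -d > dup.txt

# One pattern per value: two pipe-free fields, then the value as the whole third field.
sed 's/.*/^[^|]*|[^|]*|&|/' dup.txt > dup.pat

# A single grep pass over the data instead of one pass per duplicated value.
grep -f dup.pat sample.txt > grep_dups.out
cat grep_dups.out
```

As with the original patterns, the matching rows are printed in file order.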