Sed/Awk:如果第一行中的模式重复,如何查找和删除两行; bash

Sed/Awk: how to find and remove two lines if a pattern in the first line is being repeated; bash

我正在处理每个文件有数千条记录的文本文件。每条记录由两行组成:以“>”开头的header,后面是一长串字符“-AGTCNR”。 header 有 10 个字段,用“|”分隔其第一个字段是每条记录的唯一标识符,例如“>KEN096-15”,如果一条记录具有相同的标识符,则该记录被称为重复。下面是一个简单记录的样子:

>ACML500-12|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_-2  
----TAAGATTTTGACTTCTTCCCCCATCATCAAGAAGAATTGT-------  
>ACRJP458-10|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_N  
-----------TCCCTTTAATACTAGGAGCCCCTGACATAGCCTTTCCTAAATAAT-----  
>ASILO303-17|Dip|gs-Par|sp-Par vid|subsp-NA|co  
-------TAAGATTCTGATTACTCCCCCCCTCTCTAACTCTTCTTCTTCTATAGTAGATG  
>ASILO326-17|Dip|gs-Goe|sp-Goe par|subsp-NA|c  
TAAGATTTTGATTATTACCCCCTTCATTAACCAGGAACAGGATGA---------------  
>CLT100-09|Lep|gs-Col|sp-Col elg|subsp-NA|co-Buru  
AACATTATATTTGGAATTT-------GATCAGGAATAGTCGGAACTTCTCTGAA------  
>PMANL2431-12|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_  
----ATGCCTATTATAATTGGAGGATTTGGAAAACCTTTAATATT----CCGAAT  
>STBOD057-09|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_N  
ATCTAATATTGCACATAGAGGAACCTCNGTATTTTTTCTCTCCATCT------TTAG  
>TBBUT582-11|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_N  
-----CCCCCTCATTAACATTACTAAGTTGAAAATGGAGCAGGAACAGGATGA  
>TBBUT583-11|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_N  
TAAGATTTTGACTCATTAA----------------AATGGAGCAGGAACAGGATGA  
>AFBTB001-09|Col|gs-NA|sp-NA|subsp-NA|co-Ethi|site-NA|lat_N  
TAAGCTCCATCC-------------TAGAAAGAGGGG---------GGGTGA  
>PMANL2431-12|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_  
----ATGCCTATTAGGAAATTGATTAGTACCTTTAATATT----CCGAAT---  
>AFBTB003-09|Col|gs-NA|sp-NA|subsp-NA|co-Ethi|site-NA|lat_N  
TAAGATTTTGACTTCTGC------CATGAGAAAGA-------------AGGGTGA  
>AFBTB002-09|Cole|gs-NA|sp-NA|subsp-NA|co-Ethi|site-NA|lat_N  
-------TCTTCTGCTCAT-------GGGGCAGGAACAGGG----------TGA  
>ACRJP458-10|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_N  
-----------TCCCTTTAATACTAGGAGCCCCTTTCCT----TAAATAAT-----  

现在我正在尝试删除重复项,例如 "ACRJP458-10" 和 "PMANL2431-12" 的重复记录。 使用 bash 脚本,我提取了唯一标识符并将重复的标识符存储在变量“$duplicate_headers”中。目前,我正试图找到他们的 two-line 记录的任何重复实例并按如下方式删除它们:

for i in "$@"
do
    unset duplicate_headers
    duplicate_headers=`grep ">"  | awk 'BEGIN { FS="|"}; {print  "\n"; }' | sort | uniq -d`
    for header in `echo -e "${duplicate_headers}"`
    do
        sed -i "/^.*\b${header}\b.*$/,+1 2d" $i
        #sed -i "s/^.*\b${header}\b.*$//,+1 2g" $i
        #sed -i "/^.*\b${header}\b.*$/{$!N; s/.*//2g; }" $i
    done
done

最终结果(考虑到数千条记录)将如下所示:

>ACML500-12|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_-2  
----TAAGATTTTGACTTCTTCCCCCATCATCAAGAAGAATTGT-------  
>ACRJP458-10|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_N  
-----------TCCCTTTAATACTAGGAGCCCCTGACATAGCCTTTCCTAAATAAT-----  
>ASILO303-17|Dip|gs-Par|sp-Par vid|subsp-NA|co  
-------TAAGATTCTGATTACTCCCCCCCTCTCTAACTCTTCTTCTTCTATAGTAGATG  
>ASILO326-17|Dip|gs-Goe|sp-Goe par|subsp-NA|c  
TAAGATTTTGATTATTACCCCCTTCATTAACCAGGAACAGGATGA---------------  
>CLT100-09|Lep|gs-Col|sp-Col elg|subsp-NA|co-Buru  
AACATTATATTTGGAATTT-------GATCAGGAATAGTCGGAACTTCTCTGAA------  
>PMANL2431-12|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_  
----ATGCCTATTATAATTGGAGGATTTGGAAAACCTTTAATATT----CCGAAT  
>STBOD057-09|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_N  
ATCTAATATTGCACATAGAGGAACCTCNGTATTTTTTCTCTCCATCT------TTAG  
>TBBUT582-11|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_N  
-----CCCCCTCATTAACATTACTAAGTTGAAAATGGAGCAGGAACAGGATGA  
>TBBUT583-11|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_N  
TAAGATTTTGACTCATTAA----------------AATGGAGCAGGAACAGGATGA  
>AFBTB001-09|Col|gs-NA|sp-NA|subsp-NA|co-Ethi|site-NA|lat_N  
TAAGCTCCATCC-------------TAGAAAGAGGGG---------GGGTGA  
>AFBTB003-09|Col|gs-NA|sp-NA|subsp-NA|co-Ethi|site-NA|lat_N  
TAAGATTTTGACTTCTGC------CATGAGAAAGA-------------AGGGTGA  
>AFBTB002-09|Cole|gs-NA|sp-NA|subsp-NA|co-Ethi|site-NA|lat_N  
-------TCTTCTGCTCAT-------GGGGCAGGAACAGGG----------TGA
$ awk -F'[|]' 'NR%2{f=seen[]++} !f' file
>ACML500-12|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_-2
----TAAGATTTTGACTTCTTCCCCCATCATCAAGAAGAATTGT-------
>ACRJP458-10|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_N
-----------TCCCTTTAATACTAGGAGCCCCTGACATAGCCTTTCCTAAATAAT-----
>ASILO303-17|Dip|gs-Par|sp-Par vid|subsp-NA|co
-------TAAGATTCTGATTACTCCCCCCCTCTCTAACTCTTCTTCTTCTATAGTAGATG
>ASILO326-17|Dip|gs-Goe|sp-Goe par|subsp-NA|c
TAAGATTTTGATTATTACCCCCTTCATTAACCAGGAACAGGATGA---------------
>CLT100-09|Lep|gs-Col|sp-Col elg|subsp-NA|co-Buru
AACATTATATTTGGAATTT-------GATCAGGAATAGTCGGAACTTCTCTGAA------
>PMANL2431-12|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_
----ATGCCTATTATAATTGGAGGATTTGGAAAACCTTTAATATT----CCGAAT
>STBOD057-09|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_N
ATCTAATATTGCACATAGAGGAACCTCNGTATTTTTTCTCTCCATCT------TTAG
>TBBUT582-11|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_N
-----CCCCCTCATTAACATTACTAAGTTGAAAATGGAGCAGGAACAGGATGA
>TBBUT583-11|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_N
TAAGATTTTGACTCATTAA----------------AATGGAGCAGGAACAGGATGA
>AFBTB001-09|Col|gs-NA|sp-NA|subsp-NA|co-Ethi|site-NA|lat_N
TAAGCTCCATCC-------------TAGAAAGAGGGG---------GGGTGA
>AFBTB003-09|Col|gs-NA|sp-NA|subsp-NA|co-Ethi|site-NA|lat_N
TAAGATTTTGACTTCTGC------CATGAGAAAGA-------------AGGGTGA
>AFBTB002-09|Cole|gs-NA|sp-NA|subsp-NA|co-Ethi|site-NA|lat_N
-------TCTTCTGCTCAT-------GGGGCAGGAACAGGG----------TGA

要运行一次在多个文件上删除所有文件中的重复项:

awk -F'[|]' 'FNR%2{f=seen[]++} !f' *

或仅删除每个文件中的重复项:

awk -F'[|]' 'FNR==1{delete seen} FNR%2{f=seen[]++} !f' *