Removing lines that match specific patterns from another file
I have two files (only the beginning of each is shown):
patterns.txt
m64071_201130_104452/13
m64071_201130_104452/26
m64071_201130_104452/46
m64071_201130_104452/49
m64071_201130_104452/113
m64071_201130_104452/147
myfile.txt
>m64071_201130_104452/13/ccs
ACAGTCGAGCG
>m64071_201130_104452/16/ccs
ACAGTCGAGCG
>m64071_201130_104452/20/ccs
CAGTCGAGCGC
>m64071_201130_104452/22/ccs
CACACATCTCG
>m64071_201130_104452/26/ccs
TAGACAATGTA
I should get output like this:
>m64071_201130_104452/13/ccs
ACAGTCGAGCG
>m64071_201130_104452/26/ccs
TAGACAATGTA
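I want to create a new file containing the lines of myfile.txt that match a line in patterns.txt. I also need to keep the ACTG letters (the sequence line) associated with each matching pattern. I am using: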
for i in $(cat patterns.txt); do
grep -A 1 $i myfile.txt; done > my_newfile.txt
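It works, but it is very slow at creating the new file, because every pattern triggers a separate grep over the whole of myfile.txt. The files I am working with are fairly large, though not huge (14M for patterns.txt and 700M for myfile.txt).
I also tried grep -v, because I have another file containing the other patterns of myfile.txt that are not present in patterns.txt. It has the same problem: the output file fills very slowly.
If you see a solution...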
With your shown samples, please try the following. Written and tested in GNU awk.
awk '
FNR==NR{
arr[$0]
next
}
/^>/{
found=0
match($0,/.*\//)
if((substr($0,RSTART+1,RLENGTH-2)) in arr){
print
found=1
}
next
}
found
' patterns.txt myfile.txt
Explanation: adding a detailed explanation of the above.
awk ' ##Starting awk program from here.
FNR==NR{ ##Checking condition which will be TRUE when patterns.txt is being read.
arr[$0] ##Creating array with index of current line.
next ##next will skip all further statements from here.
}
/^>/{ ##Checking condition if line starts from > then do following.
found=0 ##Unsetting found here.
match($0,/.*\//) ##Using match to match a regex up to the last / in current line.
if((substr($0,RSTART+1,RLENGTH-2)) in arr){ ##Checking condition if sub string of matched regex is present in arr then do following.
print ##Printing current line here.
found=1 ##Setting found to 1 here.
}
next ##next will skip all further statements from here.
}
found ##Printing the line if found is set.
' patterns.txt myfile.txt ##Mentioning Input_file names here.
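To check the program above against the sample from the question, here is a minimal sketch that recreates a few of the sample lines in a scratch directory and runs the same awk logic over them. The `workdir` and `result` names are assumptions made for illustration only; they are not part of the original answer.

```shell
#!/bin/sh
# Recreate part of the question's sample data in a scratch directory.
# (workdir and result are hypothetical names used only for this sketch.)
workdir=$(mktemp -d)

cat > "$workdir/patterns.txt" <<'EOF'
m64071_201130_104452/13
m64071_201130_104452/26
m64071_201130_104452/46
EOF

cat > "$workdir/myfile.txt" <<'EOF'
>m64071_201130_104452/13/ccs
ACAGTCGAGCG
>m64071_201130_104452/16/ccs
ACAGTCGAGCG
>m64071_201130_104452/20/ccs
CAGTCGAGCGC
>m64071_201130_104452/22/ccs
CACACATCTCG
>m64071_201130_104452/26/ccs
TAGACAATGTA
EOF

# Same logic as the answer: load the IDs once, then scan myfile.txt once.
result=$(awk '
FNR==NR { arr[$0]; next }              # first file: remember every ID
/^>/ {                                 # header line
  found=0
  match($0,/.*\//)                     # everything up to the last slash
  if ((substr($0,RSTART+1,RLENGTH-2)) in arr) { print; found=1 }
  next
}
found                                  # sequence lines of a matched header
' "$workdir/patterns.txt" "$workdir/myfile.txt")

printf '%s\n' "$result"
rm -rf "$workdir"
```

On this sample it prints the /13 and /26 records, each followed by its sequence line. Because each header costs one array lookup, myfile.txt is read once, whereas the original loop reads it once per pattern.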
Another awk:
$ awk -F/ ' # / delimiter
NR==FNR {
a[$1,$2] # hash patterns to a
next
}
{
if( tf=((substr($1,2),$2) in a) ) # if first part found in hash
print # output and store found result in var tf
if(getline && tf) # read next record and if previous record was found
print # output
}' patterns myfile
Output:
>m64071_201130_104452/13/ccs
ACAGTCGAGCG
>m64071_201130_104452/26/ccs
TAGACAATGTA
Edit: to output the records that are not found instead:
$ awk -F/ ' # / delimiter
NR==FNR {
a[$1,$2] # hash patterns to a
next
}
{
if( tf=((substr($1,2),$2) in a) ) { # if first part found in hash
getline # consume the next record too
next
}
print # otherwise output
}' patterns myfile
Output:
>m64071_201130_104452/16/ccs
ACAGTCGAGCG
>m64071_201130_104452/20/ccs
CAGTCGAGCGC
>m64071_201130_104452/22/ccs
CACACATCTCG
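Since the question asks for both the matching and the non-matching records, the two runs can in principle be combined into one pass. The following is only a sketch under assumptions: the output file names matched.txt and unmatched.txt and the awk variables `keep`, `drop` and `out` are made up for illustration, and the input is assumed to start with a header line.

```shell
#!/bin/sh
# Scratch copy of part of the question's sample data.
# (workdir, matched, unmatched are hypothetical names for this sketch.)
workdir=$(mktemp -d)

cat > "$workdir/patterns.txt" <<'EOF'
m64071_201130_104452/13
m64071_201130_104452/26
EOF

cat > "$workdir/myfile.txt" <<'EOF'
>m64071_201130_104452/13/ccs
ACAGTCGAGCG
>m64071_201130_104452/16/ccs
ACAGTCGAGCG
>m64071_201130_104452/20/ccs
CAGTCGAGCGC
>m64071_201130_104452/22/ccs
CACACATCTCG
>m64071_201130_104452/26/ccs
TAGACAATGTA
EOF

# keep/drop are assumed output paths passed in with -v.
awk -v keep="$workdir/matched.txt" -v drop="$workdir/unmatched.txt" '
FNR==NR { arr[$0]; next }              # first file: remember every ID
/^>/ {                                 # header line: choose the destination
  match($0,/.*\//)
  out = ((substr($0,RSTART+1,RLENGTH-2)) in arr) ? keep : drop
}
out != "" { print > out }              # header and its sequence lines go together
' "$workdir/patterns.txt" "$workdir/myfile.txt"

matched=$(cat "$workdir/matched.txt")
unmatched=$(cat "$workdir/unmatched.txt")
printf 'matched:\n%s\nunmatched:\n%s\n' "$matched" "$unmatched"
rm -rf "$workdir"
```

Each sequence line follows whichever file its header was sent to, so records with sequences spanning several lines would stay together as well. Note that awk creates an output file only when at least one record is written to it.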