按 bash 中指定的序列长度过滤掉 FASTA 文件
Filter out FASTA files by specified sequence length in bash
有一个包含重叠群名称和相应序列的 FASTA 文件 assembly.fasta
:
>contig_1
CCAATACGGGCGCGCAGGCTTTCTATCGCGCGGCCGGCTTCGTCGAGGACGGGCGGCGCA
AGGATTACTACCGCAGCGGC
>contig_2
ATATAAACCTTATTCATCGTTTTCAGCCTAATTTTCCATTTAACAGGGATGATTTTCGTC
AAAATGCTGAGGCTTTACCAAGATTTTCTACCTTGCACCTTCAGAAAAAAATCATGGCAT
TTATAGACGAAATTCTCGAGAAA
>contig_3
CGTGATCTCGCCATTCGTGCCG
我只想获取长度超过 30 个字母的重叠群并获取一个新的 FASTA 文件 assembly.filtered.fasta
只包含那些具有重叠群名称的长序列,格式如下:
>contig_1
CCAATACGGGCGCGCAGGCTTTCTATCGCGCGGCCGGCTTCGTCGAGGACGGGCGGCGCA
AGGATTACTACCGCAGCGGC
>contig_2
ATATAAACCTTATTCATCGTTTTCAGCCTAATTTTCCATTTAACAGGGATGATTTTCGTC
AAAATGCTGAGGCTTTACCAAGATTTTCTACCTTGCACCTTCAGAAAAAAATCATGGCAT
TTATAGACGAAATTCTCGAGAAA
能否请您尝试按照显示的示例进行测试和编写。
awk '
/^>/{
if(sign_val && strLen>=30){
print sign_val ORS line
}
strLen=line=""
sign_val=[=10=]
next
}
{
strLen+=length([=10=])
line=(line?line ORS:"")[=10=]
}
END{
if(sign_val && strLen>=30){
print sign_val ORS line
}
}
' Input_file
解释:为以上添加详细解释。
awk ' ##Starting awk program from here.
/^>/{ ##Checking condition if line starts from > then do following.
if(sign_val && strLen>=30){ ##Checking if sign_val is SET and steLen is SET then do following.
print sign_val ORS line ##Printing sign_val ORS and line here.
}
strLen=line="" ##Nullify variables steLen and line here.
sign_val=[=11=] ##Setting sign_val to current line here.
next ##next will skip all further statements from here.
}
{
strLen+=length([=11=]) ##Checking length of line and keep adding it here.
line=(line?line ORS:"")[=11=] ##Creating line variable and keep appending it to it with new line.
}
END{ ##Starting END block of this program from here.
if(sign_val && strLen>=30){ ##Checking if sign_val is SET and steLen is SET then do following.
print sign_val ORS line ##Printing sign_val ORS and line here.
}
}
' Input_file ##mentioning Input_file name here.
使用 gnu-awk
,您可以使用这个更简单的版本:
awk -v RS='>[^\n]+\n' 'length() >= 30 {printf "%s", prt [=10=]} {prt = RT}' file
>contig_1
CCAATACGGGCGCGCAGGCTTTCTATCGCGCGGCCGGCTTCGTCGAGGACGGGCGGCGCA
AGGATTACTACCGCAGCGGC
>contig_2
ATATAAACCTTATTCATCGTTTTCAGCCTAATTTTCCATTTAACAGGGATGATTTTCGTC
AAAATGCTGAGGCTTTACCAAGATTTTCTACCTTGCACCTTCAGAAAAAAATCATGGCAT
TTATAGACGAAATTCTCGAGAAA
实现您所追求的目标的一种非常快速的方法是:
awk -v n=30 '/^>/{ if(l>n) print b; b=[=10=];l=0;next }
{l+=length;b=b ORS [=10=]}END{if(l>n) print b }' file
您可能也对 BioAwk 感兴趣,它是 awk 的改编版本,可以处理 FASTA 文件
bioawk -c fastx -v '(length($seq)>30){print ">" $name ORS $seq}' file.fasta
注意: BioAwk 基于 Brian Kernighan 的 awk,它记录在 "The AWK Programming Language",
作者:Al Aho、Brian Kernighan 和 Peter Weinberger
(Addison-Wesley, 1988, ISBN 0-201-07981-X)
。我不确定这个版本是否与 POSIX.
兼容
有一个包含重叠群名称和相应序列的 FASTA 文件 assembly.fasta
:
>contig_1
CCAATACGGGCGCGCAGGCTTTCTATCGCGCGGCCGGCTTCGTCGAGGACGGGCGGCGCA
AGGATTACTACCGCAGCGGC
>contig_2
ATATAAACCTTATTCATCGTTTTCAGCCTAATTTTCCATTTAACAGGGATGATTTTCGTC
AAAATGCTGAGGCTTTACCAAGATTTTCTACCTTGCACCTTCAGAAAAAAATCATGGCAT
TTATAGACGAAATTCTCGAGAAA
>contig_3
CGTGATCTCGCCATTCGTGCCG
我只想获取长度超过 30 个字母的重叠群并获取一个新的 FASTA 文件 assembly.filtered.fasta
只包含那些具有重叠群名称的长序列,格式如下:
>contig_1
CCAATACGGGCGCGCAGGCTTTCTATCGCGCGGCCGGCTTCGTCGAGGACGGGCGGCGCA
AGGATTACTACCGCAGCGGC
>contig_2
ATATAAACCTTATTCATCGTTTTCAGCCTAATTTTCCATTTAACAGGGATGATTTTCGTC
AAAATGCTGAGGCTTTACCAAGATTTTCTACCTTGCACCTTCAGAAAAAAATCATGGCAT
TTATAGACGAAATTCTCGAGAAA
能否请您尝试按照显示的示例进行测试和编写。
awk '
/^>/{
if(sign_val && strLen>=30){
print sign_val ORS line
}
strLen=line=""
sign_val=[=10=]
next
}
{
strLen+=length([=10=])
line=(line?line ORS:"")[=10=]
}
END{
if(sign_val && strLen>=30){
print sign_val ORS line
}
}
' Input_file
解释:为以上添加详细解释。
awk ' ##Starting awk program from here.
/^>/{ ##Checking condition if line starts from > then do following.
if(sign_val && strLen>=30){ ##Checking if sign_val is SET and steLen is SET then do following.
print sign_val ORS line ##Printing sign_val ORS and line here.
}
strLen=line="" ##Nullify variables steLen and line here.
sign_val=[=11=] ##Setting sign_val to current line here.
next ##next will skip all further statements from here.
}
{
strLen+=length([=11=]) ##Checking length of line and keep adding it here.
line=(line?line ORS:"")[=11=] ##Creating line variable and keep appending it to it with new line.
}
END{ ##Starting END block of this program from here.
if(sign_val && strLen>=30){ ##Checking if sign_val is SET and steLen is SET then do following.
print sign_val ORS line ##Printing sign_val ORS and line here.
}
}
' Input_file ##mentioning Input_file name here.
使用 gnu-awk
,您可以使用这个更简单的版本:
awk -v RS='>[^\n]+\n' 'length() >= 30 {printf "%s", prt [=10=]} {prt = RT}' file
>contig_1
CCAATACGGGCGCGCAGGCTTTCTATCGCGCGGCCGGCTTCGTCGAGGACGGGCGGCGCA
AGGATTACTACCGCAGCGGC
>contig_2
ATATAAACCTTATTCATCGTTTTCAGCCTAATTTTCCATTTAACAGGGATGATTTTCGTC
AAAATGCTGAGGCTTTACCAAGATTTTCTACCTTGCACCTTCAGAAAAAAATCATGGCAT
TTATAGACGAAATTCTCGAGAAA
实现您所追求的目标的一种非常快速的方法是:
awk -v n=30 '/^>/{ if(l>n) print b; b=[=10=];l=0;next }
{l+=length;b=b ORS [=10=]}END{if(l>n) print b }' file
您可能也对 BioAwk 感兴趣,它是 awk 的改编版本,可以处理 FASTA 文件
bioawk -c fastx -v '(length($seq)>30){print ">" $name ORS $seq}' file.fasta
注意: BioAwk 基于 Brian Kernighan 的 awk,它记录在 "The AWK Programming Language", 作者:Al Aho、Brian Kernighan 和 Peter Weinberger (Addison-Wesley, 1988, ISBN 0-201-07981-X) 。我不确定这个版本是否与 POSIX.
兼容