Select sequences in a fasta file with more than 300 aa in which "C" occurs at least 4 times
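I have a fasta file containing protein sequences. I want to select the sequences that are longer than 300 amino acids and in which the amino acid cysteine (C) occurs more than 4 times.
I used this command to select the sequences longer than 300 aa: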
cat 72hDOWN-fasta.fasta | bioawk -c fastx 'length($seq) > 300{ print ">"$name; print $seq }'
An example sequence:
>jgi|Triasp1|216614|CE216613_3477
MPSLYLTSALGLLSLLPAAQAGWNPNSKDNIVVYWGQDAGSIGQNRLSYYCENAPDVDVI
NISFLVGITDLNLNLANVGNNCTAFAQDPNLLDCPQVAADIVECQQTYGKTIMMSLFGST
YTESGFSSSSTAVSAAQEIWAMFGPVQSGNSTPRPFGNAVIDGFDFDLEDPIENNMEPFA
AELRSLTSAATSKKFYLSAAPQCVYPDASDESFLQGEVAFDWLNIQFYNNGCGTSYYPSG
YNYATWDNWAKTVSANPNTKLLVGTPASVHAVNFANYFPTNDQLAGAISSSKSYDSFAGV
MLWDMAQLFGNPGYLDLIVADLGGASTPPPPASTTLSTVTRSSTASTGPTSPPPSGGSVP
QWGQCGGQGYTGPTQCQSPYTCVVESQWWSSCQ*
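I don't know bioawk, but I assume it is the same as awk with some initial parsing and constant definitions.
I would proceed as follows. Suppose you want to find the strings that contain the letter C more than 4 times and are longer than 300 characters (use >= 4 instead of > 4 if you mean "at least 4 times"); then you can do: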
bioawk -c fastx '
(length($seq) > 300) && (gsub("C","C",$seq)>4) {
print ">"$name; print $seq
}' 72hDOWN-fasta.fasta
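But this assumes that seq is the complete character sequence.
The idea behind it is as follows. The gsub command performs substitutions in a string and returns the total number of substitutions it made. So if we replace every character "C" with "C", we do not actually change the string, but we do get the total number of "C" characters in it.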
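The counting trick can be tried in isolation with plain awk (not bioawk). This is only a minimal sketch; count_c is a hypothetical helper name introduced for illustration, not part of any tool.

```shell
# Count how many times "C" occurs in a string by replacing "C" with "C".
# count_c is a hypothetical helper name used only for this illustration.
count_c() {
    printf '%s\n' "$1" | awk '{
        s = $0                    # work on a copy of the record
        n = gsub(/C/, "C", s)     # gsub returns the number of substitutions
        print n
    }'
}

count_c "MCCAC"   # prints 3
```

Because gsub needs an lvalue as its third argument, the sketch copies the record into a variable first; passing $0 or a field directly works as well.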
From the POSIX standard IEEE Std 1003.1-2017:
gsub(ere, repl[, in])
: Behave like sub (see below), except that it shall replace all occurrences of the regular expression (like the ed utility global substitute) in $0 or in the in argument, when specified.
sub(ere, repl[, in ])
: Substitute the string repl in place of the first instance of the extended regular expression ere in string in and return the number of substitutions. An <ampersand> ( & ) appearing in the string repl shall be replaced by the string from in that matches the ERE. An <ampersand> preceded with a <backslash> shall be interpreted as the literal <ampersand> character. An occurrence of two consecutive <backslash> characters shall be interpreted as just a single literal <backslash> character. Any other occurrence of a <backslash> (for example, preceding any other character) shall be treated as a literal <backslash> character. Note that if repl is a string literal (the lexical token STRING; see Grammar), the handling of the <ampersand> character occurs after any lexical processing, including any lexical <backslash>-escape sequence processing. If in is specified and it is not an lvalue (see Expressions in awk), the behavior is undefined. If in is omitted, awk shall use the current record ($0) in its place.
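The <ampersand> rule quoted above can be seen with a small plain awk sketch (again not bioawk). mark_c is a hypothetical helper name; it wraps every matched "C" in brackets and prints the substitution count next to the modified record.

```shell
# Show how "&" in the replacement stands for the matched text.
# mark_c is a hypothetical helper name used only for this illustration.
mark_c() {
    printf '%s\n' "$1" | awk '{
        n = gsub(/C/, "[&]")      # no third argument: operates on $0
        print n, $0
    }'
}

mark_c "MCAC"   # prints: 2 M[C]A[C]
```

It follows that gsub(/C/, "&") would count the matches just like gsub("C", "C") does, since each match is replaced by itself.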
Note: BioAwk is based on Brian Kernighan's awk, which is documented in "The AWK Programming Language" by Al Aho, Brian Kernighan and Peter Weinberger (Addison-Wesley, 1988, ISBN 0-201-07981-X). I am not sure whether this version is POSIX compatible.
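The answer above relies on seq holding the complete sequence. If that cannot be relied on, or if only plain awk is available, the same filter can be sketched by joining the lines of each record by hand. This is a rough sketch under assumptions: filter_fasta and emit are hypothetical names, the thresholds (> 300 and > 4) are copied from the command above, and the whole header line is printed rather than only the name.

```shell
# Plain-awk sketch: keep records longer than 300 characters with more than 4 "C".
# filter_fasta and emit are hypothetical names used only for this illustration.
filter_fasta() {
    awk '
    # Print the buffered record if it passes both tests.
    function emit() {
        if (name != "" && length(seq) > 300 && gsub(/C/, "C", seq) > 4) {
            print name
            print seq
        }
    }
    /^>/ { emit(); name = $0; seq = ""; next }   # header line starts a new record
         { seq = seq $0 }                        # join wrapped sequence lines
    END  { emit() }                              # do not forget the last record
    ' "$1"
}

# Usage (file name taken from the question):
# filter_fasta 72hDOWN-fasta.fasta
```

Note that a trailing "*" such as the one in the example sequence is counted by length() here; strip it first if it should not count toward the 300.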