Select sequences in a fasta file with more than 300 aa and "C" occurs at least 4 times

Question

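I have a fasta file containing protein sequences. I want to select the sequences that are longer than 300 amino acids and in which the amino acid cysteine (C) occurs more than 4 times.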

I use this command to select the sequences with more than 300 aa:

 cat 72hDOWN-fasta.fasta | bioawk -c fastx 'length($seq) > 300{ print ">"$name; print $seq }'

An example sequence:

  >jgi|Triasp1|216614|CE216613_3477
 MPSLYLTSALGLLSLLPAAQAGWNPNSKDNIVVYWGQDAGSIGQNRLSYYCENAPDVDVI
 NISFLVGITDLNLNLANVGNNCTAFAQDPNLLDCPQVAADIVECQQTYGKTIMMSLFGST
 YTESGFSSSSTAVSAAQEIWAMFGPVQSGNSTPRPFGNAVIDGFDFDLEDPIENNMEPFA
 AELRSLTSAATSKKFYLSAAPQCVYPDASDESFLQGEVAFDWLNIQFYNNGCGTSYYPSG
 YNYATWDNWAKTVSANPNTKLLVGTPASVHAVNFANYFPTNDQLAGAISSSKSYDSFAGV
 MLWDMAQLFGNPGYLDLIVADLGGASTPPPPASTTLSTVTRSSTASTGPTSPPPSGGSVP
 QWGQCGGQGYTGPTQCQSPYTCVVESQWWSSCQ*

Answer 1

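I don't know bioawk, but I assume it is the same as awk with some initial parsing and constant definitions.

I would proceed as follows. Assuming you want the sequences in which the letter C occurs more than 4 times and whose length exceeds 300, you could do: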

bioawk -c fastx '
   (length($seq) > 300) && (gsub("C","C",$seq)>4) {
       print ">"$name; print $seq
   }' 72hDOWN-fasta.fasta

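This assumes, however, that seq holds the complete sequence as a single string.

The idea behind it is the following. The gsub command performs substitutions in a string and returns the total number of substitutions it made. So if we replace every character "C" with "C", the string is left unchanged, but the return value is the total number of "C" characters in it.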
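The counting trick can be tried on a short toy string with plain awk, without bioawk. This is only a minimal sketch: `count_c` is a hypothetical helper name and the input string is made up for illustration.

```shell
# count_c (hypothetical helper): print how many times "C" occurs in its argument.
count_c() {
    # gsub replaces each "C" in the record ($0) with "C" and returns
    # the number of replacements, i.e. the number of "C" characters.
    printf '%s\n' "$1" | awk '{ n = gsub(/C/, "C"); print n }'
}

count_c "MCCAACQC"   # prints 4
```

Because the replacement is identical to the match, the record itself is left as it was; only the return value is of interest.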

From the POSIX standard IEEE Std 1003.1-2017:

gsub(ere, repl[, in]): Behave like sub (see below), except that it shall replace all occurrences of the regular expression (like the ed utility global substitute) in $0 or in the in argument, when specified.

sub(ere, repl[, in ]): Substitute the string repl in place of the first instance of the extended regular expression ere in string in and return the number of substitutions. An <ampersand> ( & ) appearing in the string repl shall be replaced by the string from in that matches the ERE. An <ampersand> preceded with a <backslash> shall be interpreted as the literal <ampersand> character. An occurrence of two consecutive <backslash> characters shall be interpreted as just a single literal <backslash> character. Any other occurrence of a <backslash> (for example, preceding any other character) shall be treated as a literal <backslash> character. Note that if repl is a string literal (the lexical token STRING; see Grammar), the handling of the <ampersand> character occurs after any lexical processing, including any lexical <backslash>-escape sequence processing. If in is specified and it is not an lvalue (see Expressions in awk), the behavior is undefined. If in is omitted, awk shall use the current record ($0) in its place.

Note: BioAwk is based on Brian Kernighan's awk, which is documented in "The AWK Programming Language" by Al Aho, Brian Kernighan, and Peter Weinberger (Addison-Wesley, 1988, ISBN 0-201-07981-X). I am not sure whether this version is POSIX compatible.
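If POSIX behaviour of bioawk is in doubt, the same filter can be sketched in plain awk, which also covers the case where a sequence is wrapped over several lines. This is a sketch under assumptions, not a tested replacement for the bioawk command: `filter_fasta`, `minlen` and `minc` are names invented here, and the thresholds mirror the "more than 300" and "more than 4" used above.

```shell
# filter_fasta (hypothetical helper): print the FASTA records whose joined
# sequence is longer than minlen and contains more than minc "C" characters.
filter_fasta() {
    awk -v minlen=300 -v minc=4 '
        # Test the record collected so far and print it if it qualifies.
        function flush(   n) {
            if (name == "") return
            n = gsub(/C/, "C", seq)          # number of "C" in the sequence
            if (length(seq) > minlen && n > minc) {
                print name
                print seq
            }
        }
        /^>/ { flush(); name = $0; seq = ""; next }   # header starts a new record
             { seq = seq $0 }                         # join wrapped sequence lines
        END  { flush() }                              # do not forget the last record
    ' "$@"
}
```

It would be called as `filter_fasta 72hDOWN-fasta.fasta`. Change `n > minc` to `n >= minc` if "at least 4" is what is wanted. Note that a trailing `*`, as in the example sequence, is counted toward the length here.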

linux

awk

sequences

bioinformatics

fasta