How to output only unique gene IDs?

Question

I am working on a project, using the following script (edited in nano):

from Bio import SeqIO
import sys
import re

fasta_file = sys.argv[1]
for myfile in SeqIO.parse(fasta_file, "fasta"):
    if len(myfile) > 250:
        gene_id = myfile.id
        mylist = re.match(r"H149xcV_\w+_\w+_\w+", gene_id)
        print(">" + mylist.group(0))

It gives the following output:

    >H149xcV_Fge342_r3_h2_d1
    >H149xcV_bTr423_r3_h2_d1
    >H149xcV_kN893_r3_h2_d1
    >H149xcV_DNp021_r3_h2_d1
    >H149xcV_JEP3324_r3_h2_d1
    >H149xcV_JEP3324_r3_h2_d1
    >H149xcV_JEP3324_r3_h2_d1
    >H149xcV_JEP3324_r3_h2_d1
    >H149xcV_SRt424234_r3_h2_d1
    >H149xcV_SRt424234_r3_h2_d1
    >H149xcV_SRt424234_r3_h2_d1
    >H149xcV_SRt424234_r3_h2_d1

How can I change my script so that it gives me only the UNIQUE IDs:

>H149xcV_Fge342_r3_h2
>H149xcV_bTr423_r3_h2
>H149xcV_kN893_r3_h2
>H149xcV_DNp021_r3_h2
>H149xcV_JEP3324_r3_h2
>H149xcV_SRt424234_r3_h2

Answer 1

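You can be explicit with a character class. \w+ matches [a-zA-Z0-9_], which includes the underscore, so a single \w+ can run across several underscore-separated fields, and it makes no difference how many \w+ you chain.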

H149xcV_[a-zA-Z0-9]+_[a-zA-Z0-9]+_[a-zA-Z0-9]+
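
A quick way to see the difference is to run both patterns against one ID from the question. This is only a minimal sketch; the variable names `loose` and `strict` are illustrative.

```python
import re

gene_id = "H149xcV_JEP3324_r3_h2_d1"

# \w also matches "_", so the original pattern runs on to the end of the ID.
loose = re.match(r"H149xcV_\w+_\w+_\w+", gene_id).group(0)

# With "_" excluded from the class, each field stops at the next underscore.
strict = re.match(
    r"H149xcV_[a-zA-Z0-9]+_[a-zA-Z0-9]+_[a-zA-Z0-9]+", gene_id
).group(0)

print(loose)   # H149xcV_JEP3324_r3_h2_d1
print(strict)  # H149xcV_JEP3324_r3_h2
```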


Try using a regex cheatsheet while developing regular expressions; it helps a lot.

A slightly more clever way:

(H149xcV(_[a-zA-Z0-9]+){3})

(                   start of group 1
H149xcV             match literal text
(                   start of sub-group 1
_                   match underscore
[a-zA-Z0-9]         a letter or digit
+                   one or more occurrences
)                   end of sub-group 1
{3}                 should repeat 3 times 
)                   end of group 1
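
The repeated-group pattern can be checked the same way. The sketch below also shows why the range has to be written `A-Z`: in ASCII, `A-z` covers the characters between `Z` and `a` as well, and the underscore is one of them, so the fields would no longer stop at underscores. The variable names are illustrative.

```python
import re

gene_id = "H149xcV_SRt424234_r3_h2_d1"

# Prefix followed by exactly three "_field" parts; group 1 is the whole match.
good = re.match(r"(H149xcV(_[a-zA-Z0-9]+){3})", gene_id).group(1)

# Pitfall: the range A-z also contains "_", so the fields can swallow
# underscores and the match stretches to the end of the ID.
bad = re.match(r"(H149xcV(_[a-zA-z0-9]+){3})", gene_id).group(1)

print(good)  # H149xcV_SRt424234_r3_h2
print(bad)   # H149xcV_SRt424234_r3_h2_d1
```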


Answer 2

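You can use a capture group and use that group in the output.

To prevent unnecessary backtracking, you can exclude the underscore from the word characters with a negated character class: [^\W_]+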

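(H149xcV_[^\W_]+_[^\W_]+_[^\W_]+)_[^\W_]+

# group 1 spans the prefix and the first three fields (e.g. Fge342_r3_h2);
# the trailing field (_d1) is matched but stays outside the group
m = re.match(r"(H149xcV_[^\W_]+_[^\W_]+_[^\W_]+)_[^\W_]+", gene_id)
print(">" + m.group(1))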
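
As a minimal sketch, the capture-group approach can be wrapped in a small helper. It assumes IDs shaped like those in the question (the prefix plus four underscore-separated fields), so the group spans the first three fields. The name `short_id` is hypothetical, and the helper returns None instead of raising when `re.match` finds nothing.

```python
import re

# Assumes IDs shaped like H149xcV_<name>_r3_h2_d1, as in the question.
PATTERN = re.compile(r"(H149xcV_[^\W_]+_[^\W_]+_[^\W_]+)_[^\W_]+")


def short_id(gene_id):
    """Return the ID without its last field, or None if it does not fit."""
    m = PATTERN.match(gene_id)
    return m.group(1) if m else None


print(short_id("H149xcV_kN893_r3_h2_d1"))  # H149xcV_kN893_r3_h2
print(short_id("unrelated_id"))            # None
```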

Answer 3

If you are only interested in part of a regex match, use a group to pick out that part:

from Bio import SeqIO
import sys
import re

fasta_file = sys.argv[1]
for myfile in SeqIO.parse(fasta_file, "fasta"):
    if len(myfile) > 250:
        gene_id = myfile.id
        # group 1 is everything before the last underscore-separated field
        m = re.match(r"(H149xcV_\w+_\w+)_\w+", gene_id)
        print(">" + m.group(1))

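This should give you the output you want.

You also asked about making sure there are no duplicates in the output. To do that, you need to keep track of what you have already written, which means it all ends up in memory anyway. If you are doing that regardless, you might as well build the collection in memory and write it out when you are done. This assumes your dataset is not too large to fit in memory.

A solution: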

from Bio import SeqIO
import sys
import re 

fasta_file = (sys.argv[1])
# by collecting results in a set, they are guaranteed to be unique
result = set()
for myfile in SeqIO.parse(fasta_file, "fasta"):
    if len(myfile) > 250:
        gene_id = myfile.id
        m = re.match(r"(H149xcV_\w+_\w+)_\w+", gene_id)
        if m is None:
            continue  # skip IDs that do not fit the pattern
        if m.group(1) not in result:
            print(">" + m.group(1))
        result.add(m.group(1))

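An alternative is to build result and print it only once you are done. The drawback is that the results are no longer in the original order, because a set is unordered, although it will be slightly faster (since you no longer need to check m.group(1) not in result for every record).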
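
The "collect first, print at the end" alternative might look like the sketch below. To keep it self-contained, a plain list of IDs stands in for the records that `SeqIO.parse` would yield (assumed sample data taken from the question), and the pattern assumes five-part IDs. The `dict.fromkeys` variant is not part of the answer above; it is included as one way to drop duplicates while keeping the original order.

```python
import re

# Stand-in for the record IDs that SeqIO.parse would yield (assumed sample data).
ids = [
    "H149xcV_Fge342_r3_h2_d1",
    "H149xcV_JEP3324_r3_h2_d1",
    "H149xcV_JEP3324_r3_h2_d1",
    "H149xcV_SRt424234_r3_h2_d1",
    "H149xcV_SRt424234_r3_h2_d1",
]

pattern = re.compile(r"(H149xcV_[^\W_]+_[^\W_]+_[^\W_]+)_[^\W_]+")
matches = (pattern.match(i) for i in ids)
short_ids = [m.group(1) for m in matches if m]

# Variant 1: a set removes duplicates but does not preserve order.
unordered = set(short_ids)

# Variant 2: dict keys are unique and keep insertion order (Python 3.7+).
ordered = list(dict.fromkeys(short_ids))

for gene in ordered:
    print(">" + gene)
```

If the order of the output does not matter, printing the elements of `unordered` is enough.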

Tags: python, nano, fasta