I have a FASTA file with millions of sequences. I want to extract only the sequences whose names match entries in a .txt file. How can I go about doing this?

I have been working through a FASTA file of ~1.5M reads ('V1_6D_contigs_5kbp.fa') to determine which reads may be of 'viral' origin. The reads in this file are named Vx_Cz, where x is 1-6 depending on which trial group the read came from and z is the contig number/name from 1 to ~1.5M, e.g. V1_C10810 or V3_C587937...

Through a separate bioinformatics pipeline, I have generated a .txt file containing a list of the contig names (2699 of them) predicted (<0.05) to be viral. I now need to use this list of predicted contigs to extract those contigs and produce a new FASTA file containing only them.

The idea behind my code is that it opens the .txt file (the names of the significant contigs) and the original FASTA file, iterates over each line of the .txt file, and assigns that line (a contig name) to a variable. It should then iterate over the original FASTA file containing all of the sequence information, and whenever a contig name matches record.id (the contig name from the original file), it should write the full record out to a new file.

I think I'm close, but my current attempts only seem to run one loop or the other, rather than both as I would expect.

Please see the code below. After each version I've added a note describing what went wrong when I ran it.

I'm using Python, with the SeqIO module from Biopython.

#!/usr/bin/env python

from Bio import SeqIO

Lines = []
Counter = 0 

With open(‘VIBRANT_contigs.txt’) as f:
 lines = f.readlines()
 original_file = r"/home/ggb_joshcole/Documents/VxCx_deduped/V1_6D_contigs_5kbp.fa"
 corrected_file = r"/home/ggb_joshcole/Documents/output.fna"
  with open(original_file) as original, open(corrected_file,'w') as corrected:
   records = SeqIO.parse(original_file,'fasta')
    for record in records: 
     id = record.id
     print(“this is the id”, id)
     for line in lines:
      contig = line
      print(“This is the contig”, contig)
       if contig == id:
        SeqIO.write(record, corrected, ‘fasta’)  
         Counter = (counter + 1)

This only prints the "this is the id" lines with each id.

#!/usr/bin/env python

Lines = []
With open(‘VIBRANT_contigs.txt’) as f:
 lines = f.readlines()
     for line in lines:
      contig = line
      print(contig)

This works and prints the list of contigs from VIBRANT_contigs.txt.

>>> from Bio import SeqIO
>>> lines = []
>>> with open('VIBRANT_contigs.txt') as f:
...  lines = f.readlines()
...  original_file = r"/home/ggb_joshcole/Documents/VxCx_deduped/V1_6D_contigs_5kbp.fa"
...  corrected_file = r"/home/ggb_joshcole/Documents/output.fna"
...  with open(original_file) as original, open(corrected_file,'w') as corrected:
...   records = SeqIO.parse(original_file,'fasta')
...   for line in lines:
...    contig = line
...    for record in records:
...     id = record.id
...     if contig == id:
...      print("we got a hit")

This prints nothing at all.


>>> from Bio import SeqIO
>>> lines = []
>>> with open('VIBRANT_contigs.txt') as f:
...  lines = f.readlines()
...  original_file = r"/home/ggb_joshcole/Documents/VxCx_deduped/V1_6D_contigs_5kbp.fa"
...  corrected_file = r"/home/ggb_joshcole/Documents/output.fna"
...  with open(original_file) as original, open(corrected_file,'w') as corrected:
...   records = SeqIO.parse(original_file,'fasta')
...   for line in lines:
...    contig = line
...    print("this is the contig", contig)
...    for record in records:
...     id = record.id
...     print("this is the id", id)
...     if contig == id:
...      print("we got a hit")

This prints "this is the contig V1_C1005406" (the first contig in the list), then goes on to print all of the IDs, and then prints all of the remaining contigs.

Any help would be greatly appreciated, as I'm quite stuck.

Thank you for your time.

Among the many typos, the main problem is that the lines you get from lines = f.readlines() still contain the trailing newline \n, so they never match the IDs coming from SeqIO. The solution is a simple strip() call:

from Bio import SeqIO

Lines = []
Counter = 0

with open("ids_to_extract.txt") as f:
    lines = f.readlines()
    original_file = r"input_sequences.fasta"
    corrected_file = r"extracted_sequences.fasta"
    with open(original_file) as original, open(corrected_file,'w') as corrected:
        records = SeqIO.parse(original_file,'fasta')
        for record in records:
            id = record.id
            print("this is the id", id)
            for line in lines:
                contig = line.strip() # <--- fixed here
                print("This is the contig", contig)
                if contig == id:
                    SeqIO.write(record, corrected, 'fasta')
                    Counter = (Counter + 1)
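
As a quick sanity check, you could print the counter afterwards and compare it against the number of unique IDs in the list; with the data described in the question that should come out around 2699:

# Both Counter and lines are still in scope after the with-blocks above
print("records written:", Counter)
print("ids requested:", len({line.strip() for line in lines}))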

The input data looks like this:

input_sequences.fasta:

>SRR10971381.1 1 length=151
CAAAGTCAAGCTCCCTTCTGCCTTTACACTCTTCGAGCGATTTCCGTCCGCTCTGAGGGAACCTTTGGGCGCCTCCGTTACTCTTTTGGAGGCGACCGCCCCAGTCAAACTGCCCGCCTAAGACTGTCCGGCCGGTCNTTACTCGGCNCGT
>SRR10971381.2 2 length=151
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
>SRR10971381.3 3 length=47
TACGATGTCGTTCCGGAAGGTAAGGGCGTGGATGAGACTAAGATCGG
>SRR10971381.4 4 length=149
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
[....]

ids_to_extract.txt (SeqIO splits the header at whitespace, so the ID only goes up to the first space; see the small example after the list):

SRR10971381.1
SRR10971381.2
SRR10971381.3
SRR10971381.4
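
To see this splitting for yourself, here is a minimal sketch using the first example header above; record.id holds only the part before the first space, while record.description keeps the whole header line:

from io import StringIO
from Bio import SeqIO

# A single record built from the example header shown above
example = ">SRR10971381.1 1 length=151\nCAAAGTCAAG\n"
record = next(SeqIO.parse(StringIO(example), "fasta"))

print(record.id)           # SRR10971381.1  <- this is what you match against
print(record.description)  # SRR10971381.1 1 length=151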

That said, this code is still quite inefficient: for every contig you iterate over every line of the file. It is easier to build a set of all the IDs you need and then simply check whether each record is in that set, which takes the average cost of each check from O(n) down to O(1) (reference):

from Bio import SeqIO

Counter = 0

# Pre-read all IDs to extract into a set
with open("ids_to_extract.txt") as f:
    ids_to_extract = {line.strip() for line in f}

original_file = r"input_sequences.fasta"
corrected_file = r"extracted_sequences.fasta"
with open(original_file) as original, open(corrected_file,'w') as corrected:
    records = SeqIO.parse(original_file,'fasta')
    for record in records:
        recordId = record.id
        # Now a simple `in` check
        if recordId in ids_to_extract:
            SeqIO.write(record, corrected, 'fasta')
            Counter = (Counter + 1)
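
If you prefer, the same set-based filter can be written a bit more compactly by handing a generator straight to SeqIO.write, which returns the number of records it wrote (a sketch along the same lines as the code above):

from Bio import SeqIO

with open("ids_to_extract.txt") as f:
    ids_to_extract = {line.strip() for line in f}

records = SeqIO.parse("input_sequences.fasta", "fasta")
wanted = (record for record in records if record.id in ids_to_extract)

# SeqIO.write accepts any iterator of records and returns how many were written
count = SeqIO.write(wanted, "extracted_sequences.fasta", "fasta")
print("Extracted", count, "sequences")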

Also note that there are several other quick and easy options for doing this:

# using seqkit
seqkit grep -f ids_to_extract.txt input_sequences.fasta > extracted_sequences.fasta
# using bioawk
bioawk -c fastx -v file="ids_to_extract.txt" 'BEGIN{while((getline k < file)>0)i[k]=1}{if(i[$name])print ">"$name"\n"$seq}' input_sequences.fasta > extracted_sequences.fasta
# using Heng Li's seqtk
seqtk subseq <original.fa> <contigs.txt> > out.fa