
A biopython script to split a large fasta file into multiple ones

I am working with a large fasta file that I want to split into multiple files based on gene ID. I am trying to use the following script from the Biopython tutorial:

def batch_iterator(iterator, batch_size):
    """Returns lists of length batch_size.

    This can be used on any iterator, for example to batch up
    SeqRecord objects from Bio.SeqIO.parse(...), or to batch
    Alignment objects from Bio.AlignIO.parse(...), or simply
    lines from a file handle.

    This is a generator function, and it returns lists of the
    entries from the supplied iterator.  Each list will have
    batch_size entries, although the final list may be shorter.
    """
    entry = True  # Make sure we loop once
    while entry:
        batch = []
        while len(batch) < batch_size:
            try:
                entry = next(iterator)
            except StopIteration:
                entry = None
            if entry is None:
                # End of file
                break
            batch.append(entry)
        if batch:
            yield batch 

from Bio import SeqIO

record_iter = SeqIO.parse('/path/sorted_sequences.fa', 'fasta')
for i, batch in enumerate(batch_iterator(record_iter, 93)):
    filename = 'gene_%i.fasta' % (i + 1)
    with open('/path/files/' + filename, 'w') as output_handle:
        count = SeqIO.write(batch, output_handle, 'fasta')
    print('Wrote %i records to %s' % (count, filename))

It does split the file into groups of 93 sequences, but it gives 2 files for each group of 93. I cannot see the mistake, but I guess there is one. Is there another way to split a large fasta file? Thanks.

After reading the code in the example, it seems the iterator does not split the file by gene ID; it simply divides the sequences into groups of batch_size, so in your case each file gets 93 sequences.
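
If the goal really is one file per gene ID rather than fixed-size batches, a minimal sketch along these lines may help. It assumes the gene ID can be derived from each record's id (here, the part before the first '.', e.g. 'geneX.1' -> 'geneX'); adjust the parsing to match your actual fasta headers:

from Bio import SeqIO

# Group records by gene ID instead of by batch size.
# Assumption: the gene ID is the record id up to the first '.';
# change the split to match your headers.
groups = {}
for record in SeqIO.parse('/path/sorted_sequences.fa', 'fasta'):
    gene_id = record.id.split('.')[0]
    groups.setdefault(gene_id, []).append(record)

for gene_id, records in groups.items():
    filename = 'gene_%s.fasta' % gene_id
    with open('/path/files/' + filename, 'w') as output_handle:
        count = SeqIO.write(records, output_handle, 'fasta')
    print('Wrote %i records to %s' % (count, filename))

Note this holds all records in memory; since your input is already sorted, itertools.groupby over the parsed records would let you stream one gene at a time instead.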

In case anyone is interested in this script later: the script works perfectly as is. The problem was that the file I was trying to split had more sequences than it should have. So I deleted the bad file, generated a new one, and the script above split it nicely.
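
As a quick sanity check for this kind of problem, counting the records in the input before splitting is cheap (a sketch using the same path as above):

from Bio import SeqIO

# Count the sequences in the input to confirm it contains the
# expected number of records before splitting.
n_records = sum(1 for _ in SeqIO.parse('/path/sorted_sequences.fa', 'fasta'))
print('%i records in file' % n_records)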