A biopython script to split a large fasta file into multiple ones
I am working with a large fasta file that I want to split into multiple files based on gene ID. I am trying the following script from the biopython tutorial:
from Bio import SeqIO

def batch_iterator(iterator, batch_size):
    """Returns lists of length batch_size.

    This can be used on any iterator, for example to batch up
    SeqRecord objects from Bio.SeqIO.parse(...), or to batch
    Alignment objects from Bio.AlignIO.parse(...), or simply
    lines from a file handle.

    This is a generator function, and it returns lists of the
    entries from the supplied iterator. Each list will have
    batch_size entries, although the final list may be shorter.
    """
    entry = True  # Make sure we loop once
    while entry:
        batch = []
        while len(batch) < batch_size:
            try:
                entry = next(iterator)  # iterator.next() is Python 2 only
            except StopIteration:
                entry = None
            if entry is None:
                # End of file
                break
            batch.append(entry)
        if batch:
            yield batch

record_iter = SeqIO.parse(open('/path/sorted_sequences.fa'), 'fasta')
for i, batch in enumerate(batch_iterator(record_iter, 93)):
    filename = 'gene_%i.fasta' % (i + 1)
    with open('/path/files/' + filename, 'w') as output_handle:
        count = SeqIO.write(batch, output_handle, 'fasta')
    print('Wrote %i records to %s' % (count, filename))
It does split the file into groups of 93 sequences, but it gives 2 files per group of 93. I can't see the error, but I guess there is one.
Is there another way to split a large fasta file differently?
Thanks
After reading the code in the example, it looks like the iterator does not separate the files by gene id; it simply splits the sequences into groups of batch_size, so in your case each file gets 93 sequences.
In case anyone is interested in this script later: the script works perfectly as-is. The problem was that the file I was trying to split had more sequences than it should have. So I deleted the bad file and generated a new one, which the script above split nicely.