How to split a multi-fasta file into chunks of equal sequence length AND change the headers using biopython
First off, I apologize for my pythonic ignorance. I need to break up my multi-sequence fasta file into equally sized chunks for a downstream pipeline. I haven't run across anything that does this easily or in the format I'm looking for.
An example fasta file input:
original.fas
>contig1
ACGTA
>contig2
GGGATAGTCA
>contig3
GACTACTTTT
The example fasta above is 25 bp in total. If I set the "chunk number" parameter to "4", then I want my output files to each contain 7 base pairs, except the last file, which gets the remaining 4 bp. It would look like this:
chunk1.fas
>contig1:0-4
ACGTA
>contig2:0-1
GG
chunk2.fas
>contig2:2-8
GATAGTC
chunk3.fas
>contig2:9-9
A
>contig3:0-5
GACTAC
chunk4.fas
>contig3:6-9
TTTT
Notice that each resulting chunk*.fas includes 7 base pairs, except chunk4.fas, which holds the remaining base pairs. Also, each resulting sequence header in the chunk files differs from the original, in that it includes a ":" plus "start" and "stop" positions derived from the original sequence.
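For concreteness, the chunk size is just the ceiling of the total length divided by the chunk number. A minimal sketch of that arithmetic (the helper name ceil_chunk_size is made up for illustration):

# Hypothetical helper showing the chunk-size arithmetic from the example:
# 25 total bases into 4 chunks -> ceil(25 / 4) = 7 bases per chunk,
# leaving the last chunk with the remaining 25 - 3 * 7 = 4 bases.
def ceil_chunk_size(total_bases, chunk_number):
    # Ceiling division using only integer arithmetic.
    return -(-total_bases // chunk_number)

print(ceil_chunk_size(25, 4))  # 7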
The biopython cookbook has a pretty nice batch-size iterator tool, and I think my answer lies in adapting this code, but I have no idea how to go about it.
Any help is appreciated. Cheers.
from Bio import SeqIO

def batch_iterator(iterator, batch_size):
    """Returns lists of length batch_size.

    This can be used on any iterator, for example to batch up
    SeqRecord objects from Bio.SeqIO.parse(...), or to batch
    Alignment objects from Bio.AlignIO.parse(...), or simply
    lines from a file handle.

    This is a generator function, and it returns lists of the
    entries from the supplied iterator. Each list will have
    batch_size entries, although the final list may be shorter.
    """
    entry = True  # Make sure we loop once
    while entry:
        batch = []
        while len(batch) < batch_size:
            try:
                entry = next(iterator)
            except StopIteration:
                entry = False
            if not entry:
                # End of file
                break
            batch.append(entry)
        if batch:
            yield batch

record_iter = SeqIO.parse('aVan.fa', 'fasta')
for i, batch in enumerate(batch_iterator(record_iter, 1000), start=1):
    filename = 'group_{}.fasta'.format(i)
    count = SeqIO.write(batch, filename, 'fasta')
    print('Wrote {} records to {}'.format(count, filename))
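As far as I can tell, batch_iterator groups a fixed number of records per file, not a fixed number of bases, so on its own it won't produce the base-balanced chunks above. The missing building block seems to be SeqRecord slicing, which Biopython supports directly. A minimal sketch (file name and coordinates taken from the example above):

from Bio import SeqIO

# Slicing a SeqRecord returns a new SeqRecord covering that sub-sequence;
# this is the primitive needed for base-level chunking.
record = next(SeqIO.parse("original.fas", "fasta"))  # >contig1 / ACGTA
piece = record[0:5]  # bases 0 through 4
piece.id = piece.name = "{}:{}-{}".format(record.id, 0, len(piece) - 1)
piece.description = ""
print(piece.format("fasta"))  # >contig1:0-4 / ACGTA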
This is not an easy task, but take a look at this implementation:
from Bio import SeqIO
from Bio.SeqRecord import SeqRecord

chunk_number = 4
records = list(SeqIO.parse("input.fasta", "fasta"))
# Ceiling-style division: each chunk holds at most this many bases.
chunk_size = sum(len(r) for r in records) // chunk_number + 1

def create_batch(records, chunk_size):
    record_it = iter(records)

    record = next(record_it)
    current_base = 0

    batch = []
    batch_size = 0

    # While there are new records, keep creating new batches.
    while record:
        # Loop over records until the batch is full (or there are no new records).
        while batch_size != chunk_size and record:
            end = current_base + chunk_size - batch_size
            seq = record[current_base:end]

            end_of_slice = current_base + len(seq) - 1
            fasta_header = record.id + ":{}-{}".format(current_base, end_of_slice)

            seq.id = seq.name = fasta_header
            seq.description = ''
            batch.append(seq)

            current_base += len(seq)
            batch_size += len(seq)

            # Current record is exhausted, get a new one.
            if current_base >= len(record):
                record = next(record_it, None)
                current_base = 0

        # We have a batch with the correct size (or there are no new batches).
        yield batch
        batch = []
        batch_size = 0

for i, batch in enumerate(create_batch(records, chunk_size)):
    filename = "chunk{}.fasta".format(i)
    SeqIO.write(batch, filename, "fasta")
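With the example original.fas from the question saved as input.fasta, this writes chunk0.fasta through chunk3.fasta; pass start=1 to enumerate if you prefer numbering to begin at chunk1. A quick sanity check that no bases are lost, reusing records from the code above:

import glob

# Sanity check: the chunk files together should contain every base exactly once.
total_in = sum(len(r) for r in records)
total_out = sum(len(r)
                for f in sorted(glob.glob("chunk*.fasta"))
                for r in SeqIO.parse(f, "fasta"))
assert total_in == total_out, (total_in, total_out)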
Nowadays there is also a good tool available online for this: FASTA splitter :)