Python: How to print out sequences with length n from sliding window in FASTA file?

Question

I have a FASTA file containing a few sequences. I want to slide a window of size 5 along each sequence and extract the subsequence at every position as I scan.

For example (test1.fasta):
>human1
ATCGCGTC
>human2
ATTTTCGCGA

Expected output (test1_out.txt):
>human1
ATCGC
>human1
TCGCG
>human1
CGCGT
>human1
GCGTC
>human2
ATTTT
>human2
TTTTC
>human2
TTTCG
>human2
TTCGC
>human2
TCGCG
>human2
CGCGA

My code below only extracts the first five base pairs. How can I move the window so that it extracts 5 bp at every position, advancing by a step of 1?

from Bio import SeqIO

with open("test1_out.txt", "w") as f:
    for seq_record in SeqIO.parse("test1.fasta", "fasta"):
        f.write(str(seq_record.id) + "\n")
        f.write(str(seq_record.seq[:5]) + "\n")  # first 5 base positions only

I adapted the code above from other examples on Stack Overflow.

Answer 1

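I assume "seq_record.seq" holds the entire DNA sequence of a record such as human1. A sequence of length L contains L - 4 windows of size 5, so you can loop over the start positions like this: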

from Bio import SeqIO

with open("test1_out.txt", "w") as f:
    for seq_record in SeqIO.parse("test1.fasta", "fasta"):
        # One window per start position; the last full window starts at len - 5.
        for i in range(len(seq_record.seq) - 4):
            f.write(">" + str(seq_record.id) + "\n")
            f.write(str(seq_record.seq[i:i+5]) + "\n")  # 5-base window starting at position i
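
The same idea can be generalized to any window length n and step size. The following is only a minimal sketch in plain Python: the helper names sliding_windows and window_records are hypothetical (they are not part of Biopython), and the records are given as hard-coded (id, sequence) pairs taken from the question's test1.fasta so that the example runs without any input file.

```python
def sliding_windows(sequence, size=5, step=1):
    """Yield every full-length window of `size` characters from `sequence`.

    Hypothetical helper for illustration; not a Biopython function.
    """
    # The last valid start index is len(sequence) - size, so sequences
    # shorter than `size` produce no windows at all.
    for start in range(0, len(sequence) - size + 1, step):
        yield sequence[start:start + size]


def window_records(records, size=5, step=1):
    """Yield (record_id, window) pairs for an iterable of (id, sequence) pairs."""
    for record_id, sequence in records:
        for window in sliding_windows(sequence, size, step):
            yield record_id, window


# Records copied from the question's test1.fasta (assumed input for this sketch).
records = [("human1", "ATCGCGTC"), ("human2", "ATTTTCGCGA")]

lines = []
for record_id, window in window_records(records):
    lines.append(">" + record_id)  # repeat the FASTA header for each window
    lines.append(window)

print("\n".join(lines))
```

With Biopython, the hard-coded list could presumably be replaced by a generator such as ((rec.id, str(rec.seq)) for rec in SeqIO.parse("test1.fasta", "fasta")), and the collected lines written to test1_out.txt instead of printed.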

Tags: python, fasta, biopython, python-2.7, python-3.x