如何使用特定步长使用window提取短序列？

Question

下面的代码在每个 window 大小为 4 的序列中提取短序列。如何将 window 移动步长 2 并提取 4 个碱基对？

示例代码

from Bio import SeqIO

with open("testA_out.fasta","w") as f:
        for seq_record in SeqIO.parse("testA.fasta", "fasta"):
            i = 0
            while ((i+4) < len(seq_record.seq)) :
              f.write(">" + str(seq_record.id) + "\n")
              f.write(str(seq_record.seq[i:i+4]) + "\n")
              i += 2

testA.fasta

的示例输入

>human1
ACCCGATTT

testA_out

的示例输出

>human1
ACCC
>human1
CCGA
>human1
GATT

这个输出的问题是遗漏了一个 T 所以在这种情况下我希望也包括它。我怎样才能得出这个输出？使用反向提取以及包括从头到尾提取时可能遗漏的碱基对。谁能帮帮我？

预期输出

>human1
ACCC
>human1
CCGA
>human1
GATT
>human1
ATTT
>human1
CGAT    
>human1
CCCG

Answer 1

对于任何 window 尺寸和任何步长：

fasta='ACCCGATTT'
windowSize=4
step=1
i=0
while (i+windowSize)<=len(fasta):
    currentWindow=fasta[i:i+windowSize]
    print(currentWindow)
    i+=step

输出window大小=4，步长=2：

ACCC 
CCGA                       
GATT

输出window大小=4，步长=1：

ACCC                       
CCCG
CCGA
CGAT
GATT
ATTT

最后一个与"Expected output"完全相同，排序不同。

Answer 2

您可以对 range 使用 for 循环，对 range 使用第三个 step 参数。这样，它比使用 while 循环更简洁。如果数据不能被分块大小，那么最后一个分块会更小。

data = "ACCCGATTT"
step = 2
chunk = 4
for i in range(0, len(data) - step, step):
    print(data[i:i+chunk])

输出是

ACCC
CCGA
GATT
TTT

Answer 3

您的特定示例可以通过将步长改为 1 来解决。不过你的问题好像是问，"how do I repeat with the same window size from the end of the sequence if there are not enough characters in the sequence"。因此，这会有所作为的一个例子可能是

AAAATTT

window 大小为 6，步长为 2，您希望 "forward" 方向的 AAAATT 和 [=38= 方向的 AAATTT ]方向，但没有其他子序列。

显然，运行代码一次在向前方向和一次在向后方向就可以做到这一点，但它引入了重复，这通常不是一件好事。但是，您可以重构问题，以便将步骤分成几对步骤。

对于长度为 x 且步长为 y, 的序列，您可以除以 y 进入 x%y 和 y-(x%y) 并继续这些成对的步骤。（当 x%y == 0 时跳过对中的第一个成员。）

我只发布字符串处理函数，因为 none 完全特定于基因序列。

seq = "AAAATTT"
window = 6
step = 2

length = len(seq)
modulo = length % step

for i in range(0, length-window, step):
    if modulo > 0:
        print(seq[i:i+window])
    print(seq[i+modulo:i+modulo+window])

如何使用特定步长使用window提取短序列？

How to extract short sequence using window with specific step size?

python

extract

extraction

fasta

biopython