How to iterate over sections of a string while also analyzing the upstream and downstream flanking regions?
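I want to run a sliding window in Python that examines a DNA sequence (between 2000 and 4000 base pairs long) in frames of 120 base pairs. I also want to consider roughly 20 nucleotides flanking the 120 base pair frame on its upstream and downstream sides. However, if the sliding window moves to, say, position 14 or position 1992 of a 2000 base pair sequence, then the upstream or downstream flanking region obviously has to be shorter than 20 base pairs.
So far, my code is set up like this: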
from Bio import SeqIO
# Note: Bio.Alphabet was removed in Biopython 1.78; on newer versions,
# drop this import and the alphabet argument below.
from Bio.Alphabet.IUPAC import IUPACUnambiguousDNA

fasta = SeqIO.to_dict(SeqIO.parse("RD4.fasta", "fasta", alphabet=IUPACUnambiguousDNA()))
sequence = list(fasta.values())[0].seq
print(sequence)

# Example sequence, hard-coded here so the loop below can be run on its own
sequence = "TGTGAATTCATACAAGCCGTAGTCGTGCAGAAGCGCAACACTCTTGGAGTGGCCTACAACGGCGCTCTCCGCGGCGCGGGCGTACCGGATATCTTAGCTGGTCAATAGCCATTTTTCAGCAATTTCTCAGTAACGCTACGGG"
target_length = 120
for position in range(len(sequence) - target_length + 1):
    stop = position + target_length
    potential_target_frame = str(sequence[position:stop])
    if position < 20:
        # Fewer than 20 bases exist upstream of the frame
        upstream_flank = sequence[:position]
        downstream_flank = sequence[stop:stop+20]
    elif len(sequence) - stop < 20:
        # Fewer than 20 bases exist downstream of the frame
        upstream_flank = sequence[position-20:position]
        downstream_flank = sequence[stop:]
    else:
        upstream_flank = sequence[position-20:position]
        downstream_flank = sequence[stop:stop+20]
    print("upstream flank is " + upstream_flank)
    print("downstream flank is " + downstream_flank)
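Although this code looks logically sound on the surface, the print output shows that something is wrong with it: only the downstream flank is printed, never the upstream flank.
Is the problem in how I set up the conditional tree, or in how I slice the original sequence?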
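It turns out I had set up the conditional tree incorrectly. I am handling two separate parts of the string, and each part can be in one of three states (longer than 20, shorter than 20, or of length 0), so the conditional tree needs 3^2 = 9 branches. When the upstream or downstream flank has length zero, I set its variable to an empty string.
The code should be set up like this (I condensed it slightly from the code above and changed how the upstream and downstream parts are computed):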
target_length = 120
for position in range(len(sequence) - target_length + 1):
    stop = position + target_length
    potential_target_frame = str(sequence[position:stop])
    if len(sequence[:position]) == 0 and len(sequence[stop:]) > 20:
        # No upstream bases at all, full-length downstream flank
        upstream_flank = ""
        downstream_flank = sequence[stop:stop+20]
        print("upstream flank is " + upstream_flank)
        print("downstream flank is " + downstream_flank)
    elif 0 < len(sequence[:position]) < 20 and len(sequence[stop:]) > 20:
        # Short upstream flank, full-length downstream flank
        upstream_flank = sequence[:position]
        downstream_flank = sequence[stop:stop+20]
        print("upstream flank is " + upstream_flank)
        print("downstream flank is " + downstream_flank)
    ############
    ##### Just assume the remaining scenarios are written out as elif conditions in this hash section
    ############
    else:
        # Both flanks are full length
        upstream_flank = sequence[position-20:position]
        downstream_flank = sequence[stop:stop+20]
        print("upstream flank is " + upstream_flank)
        print("downstream flank is " + downstream_flank)
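As an alternative to writing out every branch, here is a minimal sketch that leans on how Python slicing behaves. A slice that runs past the end of a string simply returns whatever is available, so the downstream flank needs no special case. The only thing that needs guarding is the upstream start, because a negative start index counts from the end of the string (for example, position 14 would give `sequence[-6:14]`, which is usually empty). The names `window_with_flanks`, `sliding_windows`, and `flank_length` are hypothetical, introduced only for this illustration, and the sketch assumes the sequence is a plain `str`.

```python
def window_with_flanks(sequence, position, target_length=120, flank_length=20):
    """Return (upstream_flank, target_frame, downstream_flank) for one window.

    Hypothetical helper for illustration; assumes `sequence` is a plain str.
    """
    stop = position + target_length
    # Clamp the start at 0 so a negative index never wraps around to the end.
    upstream_start = max(0, position - flank_length)
    upstream_flank = sequence[upstream_start:position]
    target_frame = sequence[position:stop]
    # Slicing past the end of the string just returns the bases that exist.
    downstream_flank = sequence[stop:stop + flank_length]
    return upstream_flank, target_frame, downstream_flank


def sliding_windows(sequence, target_length=120, flank_length=20):
    """Yield (position, upstream_flank, target_frame, downstream_flank) per window."""
    for position in range(len(sequence) - target_length + 1):
        yield (position,) + window_with_flanks(
            sequence, position, target_length, flank_length
        )


if __name__ == "__main__":
    # Toy 200-base sequence, used only to show the flank lengths near the edges.
    toy_sequence = "ACGT" * 50
    for position, upstream, frame, downstream in sliding_windows(toy_sequence):
        if position in (0, 14, 70, 80):
            print(position, len(upstream), len(frame), len(downstream))
```

With this approach the flanks shrink on their own near either end of the sequence, and an empty flank comes back as an empty string without a dedicated branch.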