Error in Bio.SeqUtils.molecular_weight() when printing sequences within a molecular weight interval
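I am trying to write a Python function that, given an (un)ambiguous sequence and a molecular weight interval, returns a list of all the unambiguous sequences that the sequence represents.
I tried the following code: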
def extend_ambiguous_dna(file_name, mw_min, mw_max):
    with open(file_name) as seq_file:
        for record in SeqIO.parse(seq_file, "fasta"):
            d = Seq.IUPACData.ambiguous_dna_values
            mol_weight= Bio.SeqUtils.molecular_weight(record.seq)
            for mol_weight in range(mw_min,mw_max):
                print(list(map("".join, product(*map(d.get, record)))))
extend_ambiguous_dna('short.fasta')
When I run it, I get an error from the molecular_weight function: 'D' is not a valid unambiguous letter for DNA.
Here is my FASTA file, named 'short.fasta':
>seq_7009 random sequence
DGRGGGWAVCVAACGTTGAT
>seq_418 random sequence
GAGCTGVTATST
>seq_9143_unamb random sequence
ACCGTTAAGCCTTAG
>seq_2888 random sequence
RVCCWDGARATAGBCGC
>seq_1101 random sequence
CSAATGYGATNBTA
>seq_107 random sequence
WGDGHGCDCTYANGTTWCA
>seq_6946 random sequence
TCVMBRAGRSGTCCAWA
>seq_6162 random sequence
YWBGCKTGCCAAGCGCDG
>seq_504 random sequence
ADDTAACCCTCTTKA
>seq_3535 random sequence
KKGTACACCAG
>seq_4077 random sequence
SRWSCRTTRVAGDCC
> seq_1626_unamb random sequence
GGATATTACCTA
I am new to Python, but I hope someone can help me.
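One problem is that you first try to calculate the weight of the ambiguous sequence:
    mol_weight= Bio.SeqUtils.molecular_weight(record.seq)
That does not work, as the error message says: 'D' is not an unambiguous letter. You never even use the resulting mol_weight, because the line
    for mol_weight in range(mw_min,mw_max):
overwrites the value of mol_weight.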
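To see the overwriting in isolation, here is a minimal sketch with no Biopython involved. The starting value is made up for illustration; the point is only that a for loop rebinds its loop variable, so whatever was stored under that name before is lost.

```python
# Pretend this value came from an earlier weight calculation (illustrative only).
mol_weight = 6303.05

# The loop reuses the same name, so each iteration rebinds mol_weight.
for mol_weight in range(3):
    pass

# The earlier value is gone; the name now holds the last loop value.
print(mol_weight)
```

After the loop the name holds 2, the last value produced by range(3), not the weight that was assigned first.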
What you want to do is:
- Calculate the weight of each generated unambiguous sequence
- Return only those sequences whose weight lies between mw_min and mw_max.
from itertools import product

import Bio
from Bio import SeqUtils, SeqIO
from Bio.Data import IUPACData


def extend_ambiguous_dna(file_name, mw_min, mw_max):
    with open(file_name) as seq_file:
        for record in SeqIO.parse(seq_file, "fasta"):
            # Maps each IUPAC letter to the unambiguous bases it stands for
            d = IUPACData.ambiguous_dna_values
            print(record.seq)
            # Expand the record into every unambiguous sequence it represents
            ambiguous_dna = list(map("".join, product(*map(d.get, record))))
            result = {}
            for seq in ambiguous_dna:
                # Weigh each unambiguous sequence, not the ambiguous record
                weight = Bio.SeqUtils.molecular_weight(seq)
                # Keep only sequences inside the requested interval
                if mw_min <= weight <= mw_max:
                    result[seq] = weight
            print(result)


if __name__ == '__main__':
    extend_ambiguous_dna("short.fasta", 0, 10000000)
Output:
DGRGGGWAVCVAACGTTGAT
{'AGAGGGAAACAAACGTTGAT': 6303.052399999999, 'AGAGGGAAACCAACGTTGAT': 6279.027699999999, 'AGAGGGAAACGAACGTTGAT': 6319.051799999999, 'AGAGGGAACCAAACGTTGAT': 6279.027699999999, 'AGAGGGAACCCAACGTTGAT': 6255.002999999999, 'AGAGGGAACCGAACGTTGAT': 6295.027099999999, 'AGAGGGAAGCAAACGTTGAT': 6319.051799999999, 'AGAGGGAAGCCAACGTTGAT': 6295.027099999999, 'AGAGGGAAGCGAACGTTGAT': 6335.051199999998, 'AGAGGGTAACAAACGTTGAT': 6294.039099999998, 'AGAGGGTAACCAACGTTGAT': 6270.014399999998, 'AGAGGGTAACGAACGTTGAT': 6310.038499999999, 'AGAGGGTACCAAACGTTGAT': 6270.014399999998, 'AGAGGGTACCCAACGTTGAT': 6245.989699999998, 'AGAGGGTACCGAACGTTGAT': 6286.013799999999, 'AGAGGGTAGCAAACGTTGAT': 6310.038499999999, 'AGAGGGTAGCCAACGTTGAT': 6286.013799999999, 'AGAGGGTAGCGAACGTTGAT': 6326.037899999999, 'AGGGGGAAACAAACGTTGAT': 6319.051799999999, 'AGGGGGAAACCAACGTTGAT': 6295.027099999999, 'AGGGGGAAACGAACGTTGAT': 6335.051199999998, 'AGGGGGAACCAAACGTTGAT': 6295.027099999999, 'AGGGGGAACCCAACGTTGAT': 6271.002399999999, 'AGGGGGAACCGAACGTTGAT': 6311.026499999998, 'AGGGGGAAGCAAACGTTGAT': 6335.051199999998, 'AGGGGGAAGCCAACGTTGAT': 6311.026499999998, 'AGGGGGAAGCGAACGTTGAT': 6351.050599999999, 'AGGGGGTAACAAACGTTGAT': 6310.038499999999, 'AGGGGGTAACCAACGTTGAT': 6286.013799999999, 'AGGGGGTAACGAACGTTGAT': 6326.037899999999, 'AGGGGGTACCAAACGTTGAT': 6286.013799999999, 'AGGGGGTACCCAACGTTGAT': 6261.989099999999, 'AGGGGGTACCGAACGTTGAT': 6302.013199999999, 'AGGGGGTAGCAAACGTTGAT': 6326.037899999999, 'AGGGGGTAGCCAACGTTGAT': 6302.013199999999, 'AGGGGGTAGCGAACGTTGAT': 6342.0373, 'GGAGGGAAACAAACGTTGAT': 6319.051799999999, 'GGAGGGAAACCAACGTTGAT': 6295.027099999999, 'GGAGGGAAACGAACGTTGAT': 6335.051199999998, 'GGAGGGAACCAAACGTTGAT': 6295.027099999999, 'GGAGGGAACCCAACGTTGAT': 6271.002399999999, 'GGAGGGAACCGAACGTTGAT': 6311.026499999998, 'GGAGGGAAGCAAACGTTGAT': 6335.051199999998, 'GGAGGGAAGCCAACGTTGAT': 6311.026499999998, 'GGAGGGAAGCGAACGTTGAT': 6351.050599999999, 'GGAGGGTAACAAACGTTGAT': 6310.038499999999, 'GGAGGGTAACCAACGTTGAT': 
6286.013799999999, 'GGAGGGTAACGAACGTTGAT': 6326.037899999999, 'GGAGGGTACCAAACGTTGAT': 6286.013799999999, 'GGAGGGTACCCAACGTTGAT': 6261.989099999999, 'GGAGGGTACCGAACGTTGAT': 6302.013199999999, 'GGAGGGTAGCAAACGTTGAT': 6326.037899999999, 'GGAGGGTAGCCAACGTTGAT': 6302.013199999999, 'GGAGGGTAGCGAACGTTGAT': 6342.0373, 'GGGGGGAAACAAACGTTGAT': 6335.051199999998, 'GGGGGGAAACCAACGTTGAT': 6311.026499999998, 'GGGGGGAAACGAACGTTGAT': 6351.050599999999, 'GGGGGGAACCAAACGTTGAT': 6311.026499999998, 'GGGGGGAACCCAACGTTGAT': 6287.001799999998, 'GGGGGGAACCGAACGTTGAT': 6327.025899999999, 'GGGGGGAAGCAAACGTTGAT': 6351.050599999999, 'GGGGGGAAGCCAACGTTGAT': 6327.025899999999, 'GGGGGGAAGCGAACGTTGAT': 6367.049999999999, 'GGGGGGTAACAAACGTTGAT': 6326.037899999999, 'GGGGGGTAACCAACGTTGAT': 6302.013199999999, 'GGGGGGTAACGAACGTTGAT': 6342.0373, 'GGGGGGTACCAAACGTTGAT': 6302.013199999999, 'GGGGGGTACCCAACGTTGAT': 6277.9884999999995, 'GGGGGGTACCGAACGTTGAT': 6318.0126, 'GGGGGGTAGCAAACGTTGAT': 6342.0373, 'GGGGGGTAGCCAACGTTGAT': 6318.0126, 'GGGGGGTAGCGAACGTTGAT': 6358.036699999999, 'TGAGGGAAACAAACGTTGAT': 6294.039099999998, 'TGAGGGAAACCAACGTTGAT': 6270.014399999998, 'TGAGGGAAACGAACGTTGAT': 6310.038499999999, 'TGAGGGAACCAAACGTTGAT': 6270.014399999998, 'TGAGGGAACCCAACGTTGAT': 6245.989699999998, 'TGAGGGAACCGAACGTTGAT': 6286.013799999999, 'TGAGGGAAGCAAACGTTGAT': 6310.038499999999, 'TGAGGGAAGCCAACGTTGAT': 6286.013799999999, 'TGAGGGAAGCGAACGTTGAT': 6326.037899999999, 'TGAGGGTAACAAACGTTGAT': 6285.025799999999, 'TGAGGGTAACCAACGTTGAT': 6261.0010999999995, 'TGAGGGTAACGAACGTTGAT': 6301.025199999998, 'TGAGGGTACCAAACGTTGAT': 6261.0010999999995, 'TGAGGGTACCCAACGTTGAT': 6236.9764, 'TGAGGGTACCGAACGTTGAT': 6277.000499999998, 'TGAGGGTAGCAAACGTTGAT': 6301.025199999998, 'TGAGGGTAGCCAACGTTGAT': 6277.000499999998, 'TGAGGGTAGCGAACGTTGAT': 6317.024599999999, 'TGGGGGAAACAAACGTTGAT': 6310.038499999999, 'TGGGGGAAACCAACGTTGAT': 6286.013799999999, 'TGGGGGAAACGAACGTTGAT': 6326.037899999999, 'TGGGGGAACCAAACGTTGAT': 6286.013799999999, 
'TGGGGGAACCCAACGTTGAT': 6261.989099999999, 'TGGGGGAACCGAACGTTGAT': 6302.013199999999, 'TGGGGGAAGCAAACGTTGAT': 6326.037899999999, 'TGGGGGAAGCCAACGTTGAT': 6302.013199999999, 'TGGGGGAAGCGAACGTTGAT': 6342.037299999998, 'TGGGGGTAACAAACGTTGAT': 6301.025199999998, 'TGGGGGTAACCAACGTTGAT': 6277.000499999998, 'TGGGGGTAACGAACGTTGAT': 6317.024599999999, 'TGGGGGTACCAAACGTTGAT': 6277.000499999998, 'TGGGGGTACCCAACGTTGAT': 6252.975799999998, 'TGGGGGTACCGAACGTTGAT': 6292.999899999999, 'TGGGGGTAGCAAACGTTGAT': 6317.024599999999, 'TGGGGGTAGCCAACGTTGAT': 6292.999899999999, 'TGGGGGTAGCGAACGTTGAT': 6333.023999999999}
GAGCTGVTATST
{'GAGCTGATATCT': 3740.3889000000004, 'GAGCTGATATGT': 3780.4130000000005, 'GAGCTGCTATCT': 3716.3642000000004, 'GAGCTGCTATGT': 3756.3883000000005, 'GAGCTGGTATCT': 3756.3883000000005, 'GAGCTGGTATGT': 3796.4124000000006}
ACCGTTAAGCCTTAG
{'ACCGTTAAGCCTTAG': 4631.958999999999}
[...]
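The function above prints a dict per record. Since the question asks for a function that returns the matching sequences, a variant along the following lines may be closer to what is wanted. This is a dependency-free sketch under stated assumptions, not the Biopython implementation: IUPAC_DNA is a hand-written copy of the standard IUPAC codes, approx_weight is a hypothetical stand-in for Bio.SeqUtils.molecular_weight whose constants were chosen because they reproduce the weights in the output above, and expand_in_range is a name introduced here for illustration.

```python
from itertools import product

# Standard IUPAC nucleotide codes, written out by hand for this sketch.
IUPAC_DNA = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

# Assumed per-nucleotide masses and water mass (illustrative stand-in values
# that reproduce the weights printed in the output above).
BASE_MASS = {"A": 331.2218, "C": 307.1971, "G": 347.2212, "T": 322.2085}
WATER = 18.0153


def approx_weight(seq):
    """Hypothetical stand-in for Bio.SeqUtils.molecular_weight (single-stranded DNA)."""
    # Sum the nucleotide masses and subtract one water per bond between them.
    return sum(BASE_MASS[base] for base in seq) - (len(seq) - 1) * WATER


def expand_in_range(ambiguous, mw_min, mw_max, weight_fn=approx_weight):
    """Return the unambiguous expansions whose weight lies in [mw_min, mw_max]."""
    hits = []
    # product() yields one tuple of bases per possible unambiguous sequence.
    for letters in product(*(IUPAC_DNA[code] for code in ambiguous)):
        seq = "".join(letters)
        if mw_min <= weight_fn(seq) <= mw_max:
            hits.append(seq)
    return hits


print(expand_in_range("GAGCTGVTATST", 3700, 3750))
# ['GAGCTGATATCT', 'GAGCTGCTATCT']
```

In real code you would probably pass weight_fn=Bio.SeqUtils.molecular_weight and call the helper once per record with str(record.seq). Iterating over product() directly, instead of building the full list first, also keeps memory use low for sequences with many ambiguous letters.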