计算 DNA 序列中的三联体
Counting triplets in a DNA-sequence
我想编写一个代码来计算一个序列中的所有三胞胎。到目前为止,我已经阅读了很多帖子,但其中 none 可以帮助我。
这是我的代码:
def cnt(seq):
mydict = {}
if len(seq) % 3 == 0:
a = [x for x in seq]
for i in range(len(seq)//3):
b = ''.join(a[(0+3*i):(3+3*i)])
for base1 in ['A', 'T', 'G', 'C']:
for base2 in ['A', 'T', 'G', 'C']:
for base3 in ['A', 'T', 'G', 'C']:
triplet = base1 + base2 + base3
if b == triplet:
mydict[b] = 1
for key in sorted(mydict):
print("%s: %s" % (key, mydict[key]))
else:
print("Error")
Biopython有没有提供解决这个问题的功能?
编辑:
注意 例如,在序列 'ATGAAG'、'TGA' 或 'GAA' 中 不是 "valid"三胞胎,只有'ATG'和'AAG',因为在生物学和生物信息学中,我们读到'ATG'和'AAG',这就是我们需要的信息翻译它或其他任何东西。
你可以把它想象成一个单词序列,例如"Hello world"。我们的阅读方式是"Hello"和"world",而不是"Hello"、"ello "、"llo w"、...
你可以这样做:
from itertools import product
seq = 'ATGATG'
all_triplets = [seq[i:i+3] for i in range(len(seq)) if i <= len(seq)-3]
# this gives ['ATG', 'TGA', 'GAT', 'ATG']
# add more valid_triplets here
valid_triplets = ['ATG']
len([(i, j) for i, j in product(valid_triplets, all_triplets) if i==j])
输出:
2
不清楚预期的输出结果。这里我们使用许多 grouping functions from more_itertools
中的一个来构建相邻的三元组或 "codons".
import more_itertools as mit
seq = "ATGATG"
codons = ["".join(w) for w in mit.grouper(3, seq)]
codons
# ['ATG', 'ATG']
调用len
.
计算密码子数
len(triplets)
# 2
要进行更详细的分析,请考虑将问题拆分为更小的函数,即 (1) 提取密码子和 (2) 计算出现次数。
代码
import collections as ct
def split_codons(seq):
"Return codons from a sequence; raise for bad sequences."
for w in mit.windowed(seq, n=3, step=3, fillvalue=""):
part = "".join(w)
if len(part) < 3:
raise ValueError(f"Sequence not divisible by 3. Got extra '{part}'.")
yield part
def count_codons(codons):
"""Return dictionary of codon occurences."""
dd = ct.defaultdict(int)
for i, c in enumerate(codons, 1):
dd[c] += 1
return {k: (v, 100 * v/i) for k, v in dd.items()}
演示
>>> seq = "ATCGCAGAAATCCGCAGAATC"
>>> bad_seq = "ATCGCAGAAATCCGCAGAATCA"
>>> list(split_codons(seq))
['ATC', 'GCA', 'GAA', 'ATC', 'CGC', 'AGA', 'ATC']
>>> list(split_codons(bad_seq))
ValueError: Sequence not divisible by 3. Got extra 'A'.
>>> count_codons(split_codons(seq))
{'ATC': (3, 42.857142857142854),
'GCA': (1, 14.285714285714286),
'GAA': (1, 14.285714285714286),
'CGC': (1, 14.285714285714286),
'AGA': (1, 14.285714285714286)}
我花了一段时间才明白你不是要计算密码子的数量,而是要计算每个密码子的频率。你的标题在这方面有点误导。无论如何,您可以雇用 collections.Counter
来完成您的任务:
from collections import Counter
def cnt(seq):
if len(seq) % 3 == 0:
#split list into codons of three
codons = [seq[i:i+3] for i in range(0, len(seq), 3)]
#create Counter dictionary for it
codon_freq = Counter(codons)
#determine number of codons, should be len(seq) // 3
n = sum(codon_freq.values())
#print out all entries in an appealing form
for key in sorted(codon_freq):
print("{}: {} = {:5.2f}%".format(key, codon_freq[key], codon_freq[key] * 100 / n))
#or just the dictionary
#print(codon_freq)
else:
print("Error")
seq = "ATCGCAGAAATCCGCAGAATC"
cnt(seq)
示例输出:
AGA: 1 = 14.29%
ATC: 3 = 42.86%
CGC: 1 = 14.29%
GAA: 1 = 14.29%
GCA: 1 = 14.29%
您可以使用其他答案中建议的巧妙技术,但我将从您的代码开始构建一个解决方案,这几乎可以正常工作:您的问题是每次 mydict[b] = 1
,您都会重置b
的计数为 1.
一个最小的修复
您可以通过测试密钥是否存在来解决此问题,如果不存在,则在字典中创建条目,然后递增值,但 python.
中有更方便的工具
对您的代码进行的最小更改是使用 defaultdict(int)
而不是 dict。每当遇到新键时,都会假定它具有与 int 相关联的默认值:0。因此您可以增加该值而不是重置:
from collections import defaultdict
def cnt(seq):
# instanciate a defaultdict that creates ints when necessary
mydict = defaultdict(int)
if len(seq) % 3 == 0:
a = [x for x in seq]
for i in range(len(seq)//3):
b = ''.join(a[(0+3*i):(3+3*i)])
for base1 in ['A', 'T', 'G', 'C']:
for base2 in ['A', 'T', 'G', 'C']:
for base3 in ['A', 'T', 'G', 'C']:
triplet = base1 + base2 + base3
if b == triplet:
# increment the existing count (or the default 0 value)
mydict[b] += 1
for key in sorted(mydict):
print("%s: %s" % (key, mydict[key]))
else:
print("Error")
效果如愿:
cnt("ACTGGCACT")
ACT: 2
GGC: 1
一些可能的改进
现在让我们尝试稍微改进一下您的代码。
首先,正如我在评论中所写,让我们避免将您的序列不必要地转换为列表,并为当前计数的密码子使用更好的变量名称:
from collections import defaultdict
def cnt(seq):
mydict = defaultdict(int)
if len(seq) % 3 == 0:
a = [x for x in seq]
for i in range(len(seq)//3):
codon = seq[(0+3*i):(3+3*i)]
for base1 in ['A', 'T', 'G', 'C']:
for base2 in ['A', 'T', 'G', 'C']:
for base3 in ['A', 'T', 'G', 'C']:
triplet = base1 + base2 + base3
if codon == triplet:
mydict[codon] += 1
for key in sorted(mydict):
print("%s: %s" % (key, mydict[key]))
else:
print("Error")
现在让我们通过预先生成一组可能的密码子来简化嵌套循环部分,尝试所有可能的密码子:
from collections import defaultdict
from itertools import product
codons = {
"".join((base1, base2, base3))
for (base1, base2, base3) in product("ACGT", "ACGT", "ACGT")}
def cnt(seq):
mydict = defaultdict(int)
if len(seq) % 3 == 0:
a = [x for x in seq]
for i in range(len(seq)//3):
codon = seq[(0+3*i):(3+3*i)]
if codon in codons:
mydict[codon] += 1
for key in sorted(mydict):
print("%s: %s" % (key, mydict[key]))
else:
print("Error")
现在,您的代码会忽略不是有效密码子的三联体。也许你应该改为发出警告:
from collections import defaultdict
from itertools import product
codons = {
"".join((base1, base2, base3))
for (base1, base2, base3) in product("ACGT", "ACGT", "ACGT")}
def cnt(seq):
mydict = defaultdict(int)
if len(seq) % 3 == 0:
a = [x for x in seq]
for i in range(len(seq)//3):
codon = seq[(0+3*i):(3+3*i)]
# We count even invalid triplets
mydict[codon] += 1
# We display counts only for valid triplets
for codon in sorted(codons):
print("%s: %s" % (codon, mydict[codon]))
# We compute the set of invalid triplets:
# the keys that are not codons.
invalid = mydict.keys() - codons
# An empty set has value False in a test.
# We issue a warning if the set is not empty.
if invalid:
print("Warning! There are invalid triplets:")
print(", ".join(sorted(invalid)))
else:
print("Error")
更奇特的解决方案
现在是一个更奇特的解决方案,使用 cytoolz(可能需要安装,因为它不是通常的 python 发行版的一部分:pip3 install cytoolz
,如果您使用的是 pip):
from collections import Counter
from itertools import product, repeat
from cytoolz import groupby, keymap, partition
# To make strings out of lists of strings
CAT = "".join
# The star "extracts" the elements from the result of repeat,
# so that product has 3 arguments, and not a single one
codons = {CAT(bases) for bases in product(*repeat("ACGT", 3))}
def cnt(seq):
# keymap(CAT, ...) transforms the keys (that are tuples of letters)
# into strings
# if len(seq) is not a multiple of 3, pad="-" will append "-"
# to complete the last triplet (which will be an invalid one)
codon_counts = keymap(CAT, Counter(partition(3, seq, pad="-")))
# separate encountered codons into valids and invalids
codons_by_validity = groupby(codons.__contains__, codon_counts.keys())
# get allows to provide a default value,
# in case one of the categories is not present
valids = codons_by_validity.get(True, [])
invalids = codons_by_validity.get(False, [])
# We display counts only for valid triplets
for codon in sorted(valids):
print("%s: %s" % (codon, codon_counts[codon]))
# We issue a warning if there are invalid codons.
if invalids:
print("Warning! There are invalid triplets:")
print(", ".join(sorted(invalids)))
希望这对您有所帮助。
我想编写一个代码来计算一个序列中的所有三胞胎。到目前为止,我已经阅读了很多帖子,但其中 none 可以帮助我。
这是我的代码:
def cnt(seq):
mydict = {}
if len(seq) % 3 == 0:
a = [x for x in seq]
for i in range(len(seq)//3):
b = ''.join(a[(0+3*i):(3+3*i)])
for base1 in ['A', 'T', 'G', 'C']:
for base2 in ['A', 'T', 'G', 'C']:
for base3 in ['A', 'T', 'G', 'C']:
triplet = base1 + base2 + base3
if b == triplet:
mydict[b] = 1
for key in sorted(mydict):
print("%s: %s" % (key, mydict[key]))
else:
print("Error")
Biopython有没有提供解决这个问题的功能?
编辑:
注意 例如,在序列 'ATGAAG'、'TGA' 或 'GAA' 中 不是 "valid"三胞胎,只有'ATG'和'AAG',因为在生物学和生物信息学中,我们读到'ATG'和'AAG',这就是我们需要的信息翻译它或其他任何东西。
你可以把它想象成一个单词序列,例如"Hello world"。我们的阅读方式是"Hello"和"world",而不是"Hello"、"ello "、"llo w"、...
你可以这样做:
from itertools import product
seq = 'ATGATG'
all_triplets = [seq[i:i+3] for i in range(len(seq)) if i <= len(seq)-3]
# this gives ['ATG', 'TGA', 'GAT', 'ATG']
# add more valid_triplets here
valid_triplets = ['ATG']
len([(i, j) for i, j in product(valid_triplets, all_triplets) if i==j])
输出:
2
不清楚预期的输出结果。这里我们使用许多 grouping functions from more_itertools
中的一个来构建相邻的三元组或 "codons".
import more_itertools as mit
seq = "ATGATG"
codons = ["".join(w) for w in mit.grouper(3, seq)]
codons
# ['ATG', 'ATG']
调用len
.
len(triplets)
# 2
要进行更详细的分析,请考虑将问题拆分为更小的函数,即 (1) 提取密码子和 (2) 计算出现次数。
代码
import collections as ct
def split_codons(seq):
"Return codons from a sequence; raise for bad sequences."
for w in mit.windowed(seq, n=3, step=3, fillvalue=""):
part = "".join(w)
if len(part) < 3:
raise ValueError(f"Sequence not divisible by 3. Got extra '{part}'.")
yield part
def count_codons(codons):
"""Return dictionary of codon occurences."""
dd = ct.defaultdict(int)
for i, c in enumerate(codons, 1):
dd[c] += 1
return {k: (v, 100 * v/i) for k, v in dd.items()}
演示
>>> seq = "ATCGCAGAAATCCGCAGAATC"
>>> bad_seq = "ATCGCAGAAATCCGCAGAATCA"
>>> list(split_codons(seq))
['ATC', 'GCA', 'GAA', 'ATC', 'CGC', 'AGA', 'ATC']
>>> list(split_codons(bad_seq))
ValueError: Sequence not divisible by 3. Got extra 'A'.
>>> count_codons(split_codons(seq))
{'ATC': (3, 42.857142857142854),
'GCA': (1, 14.285714285714286),
'GAA': (1, 14.285714285714286),
'CGC': (1, 14.285714285714286),
'AGA': (1, 14.285714285714286)}
我花了一段时间才明白你不是要计算密码子的数量,而是要计算每个密码子的频率。你的标题在这方面有点误导。无论如何,您可以雇用 collections.Counter
来完成您的任务:
from collections import Counter
def cnt(seq):
if len(seq) % 3 == 0:
#split list into codons of three
codons = [seq[i:i+3] for i in range(0, len(seq), 3)]
#create Counter dictionary for it
codon_freq = Counter(codons)
#determine number of codons, should be len(seq) // 3
n = sum(codon_freq.values())
#print out all entries in an appealing form
for key in sorted(codon_freq):
print("{}: {} = {:5.2f}%".format(key, codon_freq[key], codon_freq[key] * 100 / n))
#or just the dictionary
#print(codon_freq)
else:
print("Error")
seq = "ATCGCAGAAATCCGCAGAATC"
cnt(seq)
示例输出:
AGA: 1 = 14.29%
ATC: 3 = 42.86%
CGC: 1 = 14.29%
GAA: 1 = 14.29%
GCA: 1 = 14.29%
您可以使用其他答案中建议的巧妙技术,但我将从您的代码开始构建一个解决方案,这几乎可以正常工作:您的问题是每次 mydict[b] = 1
,您都会重置b
的计数为 1.
一个最小的修复
您可以通过测试密钥是否存在来解决此问题,如果不存在,则在字典中创建条目,然后递增值,但 python.
中有更方便的工具对您的代码进行的最小更改是使用 defaultdict(int)
而不是 dict。每当遇到新键时,都会假定它具有与 int 相关联的默认值:0。因此您可以增加该值而不是重置:
from collections import defaultdict
def cnt(seq):
# instanciate a defaultdict that creates ints when necessary
mydict = defaultdict(int)
if len(seq) % 3 == 0:
a = [x for x in seq]
for i in range(len(seq)//3):
b = ''.join(a[(0+3*i):(3+3*i)])
for base1 in ['A', 'T', 'G', 'C']:
for base2 in ['A', 'T', 'G', 'C']:
for base3 in ['A', 'T', 'G', 'C']:
triplet = base1 + base2 + base3
if b == triplet:
# increment the existing count (or the default 0 value)
mydict[b] += 1
for key in sorted(mydict):
print("%s: %s" % (key, mydict[key]))
else:
print("Error")
效果如愿:
cnt("ACTGGCACT")
ACT: 2
GGC: 1
一些可能的改进
现在让我们尝试稍微改进一下您的代码。
首先,正如我在评论中所写,让我们避免将您的序列不必要地转换为列表,并为当前计数的密码子使用更好的变量名称:
from collections import defaultdict
def cnt(seq):
mydict = defaultdict(int)
if len(seq) % 3 == 0:
a = [x for x in seq]
for i in range(len(seq)//3):
codon = seq[(0+3*i):(3+3*i)]
for base1 in ['A', 'T', 'G', 'C']:
for base2 in ['A', 'T', 'G', 'C']:
for base3 in ['A', 'T', 'G', 'C']:
triplet = base1 + base2 + base3
if codon == triplet:
mydict[codon] += 1
for key in sorted(mydict):
print("%s: %s" % (key, mydict[key]))
else:
print("Error")
现在让我们通过预先生成一组可能的密码子来简化嵌套循环部分,尝试所有可能的密码子:
from collections import defaultdict
from itertools import product
codons = {
"".join((base1, base2, base3))
for (base1, base2, base3) in product("ACGT", "ACGT", "ACGT")}
def cnt(seq):
mydict = defaultdict(int)
if len(seq) % 3 == 0:
a = [x for x in seq]
for i in range(len(seq)//3):
codon = seq[(0+3*i):(3+3*i)]
if codon in codons:
mydict[codon] += 1
for key in sorted(mydict):
print("%s: %s" % (key, mydict[key]))
else:
print("Error")
现在,您的代码会忽略不是有效密码子的三联体。也许你应该改为发出警告:
from collections import defaultdict
from itertools import product
codons = {
"".join((base1, base2, base3))
for (base1, base2, base3) in product("ACGT", "ACGT", "ACGT")}
def cnt(seq):
mydict = defaultdict(int)
if len(seq) % 3 == 0:
a = [x for x in seq]
for i in range(len(seq)//3):
codon = seq[(0+3*i):(3+3*i)]
# We count even invalid triplets
mydict[codon] += 1
# We display counts only for valid triplets
for codon in sorted(codons):
print("%s: %s" % (codon, mydict[codon]))
# We compute the set of invalid triplets:
# the keys that are not codons.
invalid = mydict.keys() - codons
# An empty set has value False in a test.
# We issue a warning if the set is not empty.
if invalid:
print("Warning! There are invalid triplets:")
print(", ".join(sorted(invalid)))
else:
print("Error")
更奇特的解决方案
现在是一个更奇特的解决方案,使用 cytoolz(可能需要安装,因为它不是通常的 python 发行版的一部分:pip3 install cytoolz
,如果您使用的是 pip):
from collections import Counter
from itertools import product, repeat
from cytoolz import groupby, keymap, partition
# To make strings out of lists of strings
CAT = "".join
# The star "extracts" the elements from the result of repeat,
# so that product has 3 arguments, and not a single one
codons = {CAT(bases) for bases in product(*repeat("ACGT", 3))}
def cnt(seq):
# keymap(CAT, ...) transforms the keys (that are tuples of letters)
# into strings
# if len(seq) is not a multiple of 3, pad="-" will append "-"
# to complete the last triplet (which will be an invalid one)
codon_counts = keymap(CAT, Counter(partition(3, seq, pad="-")))
# separate encountered codons into valids and invalids
codons_by_validity = groupby(codons.__contains__, codon_counts.keys())
# get allows to provide a default value,
# in case one of the categories is not present
valids = codons_by_validity.get(True, [])
invalids = codons_by_validity.get(False, [])
# We display counts only for valid triplets
for codon in sorted(valids):
print("%s: %s" % (codon, codon_counts[codon]))
# We issue a warning if there are invalid codons.
if invalids:
print("Warning! There are invalid triplets:")
print(", ".join(sorted(invalids)))
希望这对您有所帮助。