频率加起来不为一

Question

我正在编写一个函数，它应该通过 DNA 序列的 .fasta 文件并为文件中的每个序列创建核苷酸 (nt) 和二核苷酸 (dnt) 频率的字典。然后我将每个字典存储在一个名为 "frequency" 的列表中。这是一段奇怪的代码：

for fasta in seq_file:
    freq = {}
    dna = str(fasta.seq)
    for base1 in ['A', 'T', 'G', 'C']:
        onefreq = float(dna.count(base1)) / len(dna)
        freq[base1] = onefreq
        for base2 in ['A', 'T', 'G', 'C']:
            dinucleotide = base1 + base2
            twofreq = float(dna.count(dinucleotide)) / (len(dna) - 1) 
            freq[dinucleotide] = twofreq
    frequency.append(freq)

（顺便说一句，我正在使用 biopython，这样我就不必将整个 fasta 文件提交到内存中。这也是针对 ssDNA 的，所以我不必考虑反义 dnt)

单个 nt 记录的频率相加为 1.0，但 dnt 的频率相加不为 1.0。这很奇怪，因为这两种频率的计算方法在我看来是相同的。

我将诊断打印语句和 "check" 变量留在了：

for fasta in seq_file:
    freq = {}
    dna = str(fasta.seq)
    check = 0.0
    check2 = 0.0
    for base1 in ['A', 'T', 'G', 'C']:
        onefreq = float(dna.count(base1)) / len(dna)
        freq[base1] = onefreq
        check2 += onefreq
        for base2 in ['A', 'T', 'G', 'C']:
            dinucleotide = base1 + base2
            twofreq = float(dna.count(dinucleotide)) / (len(dna) - 1) 
            check += twofreq
            print(twofreq)
            freq[dinucleotide] = twofreq
    print("\n")
    print(check, check2)
    print(len(dna))
    print("\n")
    frequency.append(freq)

获取此输出：（仅针对一个序列）

0.0894168466523 
0.0760259179266
0.0946004319654
0.0561555075594
0.0431965442765
0.0423326133909
0.0747300215983
0.0488120950324
0.0976241900648
0.0483801295896
0.0539956803456
0.0423326133909
0.0863930885529
0.0419006479482
0.0190064794816
0.031101511879


(0.9460043196544274, 1.0)
2316

在这里我们可以看到16种可能的不同dnt的频率，所有dnt频率的总和（0.946）和所有nt频率的总和（1.0）以及序列的长度。

为什么dnt频率加起来不是1.0？

感谢您的帮助。我是 python 的新手，这是我的第一个问题，所以我希望这个提交是可以接受的。

Answer 1

您的问题，请尝试以下 fasta：

>test
AAAAAA

"AAAAAA".count("AA")

你得到：

应该是

原因

来自文档：计算 return 子字符串 sub 在字符串 s[start:end]

中出现的（非重叠）次数

解决方案 使用 Counter 和块函数

from Bio import SeqIO

def chunks(l, n):
  for i in xrange(0, len(l)-(n-1)):
    yield l[i:i+n]

from collections import Counter

frequency = []
input_file = "test.fasta"
for fasta in SeqIO.parse(open(input_file), "fasta"):
  dna = str(fasta.seq)
  freq = Counter(dna)   #get counter of single bases
  freq.update(Counter(chunks(dna,2))) #update with counter of dinucleotide
  frequency.append(freq)

对于 "AAAAAA" 你得到：

Counter({'A': 6, 'AA': 5})

Answer 2

str.count() 不计算它找到的重叠主题。

示例：

如果你的序列中有 'AAAA' 并且你寻找二核苷酸 'AA'，你期望比 'AAAA'.count('AA') return你3，但它会return 2。所以：

print float('AAAA'.count('AA')) / (len('AAAA') - 1)
0.666666

而不是 1

您可以通过以下方式更改计算频率的行：

twofreq = len([i for i in range(len(dna)-1) if dna[i:i+2] == dinucleotide]) / float((len(dna) - 1))

Answer 3

您扫描字符串的次数远远超出您的需要 - 实际上是 20 次。这对于小的测试序列可能无关紧要，但随着它们变大，它会很明显。我会推荐一种不同的方法，它解决了重叠作为副作用的问题：

nucleotides = [ 'A', 'T', 'G', 'C' ]
dinucleotides = [ x+y for x in nucleotides for y in nucleotides ]
counts = { x : 0 for x in nucleotides + dinucleotides }

# count the first nucleotide, which has no previous one
n_nucl = 1
prevn = dna[0]
counts[prevn] += 1

# count the rest, along with the pairs made with each previous one
for nucl in dna[1:]:
    counts[nucl] += 1
    counts[prevn + nucl] += 1
    n_nucl += 1
    prevn = nucl

total = 0.0
for nucl in nucleotides:
    pct = counts[nucl] / float(n_nucl)
    total += pct
    print "{} : {} {}%".format(nucl, counts[nucl], pct)
print "Total : {}%".format(total) 

total = 0.0
for dnucl in dinucleotides:
    pct = counts[dnucl] / float(n_nucl - 1)
    total += pct
    print "{} : {} {}%".format(dnucl, counts[dnucl], pct)
print "Total : {}%".format(total)

这种方法只扫描一次字符串，虽然它确实是更多的代码...

频率加起来不为一

Frequencies not adding up to one

python

biopython

python-2.7