使用 biopython 计算对齐中相同站点百分比的更快方法
Faster way of calculating the percentage of identical sites in alignment using biopython
我开发了以下代码来计算比对中相同位点的数量。不幸的是,代码很慢,我必须对数百个文件进行迭代,处理超过 1000 个比对需要将近 12 个小时,这意味着速度快十倍是合适的。任何帮助将不胜感激:
import os
from Bio import SeqIO
from Bio.Seq import Seq
from Bio import AlignIO
from Bio.SeqRecord import SeqRecord
from Bio.Alphabet import generic_dna
from Bio.Align import MultipleSeqAlignment
import time
a = SeqRecord(Seq("CCAAGCTGAATCAGCTGGCGGAGTCACTGAAACTGGAGCACCAGTTCCTAAGAGTTCCTTTCGAGCACTACAAGAAGACGATTCGCGCGAACCACCGCAT", generic_dna), id="Alpha")
b = SeqRecord(Seq("CGAAGCTGACTCAGTGGGCGGAGTCACTGAAACTGGAGCACCAGTTCCTCAGAGTCCCCTTCGAGCACTACAAGAAGACAATTCGTGCGAACCACCGCAT", generic_dna), id="Beta")
c = SeqRecord(Seq("CGAAGCTGACTCAGTTGGCAGAATCACTGAAACTGGAGCACCAGTTCCTCAGAGTCCCCTTCGAGCACTACAAGAAGACGATTCGTGCGAACCACCGCAT", generic_dna), id="Gamma")
d = SeqRecord(Seq("CGAAGCTGACTCAGTTGGCAGAGTCACTGAAACTGGAGCACCAGTTCCTCAGAGTCCCCTTCGAGCACTACAAGAAGACGATTCGTGCGAACCACCGCAT", generic_dna), id="Delta")
e = SeqRecord(Seq("CGAAGCTGACTCAGTTGGCGGAGTCACTGAAACTGGAGCACCAGTTCCTCAGAGTCCCCTTCGAGCACTACAAGAAGACGATTCGTGCGAACCACCGCAT", generic_dna), id="Epsilon")
align = MultipleSeqAlignment([a, b, c], annotations={"tool": "demo"})
start_time = time.time()
if len(align) != 1:
for n in range(0,len(align[0])):
n=0
i=0
while n<len(align[0]): #part that needs to be faster
column = align[:,n]
if (column == len(column) * column[0]) == True:
i=i+1
n=n+1
match = float(i)
length = float(n)
global_identity = 100*(float(match/length))
print(global_identity)
print("--- %s seconds ---" % (time.time() - start_time))
So, you're trying to check that each of the 5 strings have the same characters in the column? If the characters in the column all match, you increment i
, else you increment n
.
您对代码的解释是正确的。
基于以上所述,我建议使用以下代码作为更快的替代方法。
我想 align
是这样的结构:
align = [
'AGCTCGCGGAGGCGCTGCT....',
'ACCTCGGAGGGCTGCTGTAC...',
'AGCTCGGAGGGCTGCTGTAC...',
# possibly more ...
]
我们尝试检测其中相同字符的列。在上方,第一列是 AAA
(匹配),下一列是 GCG
(不匹配)。
def all_equal(items):
"""Returns True iff all items are equal."""
first = items[0]
return all(x == first for x in items)
def compute_match(aligned_sequences):
"""Returns the ratio of same-character columns in ``aligned_sequences``.
:param aligned_sequences: a list of strings or equal length.
"""
match_count = 0
mismatch_count = 0
for chars in zip(*aligned_sequences):
# Here chars is a column of chars,
# one taken from each element of aligned_sequences.
if all_equal(chars):
match_count += 1
else:
mismatch_count += 1
return float(match_count) / float(mismatch_count)
# What would make more sense:
# return float(matches) / len(aligned_sequences[0])
更短的版本:
def compute_match(aligned_sequences):
match_count = sum(1 for chars in zip(*aligned_sequences) if all_equal(chars))
total = len(aligned_sequences[0])
mismatch_count = total - match_count # Obviously.
return ...
我开发了以下代码来计算比对中相同位点的数量。不幸的是,代码很慢,我必须对数百个文件进行迭代,处理超过 1000 个比对需要将近 12 个小时,这意味着速度快十倍是合适的。任何帮助将不胜感激:
import os
from Bio import SeqIO
from Bio.Seq import Seq
from Bio import AlignIO
from Bio.SeqRecord import SeqRecord
from Bio.Alphabet import generic_dna
from Bio.Align import MultipleSeqAlignment
import time
a = SeqRecord(Seq("CCAAGCTGAATCAGCTGGCGGAGTCACTGAAACTGGAGCACCAGTTCCTAAGAGTTCCTTTCGAGCACTACAAGAAGACGATTCGCGCGAACCACCGCAT", generic_dna), id="Alpha")
b = SeqRecord(Seq("CGAAGCTGACTCAGTGGGCGGAGTCACTGAAACTGGAGCACCAGTTCCTCAGAGTCCCCTTCGAGCACTACAAGAAGACAATTCGTGCGAACCACCGCAT", generic_dna), id="Beta")
c = SeqRecord(Seq("CGAAGCTGACTCAGTTGGCAGAATCACTGAAACTGGAGCACCAGTTCCTCAGAGTCCCCTTCGAGCACTACAAGAAGACGATTCGTGCGAACCACCGCAT", generic_dna), id="Gamma")
d = SeqRecord(Seq("CGAAGCTGACTCAGTTGGCAGAGTCACTGAAACTGGAGCACCAGTTCCTCAGAGTCCCCTTCGAGCACTACAAGAAGACGATTCGTGCGAACCACCGCAT", generic_dna), id="Delta")
e = SeqRecord(Seq("CGAAGCTGACTCAGTTGGCGGAGTCACTGAAACTGGAGCACCAGTTCCTCAGAGTCCCCTTCGAGCACTACAAGAAGACGATTCGTGCGAACCACCGCAT", generic_dna), id="Epsilon")
align = MultipleSeqAlignment([a, b, c], annotations={"tool": "demo"})
start_time = time.time()
if len(align) != 1:
for n in range(0,len(align[0])):
n=0
i=0
while n<len(align[0]): #part that needs to be faster
column = align[:,n]
if (column == len(column) * column[0]) == True:
i=i+1
n=n+1
match = float(i)
length = float(n)
global_identity = 100*(float(match/length))
print(global_identity)
print("--- %s seconds ---" % (time.time() - start_time))
So, you're trying to check that each of the 5 strings have the same characters in the column? If the characters in the column all match, you increment
i
, else you incrementn
.您对代码的解释是正确的。
基于以上所述,我建议使用以下代码作为更快的替代方法。
我想 align
是这样的结构:
align = [
'AGCTCGCGGAGGCGCTGCT....',
'ACCTCGGAGGGCTGCTGTAC...',
'AGCTCGGAGGGCTGCTGTAC...',
# possibly more ...
]
我们尝试检测其中相同字符的列。在上方,第一列是 AAA
(匹配),下一列是 GCG
(不匹配)。
def all_equal(items):
"""Returns True iff all items are equal."""
first = items[0]
return all(x == first for x in items)
def compute_match(aligned_sequences):
"""Returns the ratio of same-character columns in ``aligned_sequences``.
:param aligned_sequences: a list of strings or equal length.
"""
match_count = 0
mismatch_count = 0
for chars in zip(*aligned_sequences):
# Here chars is a column of chars,
# one taken from each element of aligned_sequences.
if all_equal(chars):
match_count += 1
else:
mismatch_count += 1
return float(match_count) / float(mismatch_count)
# What would make more sense:
# return float(matches) / len(aligned_sequences[0])
更短的版本:
def compute_match(aligned_sequences):
match_count = sum(1 for chars in zip(*aligned_sequences) if all_equal(chars))
total = len(aligned_sequences[0])
mismatch_count = total - match_count # Obviously.
return ...