How to do pairwise comparison of values in a single dictionary?
I have a fasta file containing DNA sequences, and I want to do a pairwise comparison of each DNA sequence.
The fasta file has contents of the following form:
>dna1
TAGTACTGACCATGGCGTTTGTTG
>dna2
ACCTTGAGATACAAAACGATTGGACTG
>dna3
GCTTCACTGATGCAGTATTCAATTAACCAG
>dna4
CCACTGGAGCTTTCCAAAGGG
>dna5
TCTGTGGGTCCGGTTGTACAG
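My approach is to first create a dictionary from the fasta file of DNA sequences, and then do a pairwise comparison of the values in the dictionary to find the percent identity between each pair of sequences.
I am having trouble doing the pairwise comparison.
My code is as follows: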
from collections import OrderedDict
from typing import Dict

# Convert the fasta file to a dictionary
DnaName_SYMBOL = '>'

def parse_DNAsequences(filename: str,
                       ordered: bool=False) -> Dict[str, str]:
    # filename: str, path to the fasta file
    # ordered: bool, gives us an option to order the resulting dictionary
    result = OrderedDict() if ordered else {}
    last_name = None
    with open(filename) as sequences:
        for line in sequences:
            if line.startswith(DnaName_SYMBOL):
                # Header line: drop the '>' and surrounding whitespace
                last_name = line[1:].strip()
                result[last_name] = []
            else:
                # strip() instead of line[:-1], so a final line without
                # a trailing newline does not lose its last base
                result[last_name].append(line.strip())
    # Join multi-line sequences into a single string per name
    for name in result:
        result[name] = ''.join(result[name])
    return result

DNAdict = parse_DNAsequences('output.fas')
This is the part where I am having trouble, iterating over the dictionary values:
def PairwiseComparison():
    match = sum(s1 == s2 for s1, s2 in zip(a,b))
    if len(s1) > len(s2):
        lengthchosen = len(s1)
    percentidentity = 100*match/lengthchosen
    print('{} vs {} {}%').format(percentidentity)
The output should be of this form:
dna1 vs dna2 90%
dna1 vs dna3 100%
dna2 vs dna3 90%
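One other note: if we compare two DNA sequences and one is longer than the other, we use the longer length to calculate the percent identity between the two (where percent identity is the number of matches / the total length).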
I assume you have already created the DNA dictionary successfully, so I have hard-coded it in my example.
The heavy lifting is done by itertools.combinations.
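As a quick illustration of what itertools.combinations yields when it is handed a dictionary (it iterates over the keys), here is a minimal sketch. The dictionary `example` and its placeholder values are made up for the demonstration and are not from the question:

```python
from itertools import combinations

# Hypothetical dictionary, only for illustration; the values are placeholders.
example = {'dna1': 'AAA', 'dna2': 'AAT', 'dna3': 'TTT'}

# Iterating over a dict yields its keys, so combinations() pairs up the keys.
# Each unordered pair appears exactly once, and no key is paired with itself.
pairs = list(combinations(example, 2))
print(pairs)
# [('dna1', 'dna2'), ('dna1', 'dna3'), ('dna2', 'dna3')]
```

Because each pair comes out only once, there is no need to skip "dna2 vs dna1" by hand.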
from itertools import combinations

dna_dict = {
    'dna1': 'TAGTACTGACCATGGCGTTTGTTG',
    'dna2': 'ACCTTGAGATACAAAACGATTGGACTG',
    'dna3': 'GCTTCACTGATGCAGTATTCAATTAACCAG'
}

def PairwiseComparison(d1, d2):
    # Count the positions at which both sequences have the same base
    match = sum(s1 == s2 for s1, s2 in zip(d1, d2))
    # Use the length of the longer sequence as the denominator
    if len(d1) > len(d2):
        lengthchosen = len(d1)
    else:
        lengthchosen = len(d2)
    percentidentity = 100*match/lengthchosen
    return percentidentity

# Creates all the possible combinations of the dictionary keys
dna_combinations = combinations(dna_dict, 2)
for dna1, dna2 in dna_combinations:
    percent_identity = PairwiseComparison(dna_dict[dna1], dna_dict[dna2])
    print(f'{dna1} vs {dna2} {percent_identity}%')  # f-strings need Python >= 3.6; use str.format on older versions
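The expected output in the question shows short percentages, while the loop above prints the raw float. One possible variant, assuming that rounding to one decimal place is acceptable, is sketched below. The names `percent_identity` and `format_pairs` and the `toy` sequences are my own, not from the question. Like the code above, it compares position by position without aligning the sequences first:

```python
from itertools import combinations

def percent_identity(seq1, seq2):
    """Matching positions divided by the longer length, as a percentage."""
    matches = sum(a == b for a, b in zip(seq1, seq2))
    longest = max(len(seq1), len(seq2))
    # Guard against two empty sequences to avoid dividing by zero
    return 100 * matches / longest if longest else 0.0

def format_pairs(sequences):
    """Return one 'name1 vs name2 X%' line per unordered pair of names."""
    return [
        f'{n1} vs {n2} {percent_identity(sequences[n1], sequences[n2]):.1f}%'
        for n1, n2 in combinations(sequences, 2)
    ]

# Made-up toy sequences, short enough to check by hand
toy = {'a': 'ACGT', 'b': 'ACGA', 'c': 'ACGTAC'}
for line in format_pairs(toy):
    print(line)
# a vs b 75.0%
# a vs c 66.7%
# b vs c 50.0%
```

The same `format_pairs` call should work on the dictionary returned by parse_DNAsequences, since it only needs a mapping from names to sequence strings.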
If you need any clarifications or additions, feel free to add a comment.