How do I generate all possible Newick Tree permutations for a set of species given an outgroup?
For those who don't know what the Newick tree format is, a good description can be found here:
https://en.wikipedia.org/wiki/Newick_format
I want to create all possible Newick tree permutations for a set of species given an outgroup. The number of leaf nodes I expect to handle is most likely 4, 5, or 6.
"Soft" and "hard" polytomies are both allowed.
https://en.wikipedia.org/wiki/Polytomy#Soft_polytomies_vs._hard_polytomies
https://biology.stackexchange.com/questions/23667/evidence-discussions-of-hard-polytomy
Below is the desired output, with "E" set as the outgroup.
Desired output:
((("A","B","C"),("D"),("E"));
((("A","B","D"),("C"),("E"));
((("A","C","D"),("B"),("E"));
((("B","C","D"),("A"),("E"));
((("A","B")("C","D"),("E"));
((("A","C")("B","D"),("E"));
((("B","C")("A","D"),("E"));
(("A","B","C","D"),("E"));
(((("A","B"),"C"),"D"),("E"));
However, every possible solution I have tried with itertools, specifically itertools.permutations, has run into the problem of equivalent outputs. The last idea I came up with involves equivalent outputs like those shown below.
Equivalent outputs:
((("C","B","A"),("D"),("E"));
((("C","A","B"),("D"),("E"));
((("A","C","B"),("D"),("E"));
Here is the start of my idea for a solution. However, beyond itertools, I'm really not sure how to proceed with this problem.
import itertools

def Newick_Permutation_Generator(list_of_species, name_of_outgroup):
    permutations_list = list(itertools.permutations(list_of_species))
    for given_permutation in permutations_list:
        process(given_permutation)

Newick_Permutation_Generator(["A","B","C","D","E"], "E")
This was a tough one! Here is my journey.
The first observation is that the outgroup is always a single node appended at the end of the newick string. Let's call the remaining species the ingroup and try to generate all permutations of these, then simply add the outgroup.
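A minimal sketch of that observation: any ingroup subtree can be turned into a full newick string by attaching the outgroup as a sibling at the root (the ingroup newick here is just an illustrative example).

```python
# Sketch: attach the outgroup as a sibling of the ingroup tree at the root.
ingroup_newick = "(A, (B, (C, D)))"
outgroup = "E"
full_newick = "({},({}));".format(ingroup_newick, outgroup)
print(full_newick)  # ((A, (B, (C, D))),(E));
```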
from itertools import permutations

def ingroup_generator(species, n):
    for perm in permutations(species, n):
        yield tuple([tuple(perm), tuple(s for s in species if s not in perm)])

def format_newick(s, outgroup=''):
    return '(' + ', '.join('({})'.format(', '.join(p)) for p in s) + ',({}));'.format(outgroup)

species = ["A","B","C","D","E"]
outgroup = "E"
ingroup = [s for s in species if s != outgroup]

itertools_newicks = []
for n in range(1, len(ingroup)):
    for p in ingroup_generator(ingroup, n):
        itertools_newicks.append(format_newick(p, outgroup))

for newick in itertools_newicks:
    print(newick)
This returns 40 newick strings:
((A), (B, C, D),(E));
((B), (A, C, D),(E));
((C), (A, B, D),(E));
((D), (A, B, C),(E));
((A, B), (C, D),(E));
((A, C), (B, D),(E));
((A, D), (B, C),(E));
((B, A), (C, D),(E));
((B, C), (A, D),(E));
((B, D), (A, C),(E));
((C, A), (B, D),(E));
((C, B), (A, D),(E));
((C, D), (A, B),(E));
((D, A), (B, C),(E));
((D, B), (A, C),(E));
((D, C), (A, B),(E));
((A, B, C), (D),(E));
((A, B, D), (C),(E));
((A, C, B), (D),(E));
((A, C, D), (B),(E));
((A, D, B), (C),(E));
((A, D, C), (B),(E));
((B, A, C), (D),(E));
((B, A, D), (C),(E));
((B, C, A), (D),(E));
((B, C, D), (A),(E));
((B, D, A), (C),(E));
((B, D, C), (A),(E));
((C, A, B), (D),(E));
((C, A, D), (B),(E));
((C, B, A), (D),(E));
((C, B, D), (A),(E));
((C, D, A), (B),(E));
((C, D, B), (A),(E));
((D, A, B), (C),(E));
((D, A, C), (B),(E));
((D, B, A), (C),(E));
((D, B, C), (A),(E));
((D, C, A), (B),(E));
((D, C, B), (A),(E));
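As a sanity check on the count: for each n from 1 to 3 the generator produces one tuple per ordered draw of n ingroup species, so the total is P(4,1) + P(4,2) + P(4,3).

```python
from itertools import permutations

# Number of ordered draws of n species out of the 4 ingroup species
counts = [len(list(permutations("ABCD", n))) for n in range(1, 4)]
print(counts, sum(counts))  # [4, 12, 24] 40
```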
Some of these are duplicates, but we will remove those later.
Fully nested trees like (((("A","B"),"C"),"D"),("E")); and their variants should also be considered valid solutions.
Comments on BioStar pointed me in the right direction: this is the same as generating all the possible groupings of a binary tree. I found a nice Python implementation in this Stack Overflow answer by rici:
# A very simple representation for Nodes. Leaves are anything which is not a Node.
class Node(object):
    def __init__(self, left, right):
        self.left = left
        self.right = right

    def __repr__(self):
        return '(%s, %s)' % (self.left, self.right)

# Given a tree and a label, yields every possible augmentation of the tree by
# adding a new node with the label as a child "above" some existing Node or Leaf.
def add_leaf(tree, label):
    yield Node(label, tree)
    if isinstance(tree, Node):
        for left in add_leaf(tree.left, label):
            yield Node(left, tree.right)
        for right in add_leaf(tree.right, label):
            yield Node(tree.left, right)

# Given a list of labels, yield each rooted, unordered full binary tree with
# the specified labels.
def enum_unordered(labels):
    if len(labels) == 1:
        yield labels[0]
    else:
        for tree in enum_unordered(labels[1:]):
            for new_tree in add_leaf(tree, labels[0]):
                yield new_tree
Then,

enum_newicks = []
for t in enum_unordered(ingroup):
    enum_newicks.append('({},({}));'.format(t, outgroup))

for newick in enum_newicks:
    print(newick)
which produces the following 15 newicks:
((A, (B, (C, D))),(E));
(((A, B), (C, D)),(E));
((B, (A, (C, D))),(E));
((B, ((A, C), D)),(E));
((B, (C, (A, D))),(E));
((A, ((B, C), D)),(E));
(((A, (B, C)), D),(E));
((((A, B), C), D),(E));
(((B, (A, C)), D),(E));
(((B, C), (A, D)),(E));
((A, (C, (B, D))),(E));
(((A, C), (B, D)),(E));
((C, (A, (B, D))),(E));
((C, ((A, B), D)),(E));
((C, (B, (A, D))),(E));
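The count of 15 is expected: the number of rooted, unordered, leaf-labelled binary trees on n leaves is the double factorial (2n-3)!!, which for the 4 ingroup leaves gives 1·3·5 = 15 (quick check, with a helper written for this note).

```python
# Helper for this note: (2n - 3)!! counts rooted binary trees on n labelled leaves.
def num_rooted_binary_trees(n):
    result = 1
    for k in range(3, 2 * n - 2, 2):
        result *= k
    return result

print(num_rooted_binary_trees(4))  # 15
```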
So now we have 40 + 15 = 55 possible newick strings, and we must remove the duplicates.
The first dead end I tried was creating a canonical representation of each newick string so that I could use them as dictionary keys. The idea was to recursively sort the strings in all the nodes. But first I had to capture all the (nested) nodes. I couldn't use regular expressions, because nested structures are by definition not regular.
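The sorting idea itself is simple; here is a sketch on plain nested tuples (a hypothetical canonical helper, not the pyparsing-based version that follows):

```python
# Hypothetical sketch: recursively sort children; key=str avoids
# comparing strings with tuples directly.
def canonical(t):
    if isinstance(t, str):
        return t
    return tuple(sorted((canonical(c) for c in t), key=str))

a = canonical((("B", "A"), ("D", "C")))
b = canonical((("C", "D"), ("A", "B")))
print(a == b)  # True
```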
So I used the pyparsing package and came up with this:
from pyparsing import nestedExpr

def sort_newick(t):
    if isinstance(t, str):
        return sorted(t)
    else:
        if all(isinstance(c, str) for c in t):
            return sorted(t)
        if all(isinstance(l, list) for l in t):
            return [sort_newick(l) for l in sorted(t, key=lambda k: sorted(k))]
        else:
            return [sort_newick(l) for l in t]

def canonical_newick(n):
    n = n.replace(',', '')
    p = nestedExpr().parseString(n).asList()
    s = sort_newick(p)
    return str(s)
This gives us:
from collections import defaultdict

all_newicks = itertools_newicks + enum_newicks

d = defaultdict(list)
for newick in all_newicks:
    d[canonical_newick(newick)].append(newick)

for canonical, newicks in d.items():
    print(canonical)
    for newick in newicks:
        print('\t' + newick)
    print()
a dictionary with 22 keys:
[[[['A'], [['C'], ['B', 'D']]], ['E']]]
((A, (C, (B, D))),(E));
[[[['B'], [['A'], ['C', 'D']]], ['E']]]
((B, (A, (C, D))),(E));
[[[['B'], [['A', 'C'], ['D']]], ['E']]]
((B, ((A, C), D)),(E));
[[['A', 'C', 'D'], ['B'], ['E']]]
((B), (A, C, D),(E));
((A, C, D), (B),(E));
((A, D, C), (B),(E));
((C, A, D), (B),(E));
((C, D, A), (B),(E));
((D, A, C), (B),(E));
((D, C, A), (B),(E));
[[['A', 'B'], ['C', 'D'], ['E']]]
((A, B), (C, D),(E));
((B, A), (C, D),(E));
((C, D), (A, B),(E));
((D, C), (A, B),(E));
[[[[['A'], ['B', 'C']], ['D']], ['E']]]
(((A, (B, C)), D),(E));
[[[['A', 'C'], ['B', 'D']], ['E']]]
(((A, C), (B, D)),(E));
[[['A'], ['B', 'C', 'D'], ['E']]]
((A), (B, C, D),(E));
((B, C, D), (A),(E));
((B, D, C), (A),(E));
((C, B, D), (A),(E));
((C, D, B), (A),(E));
((D, B, C), (A),(E));
((D, C, B), (A),(E));
[[[['A', 'D'], ['B', 'C']], ['E']]]
(((B, C), (A, D)),(E));
[[['A', 'B', 'C'], ['D'], ['E']]]
((D), (A, B, C),(E));
((A, B, C), (D),(E));
((A, C, B), (D),(E));
((B, A, C), (D),(E));
((B, C, A), (D),(E));
((C, A, B), (D),(E));
((C, B, A), (D),(E));
[[['A', 'C'], ['B', 'D'], ['E']]]
((A, C), (B, D),(E));
((B, D), (A, C),(E));
((C, A), (B, D),(E));
((D, B), (A, C),(E));
[[['A', 'B', 'D'], ['C'], ['E']]]
((C), (A, B, D),(E));
((A, B, D), (C),(E));
((A, D, B), (C),(E));
((B, A, D), (C),(E));
((B, D, A), (C),(E));
((D, A, B), (C),(E));
((D, B, A), (C),(E));
[[[['A'], [['B'], ['C', 'D']]], ['E']]]
((A, (B, (C, D))),(E));
[[[[['A', 'B'], ['C']], ['D']], ['E']]]
((((A, B), C), D),(E));
[[[[['B'], ['A', 'C']], ['D']], ['E']]]
(((B, (A, C)), D),(E));
[[[['C'], [['B'], ['A', 'D']]], ['E']]]
((C, (B, (A, D))),(E));
[[[['C'], [['A', 'B'], ['D']]], ['E']]]
((C, ((A, B), D)),(E));
[[[['A'], [['B', 'C'], ['D']]], ['E']]]
((A, ((B, C), D)),(E));
[[[['A', 'B'], ['C', 'D']], ['E']]]
(((A, B), (C, D)),(E));
[[[['B'], [['C'], ['A', 'D']]], ['E']]]
((B, (C, (A, D))),(E));
[[[['C'], [['A'], ['B', 'D']]], ['E']]]
((C, (A, (B, D))),(E));
[[['A', 'D'], ['B', 'C'], ['E']]]
((A, D), (B, C),(E));
((B, C), (A, D),(E));
((C, B), (A, D),(E));
((D, A), (B, C),(E));
But closer inspection revealed some problems. Let's take, as an example, the newicks (((A, B), (C, D)),(E)); and ((D, C), (A, B),(E));. In our dictionary d they have different canonical keys, respectively [[[['A', 'B'], ['C', 'D']], ['E']]] and [[['A', 'B'], ['C', 'D'], ['E']]]. But in fact, these are duplicate trees. We can confirm this by looking at the Robinson-Foulds distance between the two trees: if it is zero, the trees are identical.
We use the robinson_foulds function from the ete3 toolkit package:
from ete3 import Tree

tree1 = Tree('(((A, B), (C, D)),(E));')
tree2 = Tree('((D, C), (A, B),(E));')

rf, max_parts, common_attrs, edges1, edges2, discard_t1, discard_t2 = tree1.robinson_foulds(tree2, unrooted_trees=True)
print(rf)  # returns 0
OK, so Robinson-Foulds is a better way of checking newick tree equality than my canonical tree approach. Let's wrap all the newick strings in a custom MyTree object, where equality is defined as having a Robinson-Foulds distance of zero:
class MyTree(Tree):
    def __init__(self, *args, **kwargs):
        super(MyTree, self).__init__(*args, **kwargs)

    def __eq__(self, other):
        rf = self.robinson_foulds(other, unrooted_trees=True)
        return not bool(rf[0])

trees = [MyTree(newick) for newick in all_newicks]
If we could also define a __hash__() function that returns the same value for duplicate trees, then set(trees) would automatically remove all the duplicates.
Unfortunately, I haven't been able to find a good way to define __hash__(), but with __eq__ in place, I could make use of index():
unique_trees = [trees[i] for i in range(len(trees)) if i == trees.index(trees[i])]
unique_newicks = [tree.write(format=9) for tree in unique_trees]
for unique_newick in unique_newicks:
    print(unique_newick)
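The index()-based deduplication relies only on __eq__: list.index returns the position of the first element that compares equal, so an item is kept only if it is the first member of its equivalence class. The same pattern on a minimal stdlib example (the Pair class here is purely illustrative):

```python
# Illustrative class: equality ignores order, and no __hash__ is defined
# (defining __eq__ makes a class unhashable by default in Python 3).
class Pair:
    def __init__(self, a, b):
        self.items = (a, b)

    def __eq__(self, other):
        return set(self.items) == set(other.items)

pairs = [Pair("A", "B"), Pair("B", "A"), Pair("C", "D")]
unique = [pairs[i] for i in range(len(pairs)) if i == pairs.index(pairs[i])]
print(len(unique))  # 2
```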
So, this is where our journey ends. I can't fully prove that this is the correct solution, but I am quite confident that the following 19 newicks are all the possible distinct permutations:
((A),(B,C,D),(E));
((B),(A,C,D),(E));
((C),(A,B,D),(E));
((D),(A,B,C),(E));
((A,B),(C,D),(E));
((A,C),(B,D),(E));
((A,D),(B,C),(E));
((A,(B,(C,D))),(E));
((B,(A,(C,D))),(E));
((B,((A,C),D)),(E));
((B,(C,(A,D))),(E));
((A,((B,C),D)),(E));
(((A,(B,C)),D),(E));
((((A,B),C),D),(E));
(((B,(A,C)),D),(E));
((A,(C,(B,D))),(E));
((C,(A,(B,D))),(E));
((C,((A,B),D)),(E));
((C,(B,(A,D))),(E));
If we compare each newick pairwise with every other, we can confirm that there are no more duplicates in this list:
from itertools import product

for n1, n2 in product(unique_newicks, repeat=2):
    if n1 != n2:
        mt1 = MyTree(n1)
        mt2 = MyTree(n2)
        assert mt1 != mt2
A tree as a recursive set of sets of leaves
Let's set aside the newick representation for a moment and think about a possible python representation of the problem.
A rooted tree can be viewed as a recursive hierarchy of sets of (sets of (sets of ...)) leaves. Sets are unordered, which is well suited to describing clades in a tree: {{{"A", "B"}, {"C", "D"}}, "E"} should be the same as {{{"C", "D"}, {"B", "A"}}, "E"}.
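This intuition can be checked directly in Python, using frozenset for the nested sets (frozenset is needed because plain sets are unhashable and cannot be nested):

```python
# Two orderings of the same clades compare equal as nested frozensets
t1 = frozenset({frozenset({frozenset({"A", "B"}), frozenset({"C", "D"})}), "E"})
t2 = frozenset({frozenset({frozenset({"C", "D"}), frozenset({"B", "A"})}), "E"})
print(t1 == t2)  # True
```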
If we consider the initial set of leaves {"A", "B", "C", "D", "E"}, the trees with "E" as outgroup are sets of the form {tree, "E"}, where the trees are taken from the set of trees that can be built from the leaf set {"A", "B", "C", "D"}. We can try to write a recursive trees function that generates this set of trees, and our total set of trees can then be expressed as:
{{tree, "E"} for tree in trees({"A", "B", "C", "D"})}
(Here, I'm using set comprehension notation.)
Actually, python doesn't allow sets of sets, because the elements of a set have to be "hashable" (that is, python must be able to compute a "hash" value for the objects in order to check whether they belong to the set), and python sets happen not to have this property. Fortunately, we can use a similar data structure called frozenset, which behaves much like a set but cannot be modified and is "hashable". Our set of trees will therefore be:
all_trees = frozenset(
    {frozenset({tree, "E"}) for tree in trees({"A", "B", "C", "D"})})
Implementing the trees function
Now let's focus on the trees function.
For each possible partition of the set of leaves (its decomposition into a group of disjoint subsets that together include all the elements), we need to find all the possible trees (through a recursive call) for each part of the partition. For a given partition, we will then make a tree for every possible combination of subtrees.
For instance, if the partition is {"A", {"B", "C", "D"}}, we will consider all the possible trees that can be made from the part "A" (actually, just the leaf "A" itself), and all the possible trees that can be made from the part {"B", "C", "D"} (that is, trees({"B", "C", "D"})). The possible trees for this partition are then obtained by taking all the possible pairs where one element comes from just "A", and the other from trees({"B", "C", "D"}).
This generalizes to partitions with more than two parts, and the product function from itertools looks useful here.
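The role of product can be illustrated on a hypothetical partition with one candidate subtree for the part "A" and two candidate subtrees (here just placeholder strings) for the part {"B", "C", "D"}:

```python
from itertools import product

# Placeholder candidate subtrees for each part of the partition
candidates_per_part = [["A"], ["subtree1", "subtree2"]]
combos = list(product(*candidates_per_part))
print(combos)  # [('A', 'subtree1'), ('A', 'subtree2')]
```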
Therefore, we need a way to generate the possible partitions of a set of leaves.
Generating the partitions of a set
Here I made a partitions_of_set function adapted from this solution:
# The problem is solved recursively:
# if you already have a partition of n-1 elements, how do you use it
# to partition n elements? Either place the n'th element in one of the
# existing subsets, or add it as a new, singleton subset.
def partitions_of_set(s):
    if len(s) == 1:
        yield frozenset(s)
        return
    # Extract one element from the set
    elem, *_ = s
    rest = frozenset(s - {elem})
    for partition in partitions_of_set(rest):
        for subset in partition:
            # Insert the element in the subset
            try:
                augmented_subset = frozenset(subset | frozenset({elem}))
            except TypeError:
                # subset is actually an atomic element
                augmented_subset = frozenset({subset} | frozenset({elem}))
            yield frozenset({augmented_subset}) | (partition - {subset})
        # Case with the element in its own extra subset
        yield frozenset({elem}) | partition
To check the partitions we obtain, we make a function that displays them more readably (it will also be useful for making newick representations of the trees later):
def print_set(f):
    if type(f) not in (set, frozenset):
        return str(f)
    return "(" + ",".join(sorted(map(print_set, f))) + ")"
We test that the partitions are valid:
for partition in partitions_of_set({"A", "B", "C", "D"}):
    print(len(partition), print_set(partition))
Output:
1 ((A,B,C,D))
2 ((A,B,D),C)
2 ((A,C),(B,D))
2 ((B,C,D),A)
3 ((B,D),A,C)
2 ((A,B,C),D)
2 ((A,B),(C,D))
3 ((A,B),C,D)
2 ((A,D),(B,C))
2 ((A,C,D),B)
3 ((A,D),B,C)
3 ((A,C),B,D)
3 ((B,C),A,D)
3 ((C,D),A,B)
4 (A,B,C,D)
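As a sanity check, the number of partitions of an n-element set is the Bell number B(n), and B(4) = 15 matches the 15 partitions printed above (helper written for this check, using the Bell-triangle construction):

```python
# Bell-triangle construction: the last entry of row n is B(n)
def bell(n):
    row = [1]
    for _ in range(n - 1):
        new_row = [row[-1]]
        for x in row:
            new_row.append(new_row[-1] + x)
        row = new_row
    return row[-1]

print(bell(4))  # 15
```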
The actual code of the trees function
Now we can write the trees function:
from itertools import product

def trees(leaves):
    if type(leaves) not in (set, frozenset):
        # It actually is a single leaf
        yield leaves
        # Don't try to yield any more trees
        return
    # Otherwise, we will have to consider all the possible
    # partitions of the set of leaves, and for each partition,
    # construct the possible trees for each part
    for partition in partitions_of_set(leaves):
        # We need to skip the case where the partition
        # has only one subset (the initial set itself),
        # otherwise we will try to build an infinite
        # succession of nodes with just one subtree
        if len(partition) == 1:
            part, *_ = partition
            # Just to be sure the assumption is correct
            assert part == leaves
            continue
        # We recursively apply *trees* to each part
        # and obtain the possible trees by making
        # the product of the sets of possible subtrees.
        for subtree in product(*map(trees, partition)):
            # Using a frozenset guarantees
            # that there will be no duplicates
            yield frozenset(subtree)
Testing:
all_trees = frozenset(
    {frozenset({tree, "E"}) for tree in trees({"A", "B", "C", "D"})})

for tree in all_trees:
    print(print_set(tree) + ";")
Output:
(((B,C),A,D),E);
((((A,B),D),C),E);
((((B,D),A),C),E);
((((C,D),A),B),E);
(((A,D),B,C),E);
((A,B,C,D),E);
((((B,D),C),A),E);
(((A,B,C),D),E);
((((A,C),B),D),E);
((((C,D),B),A),E);
((((B,C),A),D),E);
(((A,B),C,D),E);
(((A,C),(B,D)),E);
(((B,D),A,C),E);
(((C,D),A,B),E);
((((A,B),C),D),E);
((((A,C),D),B),E);
(((A,C,D),B),E);
(((A,D),(B,C)),E);
((((A,D),C),B),E);
((((B,C),D),A),E);
(((A,B),(C,D)),E);
(((A,B,D),C),E);
((((A,D),B),C),E);
(((A,C),B,D),E);
(((B,C,D),A),E);
Hopefully the result is correct.
This approach was a bit tricky to get right. It took me some time to figure out how to avoid infinite recursion (which happens when the partition is {{"A", "B", "C", "D"}}).