Python / ete3:将最密切相关的叶子定位到系统发育树中的特定物种
Python / ete3: Target most closely related leaf to a specific species in a phylogenetic tree
我正在使用 Python 包 ete3
。我有树,例如:
((Species1_order1,(Species2_order2,Species3_order2)),Species4_order3,Species5_order5);
我希望看到与树中特定节点最相关的叶子(这里的树是 Species1_order1
)。
在示例中,最密切相关的叶子是 Species2_order2
/ Species3_order2
和 Species4_order3
/Species5_order5
.
代码:
tree = ete3.Tree('((Species1_order1, \
(Species2_order2, Species3_order2)), \
Species4_order3, Species5_order5);')
新例子:
tree=ete3.Tree('((((((A,B),C),D),(E,F)),G),(H,I));')
我得到的结果是:
A B C D E F G H I
A 0.0 2.0 3.0 4.0 6.0 6.0 6.0 8.0 8.0
B 2.0 0.0 3.0 4.0 6.0 6.0 6.0 8.0 8.0
C 3.0 3.0 0.0 3.0 5.0 5.0 5.0 7.0 7.0
D 4.0 4.0 3.0 0.0 4.0 4.0 4.0 6.0 6.0
E 6.0 6.0 5.0 4.0 0.0 2.0 4.0 6.0 6.0
F 6.0 6.0 5.0 4.0 2.0 0.0 4.0 6.0 6.0
G 6.0 6.0 5.0 4.0 4.0 4.0 0.0 4.0 4.0
H 8.0 8.0 7.0 6.0 6.0 6.0 4.0 0.0 2.0
I 8.0 8.0 7.0 6.0 6.0 6.0 4.0 2.0 0.0
但是例如,E 和 F 与树中的 A、B、C 和 D 的距离相等,结果它们看起来比 D 更远。
一个好的矩阵结果应该是:
A B C D E F G H I
A 0 1 2 3 4 4 5 6 6
B 1 0 2 3 4 4 5 6 6
C 2 2 0 3 4 4 5 6 6
D 3 3 3 0 4 4 5 6 6
E 4 4 4 4 0 1 5 6 6
F 4 4 4 4 1 0 5 6 6
G 5 5 5 5 5 5 0 6 6
H 6 6 6 6 6 6 6 0 1
I 6 6 6 6 6 6 6 1 0
是吧?
正如评论中所讨论的,ete3
给了我们一个名为 Tree.get_closest_leaf
的函数,但它的输出不是预期的(我不确定这个值在这里代表什么):
>>> t=ete3.Tree('((Species1_order1,(Species2_order2,Species3_order2)),Species4_order3,Species5_order5);')
>>> t.get_closest_leaf('Species2_order2')
(Tree node 'Species4_order3' (0x115b2f29), 0.0)
相反,您可以像这样获得节点距离:
import ete3
import pandas as pd
def make_matrix(tree):
def get_root_path(node):
root_path = [node]
if node.up:
root_path.extend(get_root_path(node.up))
return root_path
leaves = tree.get_leaves()
leaf_ct = len(leaves)
paths = {node.name: set(get_root_path(node)) for node in leaves}
col_lbls = [leaf.name for leaf in leaves]
dist_matrix = pd.np.array([pd.np.zeros(leaf_ct)] * leaf_ct)
df = pd.DataFrame(dist_matrix, index=col_lbls, columns=col_lbls)
for node1_name, col in df.iteritems():
for node2_name in col.keys():
path = paths[node2_name].symmetric_difference(paths[node1_name])
dist = sum(node.dist for node in path)
df.at[node1_name, node2_name] = dist
df.at[node2_name, node1_name] = dist
return df
注意:出于多种原因,这是一个次优的解决方案,但这个问题并不是要求最有效的解决方案。有关系统发育距离矩阵方法的更多信息,请参阅 this link。
此解决方案还使用了 pandas
,这有点矫枉过正,因为它实际上只是为了方便 row/column 标签。删除 pandas
依赖项并改为使用本机列表并不难。
这是输出:
>>> tree=ete3.Tree('((Species1_order1, (Species2_order2, Species3_order2)), Species4_order3, Species5_order5);')
>>> make_matrix(tree)
Species1_order1 Species2_order2 Species3_order2 Species4_order3 Species5_order5
Species1_order1 0.0 3.0 3.0 3.0 3.0
Species2_order2 3.0 0.0 2.0 4.0 4.0
Species3_order2 3.0 2.0 0.0 4.0 4.0
Species4_order3 3.0 4.0 4.0 0.0 2.0
Species5_order5 3.0 4.0 4.0 2.0 0.0
对于发布的更新,我没有发现任何错误。它似乎给出了正确的结果。这是 ete3 渲染的树(我突出显示了从 Interest_sequence
到 Rhopalosiphum_maidis_Hemiptera
的距离内计算的 4 跳):
这里是 Interest_sequence
对应的矩阵列:
>>> m['Interest_sequence']
Rhopalosiphum_maidis__Hemiptera 4.0
Drosophila_novamexicana__Hemiptera 5.0
Drosophila_arizonae__Hemiptera 6.0
Drosophila_navojoa__Hemiptera 6.0
Interest_sequence 0.0
Heliothis_virescens_droso_3a__nan 5.0
Mythimna_separata_droso__nan 6.0
Heliothis_virescens_droso_3i__nan 6.0
Scaptodrosophila_lebanonensis__Diptera 5.0
Mythimna_unipuncta_droso_A__nan 6.0
Xestia_c-nigrum_droso__nan 8.0
Helicoverpa_armigera_droso__nan 8.0
Mocis_latipes_droso__nan 7.0
Drosophila_busckii__Diptera 4.0
Drosophila_bipectinata__Diptera 5.0
Drosophila_mojavensis__Diptera 7.0
Drosophila_yakuba__Diptera 7.0
Drosophila_hydei__Diptera 7.0
Drosophila_serrata__Diptera 8.0
Drosophila_takahashii__Diptera 9.0
Drosophila_eugracilis__Diptera 11.0
Drosophila_ficusphila__Diptera 11.0
Drosophila_erecta__Diptera 12.0
Drosophila_melanogaster__Diptera 13.0
Sequence_A_nan__nan 14.0
Drosophila_sechellia__Diptera 15.0
Drosophila_simulans__Diptera 15.0
Drosophila_suzukii__Diptera 12.0
Drosophila_biarmipes__Diptera 12.0
Name: Interest_sequence, dtype: float64
ete3 函数按预期工作,您只需要按如下方式遍历树的节点:
from ete3 import Tree t = Tree() # Creates an empty tree
A = t.add_child(name="A") # Adds a new child to the current tree root and returns it
B = t.add_child(name="B") # Adds a second child to the current tree root and returns it
C = A.add_child(name="C") # Adds a new child to one of the branches D =
C.add_sister(name="D") # Adds a second child to same branch as before, but using a sister as the starting point
R = A.add_child(name="R") # Adds a third child to the branch. Multifurcations are supported
# Next, I add 6 random leaves to the R branch names_library is an optional argument. If no names are provided, they will be generated randomly.
R.populate(6, names_library=["r1","r2","r3","r4","r5","r6"])
for node in t.traverse():
print(node.get_closest_leaf())
注意: 如果内部节点是终端节点,它将有 2 个最近的叶子,用 node.get_children()
获取它们
我正在使用 Python 包 ete3
。我有树,例如:
((Species1_order1,(Species2_order2,Species3_order2)),Species4_order3,Species5_order5);
我希望看到与树中特定节点最相关的叶子(这里的树是 Species1_order1
)。
在示例中,最密切相关的叶子是 Species2_order2
/ Species3_order2
和 Species4_order3
/Species5_order5
.
代码:
tree = ete3.Tree('((Species1_order1, \
(Species2_order2, Species3_order2)), \
Species4_order3, Species5_order5);')
新例子:
tree=ete3.Tree('((((((A,B),C),D),(E,F)),G),(H,I));')
我得到的结果是:
A B C D E F G H I
A 0.0 2.0 3.0 4.0 6.0 6.0 6.0 8.0 8.0
B 2.0 0.0 3.0 4.0 6.0 6.0 6.0 8.0 8.0
C 3.0 3.0 0.0 3.0 5.0 5.0 5.0 7.0 7.0
D 4.0 4.0 3.0 0.0 4.0 4.0 4.0 6.0 6.0
E 6.0 6.0 5.0 4.0 0.0 2.0 4.0 6.0 6.0
F 6.0 6.0 5.0 4.0 2.0 0.0 4.0 6.0 6.0
G 6.0 6.0 5.0 4.0 4.0 4.0 0.0 4.0 4.0
H 8.0 8.0 7.0 6.0 6.0 6.0 4.0 0.0 2.0
I 8.0 8.0 7.0 6.0 6.0 6.0 4.0 2.0 0.0
但是例如,E 和 F 与树中的 A、B、C 和 D 的距离相等,结果它们看起来比 D 更远。
一个好的矩阵结果应该是:
A B C D E F G H I
A 0 1 2 3 4 4 5 6 6
B 1 0 2 3 4 4 5 6 6
C 2 2 0 3 4 4 5 6 6
D 3 3 3 0 4 4 5 6 6
E 4 4 4 4 0 1 5 6 6
F 4 4 4 4 1 0 5 6 6
G 5 5 5 5 5 5 0 6 6
H 6 6 6 6 6 6 6 0 1
I 6 6 6 6 6 6 6 1 0
是吧?
正如评论中所讨论的,ete3
给了我们一个名为 Tree.get_closest_leaf
的函数,但它的输出不是预期的(我不确定这个值在这里代表什么):
>>> t=ete3.Tree('((Species1_order1,(Species2_order2,Species3_order2)),Species4_order3,Species5_order5);')
>>> t.get_closest_leaf('Species2_order2')
(Tree node 'Species4_order3' (0x115b2f29), 0.0)
相反,您可以像这样获得节点距离:
import ete3
import pandas as pd
def make_matrix(tree):
def get_root_path(node):
root_path = [node]
if node.up:
root_path.extend(get_root_path(node.up))
return root_path
leaves = tree.get_leaves()
leaf_ct = len(leaves)
paths = {node.name: set(get_root_path(node)) for node in leaves}
col_lbls = [leaf.name for leaf in leaves]
dist_matrix = pd.np.array([pd.np.zeros(leaf_ct)] * leaf_ct)
df = pd.DataFrame(dist_matrix, index=col_lbls, columns=col_lbls)
for node1_name, col in df.iteritems():
for node2_name in col.keys():
path = paths[node2_name].symmetric_difference(paths[node1_name])
dist = sum(node.dist for node in path)
df.at[node1_name, node2_name] = dist
df.at[node2_name, node1_name] = dist
return df
注意:出于多种原因,这是一个次优的解决方案,但这个问题并不是要求最有效的解决方案。有关系统发育距离矩阵方法的更多信息,请参阅 this link。
此解决方案还使用了 pandas
,这有点矫枉过正,因为它实际上只是为了方便 row/column 标签。删除 pandas
依赖项并改为使用本机列表并不难。
这是输出:
>>> tree=ete3.Tree('((Species1_order1, (Species2_order2, Species3_order2)), Species4_order3, Species5_order5);')
>>> make_matrix(tree)
Species1_order1 Species2_order2 Species3_order2 Species4_order3 Species5_order5
Species1_order1 0.0 3.0 3.0 3.0 3.0
Species2_order2 3.0 0.0 2.0 4.0 4.0
Species3_order2 3.0 2.0 0.0 4.0 4.0
Species4_order3 3.0 4.0 4.0 0.0 2.0
Species5_order5 3.0 4.0 4.0 2.0 0.0
对于发布的更新,我没有发现任何错误。它似乎给出了正确的结果。这是 ete3 渲染的树(我突出显示了从 Interest_sequence
到 Rhopalosiphum_maidis_Hemiptera
的距离内计算的 4 跳):
这里是 Interest_sequence
对应的矩阵列:
>>> m['Interest_sequence']
Rhopalosiphum_maidis__Hemiptera 4.0
Drosophila_novamexicana__Hemiptera 5.0
Drosophila_arizonae__Hemiptera 6.0
Drosophila_navojoa__Hemiptera 6.0
Interest_sequence 0.0
Heliothis_virescens_droso_3a__nan 5.0
Mythimna_separata_droso__nan 6.0
Heliothis_virescens_droso_3i__nan 6.0
Scaptodrosophila_lebanonensis__Diptera 5.0
Mythimna_unipuncta_droso_A__nan 6.0
Xestia_c-nigrum_droso__nan 8.0
Helicoverpa_armigera_droso__nan 8.0
Mocis_latipes_droso__nan 7.0
Drosophila_busckii__Diptera 4.0
Drosophila_bipectinata__Diptera 5.0
Drosophila_mojavensis__Diptera 7.0
Drosophila_yakuba__Diptera 7.0
Drosophila_hydei__Diptera 7.0
Drosophila_serrata__Diptera 8.0
Drosophila_takahashii__Diptera 9.0
Drosophila_eugracilis__Diptera 11.0
Drosophila_ficusphila__Diptera 11.0
Drosophila_erecta__Diptera 12.0
Drosophila_melanogaster__Diptera 13.0
Sequence_A_nan__nan 14.0
Drosophila_sechellia__Diptera 15.0
Drosophila_simulans__Diptera 15.0
Drosophila_suzukii__Diptera 12.0
Drosophila_biarmipes__Diptera 12.0
Name: Interest_sequence, dtype: float64
ete3 函数按预期工作,您只需要按如下方式遍历树的节点:
from ete3 import Tree t = Tree() # Creates an empty tree
A = t.add_child(name="A") # Adds a new child to the current tree root and returns it
B = t.add_child(name="B") # Adds a second child to the current tree root and returns it
C = A.add_child(name="C") # Adds a new child to one of the branches D =
C.add_sister(name="D") # Adds a second child to same branch as before, but using a sister as the starting point
R = A.add_child(name="R") # Adds a third child to the branch. Multifurcations are supported
# Next, I add 6 random leaves to the R branch names_library is an optional argument. If no names are provided, they will be generated randomly.
R.populate(6, names_library=["r1","r2","r3","r4","r5","r6"])
for node in t.traverse():
print(node.get_closest_leaf())
注意: 如果内部节点是终端节点,它将有 2 个最近的叶子,用 node.get_children()
获取它们