Python / ete3:将最密切相关的叶子定位到系统发育树中的特定物种

Python / ete3: Target most closely related leaf to a specific species in a phylogenetic tree

我正在使用 Python 包 ete3。我有树,例如:

((Species1_order1,(Species2_order2,Species3_order2)),Species4_order3,Species5_order5);

我希望看到与树中特定节点最相关的叶子(这里的树是 Species1_order1)。 在示例中,最密切相关的叶子是 Species2_order2/ Species3_order2Species4_order3/Species5_order5.

代码:

tree = ete3.Tree('((Species1_order1, \
                    (Species2_order2, Species3_order2)), \
                   Species4_order3, Species5_order5);')

新例子:

    tree=ete3.Tree('((((((A,B),C),D),(E,F)),G),(H,I));')

我得到的结果是:

         A    B    C    D    E    F    G    H    I
    A  0.0  2.0  3.0  4.0  6.0  6.0  6.0  8.0  8.0
    B  2.0  0.0  3.0  4.0  6.0  6.0  6.0  8.0  8.0
    C  3.0  3.0  0.0  3.0  5.0  5.0  5.0  7.0  7.0
    D  4.0  4.0  3.0  0.0  4.0  4.0  4.0  6.0  6.0
    E  6.0  6.0  5.0  4.0  0.0  2.0  4.0  6.0  6.0
    F  6.0  6.0  5.0  4.0  2.0  0.0  4.0  6.0  6.0
    G  6.0  6.0  5.0  4.0  4.0  4.0  0.0  4.0  4.0
    H  8.0  8.0  7.0  6.0  6.0  6.0  4.0  0.0  2.0
    I  8.0  8.0  7.0  6.0  6.0  6.0  4.0  2.0  0.0

但是例如,E 和 F 与树中的 A、B、C 和 D 的距离相等,结果它们看起来比 D 更远。

一个好的矩阵结果应该是:


        A   B   C   D   E   F   G   H   I
    A   0   1   2   3   4   4   5   6   6
    B   1   0   2   3   4   4   5   6   6
    C   2   2   0   3   4   4   5   6   6
    D   3   3   3   0   4   4   5   6   6
    E   4   4   4   4   0   1   5   6   6
    F   4   4   4   4   1   0   5   6   6
    G   5   5   5   5   5   5   0   6   6
    H   6   6   6   6   6   6   6   0   1
    I   6   6   6   6   6   6   6   1   0

是吧?

正如评论中所讨论的,ete3 给了我们一个名为 Tree.get_closest_leaf 的函数,但它的输出不是预期的(我不确定这个值在这里代表什么):

>>> t=ete3.Tree('((Species1_order1,(Species2_order2,Species3_order2)),Species4_order3,Species5_order5);')
>>> t.get_closest_leaf('Species2_order2')
(Tree node 'Species4_order3' (0x115b2f29), 0.0)

相反,您可以像这样获得节点距离:

import ete3
import pandas as pd

def make_matrix(tree):
    def get_root_path(node):
        root_path = [node]
        if node.up:
            root_path.extend(get_root_path(node.up))
        return root_path
    leaves = tree.get_leaves()
    leaf_ct = len(leaves)
    paths = {node.name: set(get_root_path(node)) for node in leaves}
    col_lbls = [leaf.name for leaf in leaves]
    dist_matrix = pd.np.array([pd.np.zeros(leaf_ct)] * leaf_ct)
    df = pd.DataFrame(dist_matrix, index=col_lbls, columns=col_lbls)
    for node1_name, col in df.iteritems():
        for node2_name in col.keys():
            path = paths[node2_name].symmetric_difference(paths[node1_name])
            dist = sum(node.dist for node in path)
            df.at[node1_name, node2_name] = dist
            df.at[node2_name, node1_name] = dist
    return df

注意:出于多种原因,这是一个次优的解决方案,但这个问题并不是要求最有效的解决方案。有关系统发育距离矩阵方法的更多信息,请参阅 this link

此解决方案还使用了 pandas,这有点矫枉过正,因为它实际上只是为了方便 row/column 标签。删除 pandas 依赖项并改为使用本机列表并不难。

这是输出:

>>> tree=ete3.Tree('((Species1_order1, (Species2_order2, Species3_order2)), Species4_order3, Species5_order5);')
>>> make_matrix(tree)
                 Species1_order1  Species2_order2  Species3_order2  Species4_order3  Species5_order5
Species1_order1              0.0              3.0              3.0              3.0              3.0
Species2_order2              3.0              0.0              2.0              4.0              4.0
Species3_order2              3.0              2.0              0.0              4.0              4.0
Species4_order3              3.0              4.0              4.0              0.0              2.0
Species5_order5              3.0              4.0              4.0              2.0              0.0

对于发布的更新,我没有发现任何错误。它似乎给出了正确的结果。这是 ete3 渲染的树(我突出显示了从 Interest_sequenceRhopalosiphum_maidis_Hemiptera 的距离内计算的 4 跳):

这里是 Interest_sequence 对应的矩阵列:

>>> m['Interest_sequence']
Rhopalosiphum_maidis__Hemiptera            4.0
Drosophila_novamexicana__Hemiptera         5.0
Drosophila_arizonae__Hemiptera             6.0
Drosophila_navojoa__Hemiptera              6.0
Interest_sequence                          0.0
Heliothis_virescens_droso_3a__nan          5.0
Mythimna_separata_droso__nan               6.0
Heliothis_virescens_droso_3i__nan          6.0
Scaptodrosophila_lebanonensis__Diptera     5.0
Mythimna_unipuncta_droso_A__nan            6.0
Xestia_c-nigrum_droso__nan                 8.0
Helicoverpa_armigera_droso__nan            8.0
Mocis_latipes_droso__nan                   7.0
Drosophila_busckii__Diptera                4.0
Drosophila_bipectinata__Diptera            5.0
Drosophila_mojavensis__Diptera             7.0
Drosophila_yakuba__Diptera                 7.0
Drosophila_hydei__Diptera                  7.0
Drosophila_serrata__Diptera                8.0
Drosophila_takahashii__Diptera             9.0
Drosophila_eugracilis__Diptera            11.0
Drosophila_ficusphila__Diptera            11.0
Drosophila_erecta__Diptera                12.0
Drosophila_melanogaster__Diptera          13.0
Sequence_A_nan__nan                       14.0
Drosophila_sechellia__Diptera             15.0
Drosophila_simulans__Diptera              15.0
Drosophila_suzukii__Diptera               12.0
Drosophila_biarmipes__Diptera             12.0
Name: Interest_sequence, dtype: float64

ete3 函数按预期工作,您只需要按如下方式遍历树的节点

from ete3 import Tree t = Tree() # Creates an empty tree 
A = t.add_child(name="A") # Adds a new child to the current tree root and returns it 
B = t.add_child(name="B") # Adds a second child to the current tree root and returns it 
C = A.add_child(name="C") # Adds a new child to one of the branches D =
C.add_sister(name="D") # Adds a second child to same branch as before, but using a sister as the starting point 
R = A.add_child(name="R") # Adds a third child to the branch. Multifurcations are supported
# Next, I add 6 random leaves to the R branch names_library is an optional argument. If no names are provided, they will be generated randomly. 
R.populate(6, names_library=["r1","r2","r3","r4","r5","r6"])
for node in t.traverse():
    print(node.get_closest_leaf())

注意: 如果内部节点是终端节点,它将有 2 个最近的叶子,用 node.get_children()

获取它们