scipy 中树状图坐标与 ClusterNodes 之间的关系

Relation between dendrogram plot coordinates and ClusterNodes in scipy

我正在寻找一种获取 dendrogram plot based on its ClusterNode return by to_tree 中聚类点坐标的方法。

使用 scipy 从以下数据构建树状图:

X = data
Y = pdist(X)
Z = linkage(Y)
dend = dendrogram(Z)
rootnode, nodesList = to_tree(Z, rd=True)

我想做的是构建一个函数 get_coords(somClusterNode),它将 return 元组 (x, y) 指定图中节点的位置。

感谢this answer,我设法弄清楚如何从树状图return值中获取位置,例如:

i, d = list(zip(dend['icoord'], dend['dcoord']))[-1]
x = 0.5 * sum(i[1:3])
y = d[1]
plt.plot(x, y, 'ro')

但我可以找出节点列表排序和 icoord/dcoord 排序之间的关系,以便将一个映射到另一个。

你知道我可以在哪里寻找吗?

感谢您的帮助!

每个树状图仅映射到一棵 ClusterNodes 树,但任何一棵 ClusterNodes 树都可以映射到无限多的树状图。因此,从节点 ID 到 (x,y) 位置的映射可能应该只是树状图数据结构中的另一个字段,而不是 ClusterNode 的函数。因此,我没有定义函数 get_coords,而是将字典附加到 dend,将节点 ID 映射到 (x,y) 坐标。您可以使用

访问职位
x,y = dend['node_id_to_coord'][node_id] # node_id is an integer as returned by ClusterNode.id

代码:

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, to_tree
from scipy.spatial.distance import pdist

# create some random data
X = np.random.rand(10, 3)

# get dendrogram
Z = linkage(pdist(X), method="ward")
dend = dendrogram(Z)

# ----------------------------------------
# get leave coordinates, which are at y == 0

def flatten(l):
    return [item for sublist in l for item in sublist]
X = flatten(dend['icoord'])
Y = flatten(dend['dcoord'])
leave_coords = [(x,y) for x,y in zip(X,Y) if y==0]

# in the dendogram data structure,
# leave ids are listed in ascending order according to their x-coordinate
order = np.argsort([x for x,y in leave_coords])
id_to_coord = dict(zip(dend['leaves'], [leave_coords[idx] for idx in order])) # <- main data structure

# ----------------------------------------
# get coordinates of other nodes

# this should work but doesn't:

# # traverse tree from leaves upwards and populate mapping ID -> (x,y);
# # use linkage matrix to traverse the tree optimally
# # (each row in the linkage matrix corresponds to a row in dend['icoord'] and dend['dcoord'])
# root_node, node_list = to_tree(Z, rd=True)
# for ii, (X, Y) in enumerate(zip(dend['icoord'], dend['dcoord'])):
#     x = (X[1] + X[2]) / 2
#     y = Y[1] # or Y[2]
#     node_id = ii + len(dend['leaves'])
#     id_to_coord[node_id] = (x, y)

# so we need to do it the hard way:

# map endpoint of each link to coordinates of parent node
children_to_parent_coords = dict()
for i, d in zip(dend['icoord'], dend['dcoord']):
    x = (i[1] + i[2]) / 2
    y = d[1] # or d[2]
    parent_coord = (x, y)
    left_coord = (i[0], d[0])
    right_coord = (i[-1], d[-1])
    children_to_parent_coords[(left_coord, right_coord)] = parent_coord

# traverse tree from leaves upwards and populate mapping ID -> (x,y)
root_node, node_list = to_tree(Z, rd=True)
ids_left = range(len(dend['leaves']), len(node_list))

while len(ids_left) > 0:

    for ii, node_id in enumerate(ids_left):
        node = node_list[node_id]
        if (node.left.id in id_to_coord) and (node.right.id in id_to_coord):
            left_coord = id_to_coord[node.left.id]
            right_coord = id_to_coord[node.right.id]
            id_to_coord[node_id] = children_to_parent_coords[(left_coord, right_coord)]

    ids_left = [node_id for node_id in range(len(node_list)) if not node_id in id_to_coord]

# plot result on top of dendrogram
ax = plt.gca()
for node_id, (x, y) in id_to_coord.iteritems():
    if not node_list[node_id].is_leaf():
        ax.plot(x, y, 'ro')
        ax.annotate(str(node_id), (x, y), xytext=(0, -8),
                    textcoords='offset points',
                    va='top', ha='center')

dend['node_id_to_coord'] = id_to_coord

还有另一种方法:

树状图的 ID 似乎是由从右到左的反向遍历树生成的。这允许我们构建从 node.idicoorddcoord 的索引的转换,如下所示:

def rl_traversal(node):
    # skipping leaves
    if not node.is_leaf():
        yield node.id
        yield from rl_traversal(node.right)
        yield from rl_traversal(node.left)

id_map = dict(zip( rl_traversal(root), reversed(range(root.get_count()-1))) ))
# id_map[node_id] = dendogram_id

节点的坐标可以通过dendo['icoord'][id_map[node_id]]

得到