将距离对转换为距离矩阵以用于层次聚类

Convert distance pairs to distance matrix to use in hierarchical clustering

我正在尝试将字典转换为距离矩阵,然后我可以将其用作层次聚类的输入:我有一个输入:

结果是:

('obj1', 'obj2') 2.0 
('obj3', 'obj4') 1.58
('obj1','obj3') 1.95
('obj2', 'obj3') 1.80

我的问题是如何将其转换为距离矩阵,以便稍后在 scipy 中用于聚类?

使用pandas并解压数据帧:

import pandas as pd

data = {('obj1', 'obj2'): 2.0 ,
('obj3', 'obj4'): 1.58,
('obj1','obj3'): 1.95,
('obj2', 'obj3'): 1.80,}

df = pd.DataFrame.from_dict(data, orient='index')
df.index = pd.MultiIndex.from_tuples(df.index.tolist())
dist_matrix = df.unstack().values

产量

In [15]: dist_matrix
Out[15]:

array([[2.  , 1.95,  nan],
       [ nan, 1.8 ,  nan],
       [ nan,  nan, 1.58]])

这将比发布的其他答案慢,但将确保包括中间对角线上方和下方的值,如果这对您很重要的话:

import pandas as pd

unique_ids = sorted(set([x for y in obj_distance.keys() for x in y]))
df = pd.DataFrame(index=unique_ids, columns=unique_ids)

for k, v in obj_distance.items():
    df.loc[k[0], k[1]] = v
    df.loc[k[1], k[0]] = v

结果:

      obj1 obj2  obj3  obj4
obj1   NaN    2  1.95   NaN
obj2     2  NaN   1.8   NaN
obj3  1.95  1.8   NaN  1.58
obj4   NaN  NaN  1.58   NaN

您说您将使用 scipy 进行聚类,所以我假设这意味着您将使用函数 scipy.cluster.hierarchy.linkage. linkage accepts the distance data in "condensed" form, so you don't have to create the full symmetric distance matrix. (See, e.g., How does condensed distance matrix work? (pdist),以讨论压缩形式。)

因此,您所要做的就是将 obj_distances.values() 转换为已知顺序并将其传递给 linkage。这就是以下代码片段中所做的:

from scipy.cluster.hierarchy import linkage, dendrogram

obj_distances = {
    ('obj2', 'obj3'): 1.8,
    ('obj3', 'obj1'): 1.95,
    ('obj1', 'obj4'): 2.5,
    ('obj1', 'obj2'): 2.0,
    ('obj4', 'obj2'): 2.1,
    ('obj3', 'obj4'): 1.58,
}

# Put each key pair in a canonical order, so we know that if (a, b) is a key,
# then a < b.  If this is already true, then the next three lines can be
# replaced with
#     sorted_keys, distances = zip(*sorted(obj_distances.items()))
# Note: we assume there are no keys where the two objects are the same.
keys = [sorted(k) for k in obj_distances.keys()]
values = obj_distances.values()
sorted_keys, distances = zip(*sorted(zip(keys, values)))

# linkage accepts the "condensed" format of the distances.
Z = linkage(distances)

# Optional: create a sorted list of the objects.
labels = sorted(set([key[0] for key in sorted_keys] + [sorted_keys[-1][-1]]))

dendrogram(Z, labels=labels)

树状图: