Convert distance pairs to a distance matrix for use in hierarchical clustering
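I am trying to convert a dictionary to a distance matrix that I can then use as input to hierarchical clustering. My input is a dictionary with:
- key: a tuple of length 2 holding the two objects the distance is between
- value: the actual distance value
for k, v in obj_distances.items():
    print(k, v)
and the result is: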
('obj1', 'obj2') 2.0
('obj3', 'obj4') 1.58
('obj1', 'obj3') 1.95
('obj2', 'obj3') 1.8
My question is: how can I convert this into a distance matrix that I can later use for clustering in scipy?
Use pandas and unstack the DataFrame:
import pandas as pd

data = {
    ('obj1', 'obj2'): 2.0,
    ('obj3', 'obj4'): 1.58,
    ('obj1', 'obj3'): 1.95,
    ('obj2', 'obj3'): 1.80,
}
df = pd.DataFrame.from_dict(data, orient='index')
# Turn the tuple keys into a two-level index so unstack() can pivot on them.
df.index = pd.MultiIndex.from_tuples(df.index.tolist())
dist_matrix = df.unstack().values
This yields:
In [15]: dist_matrix
Out[15]:
array([[2.  , 1.95,  nan],
       [ nan, 1.8 ,  nan],
       [ nan,  nan, 1.58]])
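The array above has rows obj1 to obj3 and columns obj2 to obj4, so it is neither square nor symmetric. Here is a minimal sketch of one way to square it up, assuming the same four example pairs: reindex both axes to the full label set, then fill the missing half from the transpose. The names `labels`, `upper`, `matrix` and `full` are illustrative, and the zero diagonal is an assumption. Pairs that were never measured stay NaN.

```python
import numpy as np
import pandas as pd

data = {
    ('obj1', 'obj2'): 2.0,
    ('obj3', 'obj4'): 1.58,
    ('obj1', 'obj3'): 1.95,
    ('obj2', 'obj3'): 1.80,
}

# A Series built from tuple keys gets a two-level MultiIndex.
s = pd.Series(data)

# Every object name that appears in any pair, in sorted order.
labels = sorted({name for pair in data for name in pair})

# Rows = first element of each pair, columns = second element,
# reindexed so that both axes cover all objects.
upper = s.unstack().reindex(index=labels, columns=labels)

# Fill each missing cell from its mirror image across the diagonal.
full = upper.combine_first(upper.T)

# Assumption: the distance from an object to itself is zero
# (the input dictionary does not state it).
matrix = full.to_numpy(dtype=float, copy=True)
np.fill_diagonal(matrix, 0.0)
full = pd.DataFrame(matrix, index=labels, columns=labels)
print(full)
```

The cells for obj1/obj4 and obj2/obj4 remain NaN because the example dictionary has no distance for them. They would need real values before the matrix could be used for clustering.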
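This will be slower than the other answers posted, but it ensures that the values both above and below the main diagonal are filled in, if that matters to you: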
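import pandas as pd

# All object names that occur in any key, in sorted order.
unique_ids = sorted(set([x for y in obj_distances.keys() for x in y]))
df = pd.DataFrame(index=unique_ids, columns=unique_ids)
for k, v in obj_distances.items():
    # Write each distance on both sides of the diagonal.
    df.loc[k[0], k[1]] = v
    df.loc[k[1], k[0]] = v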
Result:
      obj1 obj2  obj3  obj4
obj1   NaN    2  1.95   NaN
obj2     2  NaN   1.8   NaN
obj3  1.95  1.8   NaN  1.58
obj4   NaN  NaN  1.58   NaN
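If a full square matrix like this is what you end up with, `scipy.spatial.distance.squareform` converts it to the condensed vector that `scipy.cluster.hierarchy.linkage` accepts. The following is a hedged sketch, not a definitive recipe. It assumes every pair has a distance and that self-distances are zero. The values 2.5 and 2.1 for obj1/obj4 and obj2/obj4 are illustrative stand-ins for the NaN cells, and `position`, `square` and `condensed` are names chosen for the example.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

obj_distances = {
    ('obj1', 'obj2'): 2.0,
    ('obj1', 'obj3'): 1.95,
    ('obj1', 'obj4'): 2.5,   # illustrative value
    ('obj2', 'obj3'): 1.80,
    ('obj2', 'obj4'): 2.1,   # illustrative value
    ('obj3', 'obj4'): 1.58,
}

labels = sorted({name for pair in obj_distances for name in pair})
position = {name: i for i, name in enumerate(labels)}

# Full symmetric matrix with a zero diagonal.
square = np.zeros((len(labels), len(labels)))
for (a, b), d in obj_distances.items():
    square[position[a], position[b]] = d
    square[position[b], position[a]] = d

# squareform checks symmetry and the zero diagonal, then returns the
# upper triangle, row by row, as a 1-D "condensed" vector.
condensed = squareform(square)

# method='single' is linkage's default; it is written out for clarity.
Z = linkage(condensed, method='single')
print(condensed)
print(Z)
```

Going through the square matrix is a detour if the pairs are already in a dictionary, but it is convenient when the matrix already exists, for example as the DataFrame above converted with `to_numpy()`.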
You said you will use scipy for the clustering, so I assume that means you will use the function scipy.cluster.hierarchy.linkage. linkage accepts the distance data in "condensed" form, so you don't have to create the full symmetric distance matrix. (See, e.g., "How does condensed distance matrix work? (pdist)" for a discussion of the condensed form.)
So all you have to do is get obj_distances.values() into a known order and pass that to linkage. That is what the following snippet does:
import matplotlib.pyplot as plt
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram

obj_distances = {
    ('obj2', 'obj3'): 1.8,
    ('obj3', 'obj1'): 1.95,
    ('obj1', 'obj4'): 2.5,
    ('obj1', 'obj2'): 2.0,
    ('obj4', 'obj2'): 2.1,
    ('obj3', 'obj4'): 1.58,
}

# Put each key pair in a canonical order, so we know that if (a, b) is a key,
# then a < b. If this is already true, then the next three lines can be
# replaced with
#     sorted_keys, distances = zip(*sorted(obj_distances.items()))
# Note: we assume there are no keys where the two objects are the same,
# and that every pair of objects has a distance.
keys = [tuple(sorted(k)) for k in obj_distances.keys()]
values = obj_distances.values()
sorted_keys, distances = zip(*sorted(zip(keys, values)))

# linkage accepts the "condensed" format of the distances.
Z = linkage(np.asarray(distances, dtype=float))

# Optional: create a sorted list of the objects to label the leaves.
labels = sorted({name for key in sorted_keys for name in key})

dendrogram(Z, labels=labels)
plt.show()
This draws the dendrogram for the four objects.
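Sorting the keys gives the condensed order only when every pair is present. If one is missing, the vector is simply too short or misaligned. Below is a hedged sketch of a helper (the name `to_condensed` is hypothetical) that builds the vector explicitly in the order (0, 1), (0, 2), ..., (1, 2), ... and raises an error for a missing pair instead of silently shifting the values:

```python
from itertools import combinations

import numpy as np


def to_condensed(pair_distances):
    """Return (labels, condensed) for a dict keyed by 2-tuples of names.

    Hypothetical helper: the order of the names inside a key does not
    matter, but every unordered pair of objects must be present.
    """
    # Normalise each key so that (a, b) and (b, a) are the same pair.
    lookup = {frozenset(pair): d for pair, d in pair_distances.items()}
    labels = sorted({name for pair in pair_distances for name in pair})

    condensed = []
    # combinations() walks the upper triangle row by row, which is the
    # condensed ordering used by pdist and linkage.
    for a, b in combinations(labels, 2):
        key = frozenset((a, b))
        if key not in lookup:
            raise ValueError(f"missing distance for pair ({a}, {b})")
        condensed.append(lookup[key])
    return labels, np.asarray(condensed, dtype=float)


obj_distances = {
    ('obj2', 'obj3'): 1.8,
    ('obj3', 'obj1'): 1.95,
    ('obj1', 'obj4'): 2.5,
    ('obj1', 'obj2'): 2.0,
    ('obj4', 'obj2'): 2.1,
    ('obj3', 'obj4'): 1.58,
}
labels, condensed = to_condensed(obj_distances)
print(labels)
print(condensed)
```

The returned `condensed` array can be passed straight to `linkage`, and `labels` to `dendrogram(Z, labels=labels)`.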