Triangle vs. Square distance matrix for Hierarchical Clustering Python?

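I have been experimenting with hierarchical clustering, and in R it is as simple as hclust(as.dist(X), method="average"). I found a way to do it in Python that is also quite simple, except that I am a bit confused about what is going on with the input distance matrix.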

I have a similarity matrix (DF_c93tom, with a smaller test version called DF_sim) that I convert into a dissimilarity matrix: DF_dissm = 1 - DF_sim

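I use this as the input to linkage from scipy, but the documentation says it takes either a square or a triangular matrix. I get a different clustering for each of the lower triangle, the upper triangle, and the square matrix. Why is this? The documentation asks for the upper triangle, but the lower-triangle clustering looks very similar.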

My question is: why are all the clusterings different, and which one is correct?

Here is the documentation for the input distance matrix of linkage:

y : ndarray
A condensed or redundant distance matrix. A condensed distance matrix is a flat array containing the upper triangular of the distance matrix. 

Here is my code:

import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import dendrogram, linkage

%matplotlib inline

#Test Data
DF_sim = DF_c93tom.iloc[:10,:10] #Similarity Matrix
DF_sim.columns = DF_sim.index = range(10) 
#print(DF_sim)
#           0  1         2         3  4  5  6  7  8  9
# 0  1.000000  0  0.395833  0.083333  0  0  0  0  0  0
# 1  0.000000  1  0.000000  0.000000  0  0  0  0  0  0
# 2  0.395833  0  1.000000  0.883792  0  0  0  0  0  0
# 3  0.083333  0  0.883792  1.000000  0  0  0  0  0  0
# 4  0.000000  0  0.000000  0.000000  1  0  0  0  0  0
# 5  0.000000  0  0.000000  0.000000  0  1  0  0  0  0
# 6  0.000000  0  0.000000  0.000000  0  0  1  0  0  0
# 7  0.000000  0  0.000000  0.000000  0  0  0  1  0  0
# 8  0.000000  0  0.000000  0.000000  0  0  0  0  1  0
# 9  0.000000  0  0.000000  0.000000  0  0  0  0  0  1

#Dissimilarity Matrix
DF_dissm = 1 - DF_sim

#Redundant Matrix
#np.tril(DF_dissm).T == np.triu(DF_dissm)
#True for all values

#Hierarchical Clustering for square and triangle matrices
fig_1 = plt.figure(1)
plt.title("Square")
Z_square = linkage((DF_dissm.values),method="average")
dendrogram(Z_square)

fig_2 = plt.figure(2)
plt.title("Triangle Upper")
Z_triu = linkage(np.triu(DF_dissm.values),method="average")
dendrogram(Z_triu)

fig_3 = plt.figure(3)
plt.title("Triangle Lower")
Z_tril = linkage(np.tril(DF_dissm.values),method="average")
dendrogram(Z_tril)

plt.show()

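When a 2D array is passed as the first argument to scipy.cluster.hierarchy.linkage, it is treated as a sequence of observations, and scipy.spatial.distance.pdist is used to convert it into a sequence of pairwise distances between those observations.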

There is a GitHub issue about this behavior, because it means that passing a "distance matrix" such as DF_dissm.values (silently) produces incorrect results.
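The behavior can be checked on a small case. The sketch below assumes a hypothetical 4x4 dissimilarity matrix D (not taken from the question) and compares three calls: the square matrix passed directly, pdist applied to its rows, and the condensed form of the same distances. The warning filter is there because recent SciPy versions may emit a warning when a 2D input looks like a distance matrix.

```python
import warnings

import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist, squareform

# Hypothetical symmetric dissimilarity matrix with a zero diagonal.
D = np.array([
    [0.0, 0.2, 0.9, 0.8],
    [0.2, 0.0, 0.7, 0.6],
    [0.9, 0.7, 0.0, 0.1],
    [0.8, 0.6, 0.1, 0.0],
])

with warnings.catch_warnings():
    # Silence the possible "looks like a distance matrix" warning.
    warnings.simplefilter("ignore")
    Z_square = linkage(D, method="average")

# Rows of D treated as 4 observations in 4-dimensional space.
Z_rows = linkage(pdist(D), method="average")

# Condensed form: the distances in D are used as given.
Z_condensed = linkage(squareform(D), method="average")

# The square input matches the "rows as observations" result...
same_as_rows = np.allclose(Z_square, Z_rows)
# ...and its merge heights differ from those of the intended distances.
same_as_condensed = np.allclose(Z_square[:, 2], Z_condensed[:, 2])
print(same_as_rows, same_as_condensed)
```

In this sketch the first merge height for the condensed input is 0.1 (the smallest entry of D), while the square input starts from Euclidean distances between rows of D, which is a different quantity.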

So the upshot is that none of

Z_square = linkage((DF_dissm.values),method="average")
Z_triu = linkage(np.triu(DF_dissm.values),method="average")
Z_tril = linkage(np.tril(DF_dissm.values),method="average")

produces the desired result. Instead use one of the following:

  • np.triu_indices:

    h, w = arr.shape
    Z = linkage(arr[np.triu_indices(h, 1)], method="average")
    
  • spatial.distance.squareform:

    from scipy.spatial import distance as ssd
    Z = linkage(ssd.squareform(arr), method="average")
    
  • or apply spatial.distance.pdist to the original points:

    Z = linkage(ssd.pdist(points), method="average")
    
  • or pass the 2D array of points itself:

    Z = linkage(points, method="average")
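Applied to the question's data, the squareform option might look like the sketch below. It rebuilds the 10x10 test matrix from the values printed in the question's comment (the names sim, dissm and condensed are illustrative, and a plain NumPy array stands in for the DataFrame), converts it to the condensed form, and clusters that.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

# Similarity values copied from the 10x10 test matrix printed in the question.
sim = np.eye(10)
sim[0, 2] = sim[2, 0] = 0.395833
sim[0, 3] = sim[3, 0] = 0.083333
sim[2, 3] = sim[3, 2] = 0.883792

dissm = 1.0 - sim  # dissimilarity matrix, zero on the diagonal

# squareform checks symmetry and the zero diagonal, then returns the
# condensed vector holding the n*(n-1)/2 upper-triangle entries.
condensed = squareform(dissm)
Z = linkage(condensed, method="average")

print(condensed.shape)  # (45,)
print(Z[0])             # first merge: the least dissimilar pair
```

With a DataFrame, the equivalent call would be squareform(DF_dissm.values). The first row of Z should pair items 2 and 3, which have the highest similarity in the test matrix.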
    

import matplotlib.pyplot as plt
import numpy as np
from scipy.cluster import hierarchy as hier
from scipy.spatial import distance as ssd
np.random.seed(2016)

points = np.random.random((10, 2))
arr = ssd.cdist(points, points)

fig, ax = plt.subplots(nrows=4)

ax[0].set_title("condensed upper triangular")
Z = hier.linkage(arr[np.triu_indices(arr.shape[0], 1)], method="average")
hier.dendrogram(Z, ax=ax[0])

ax[1].set_title("squareform")
Z = hier.linkage(ssd.squareform(arr), method="average")
hier.dendrogram(Z, ax=ax[1])

ax[2].set_title("pdist")
Z = hier.linkage(ssd.pdist(points), method="average")
hier.dendrogram(Z, ax=ax[2])

ax[3].set_title("sequence of observations")
Z = hier.linkage(points, method="average")
hier.dendrogram(Z, ax=ax[3])

plt.show()
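The four dendrograms above can also be compared numerically rather than by eye. This is a minimal sketch under the same setup (random 2D points, Euclidean distances); it uses np.random.default_rng instead of np.random.seed, so the points differ from those in the plot, and the variable names are illustrative.

```python
import numpy as np
from scipy.cluster import hierarchy as hier
from scipy.spatial import distance as ssd

rng = np.random.default_rng(2016)
points = rng.random((10, 2))
arr = ssd.cdist(points, points)  # full square distance matrix

# Four ways of handing the same distances to linkage.
Z_triu = hier.linkage(arr[np.triu_indices(arr.shape[0], 1)], method="average")
Z_sq = hier.linkage(ssd.squareform(arr), method="average")
Z_pdist = hier.linkage(ssd.pdist(points), method="average")
Z_points = hier.linkage(points, method="average")

# Compare every linkage matrix against the first one.
all_equal = all(np.allclose(Z_triu, Z) for Z in (Z_sq, Z_pdist, Z_points))
print(all_equal)
```

If all_equal is true, the four inputs describe the same clustering, which is what the identical dendrograms suggest.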