Triangle vs. square distance matrix for hierarchical clustering in Python?
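I've been experimenting with hierarchical clustering. In R it is very simple: hclust(as.dist(X), method="average"). I found a method in Python that is also quite simple, except that I'm a little confused about what is going on with the input distance matrix.
I have a similarity matrix (DF_c93tom, with a smaller test version DF_sim) that I convert into a dissimilarity matrix: DF_dissm = 1 - DF_sim.
I use this as the input to linkage from scipy, but the documentation says it takes either a square or a triangular matrix. I get a different clustering for the lower triangle, the upper triangle, and the square matrix. Why is this? According to the documentation it wants the upper triangle, but the lower-triangle clustering looks really similar.
My question is: why are all the clusterings different, and which one is correct?
This is the documentation for the input distance matrix of linkage:
y : ndarray
A condensed or redundant distance matrix. A condensed distance matrix is a flat array containing the upper triangular of the distance matrix.
Here is my code: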
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import dendrogram, linkage
%matplotlib inline
#Test Data
DF_sim = DF_c93tom.iloc[:10,:10] #Similarity Matrix
DF_sim.columns = DF_sim.index = range(10)
#print(DF_sim)
# 0 1 2 3 4 5 6 7 8 9
# 0 1.000000 0 0.395833 0.083333 0 0 0 0 0 0
# 1 0.000000 1 0.000000 0.000000 0 0 0 0 0 0
# 2 0.395833 0 1.000000 0.883792 0 0 0 0 0 0
# 3 0.083333 0 0.883792 1.000000 0 0 0 0 0 0
# 4 0.000000 0 0.000000 0.000000 1 0 0 0 0 0
# 5 0.000000 0 0.000000 0.000000 0 1 0 0 0 0
# 6 0.000000 0 0.000000 0.000000 0 0 1 0 0 0
# 7 0.000000 0 0.000000 0.000000 0 0 0 1 0 0
# 8 0.000000 0 0.000000 0.000000 0 0 0 0 1 0
# 9 0.000000 0 0.000000 0.000000 0 0 0 0 0 1
#Dissimilarity Matrix
DF_dissm = 1 - DF_sim
#Redundant Matrix
#np.tril(DF_dissm).T == np.triu(DF_dissm)
#True for all values
#Hierarchical Clustering for square and triangle matrices
fig_1 = plt.figure(1)
plt.title("Square")
Z_square = linkage((DF_dissm.values),method="average")
dendrogram(Z_square)
fig_2 = plt.figure(2)
plt.title("Triangle Upper")
Z_triu = linkage(np.triu(DF_dissm.values),method="average")
dendrogram(Z_triu)
fig_3 = plt.figure(3)
plt.title("Triangle Lower")
Z_tril = linkage(np.tril(DF_dissm.values),method="average")
dendrogram(Z_tril)
plt.show()
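When a 2D array is passed as the first argument to scipy.cluster.hierarchy.linkage, it is treated as a sequence of observations, and scipy.spatial.distance.pdist is used to convert it into a sequence of pairwise distances between those observations.
There is a GitHub issue about this behavior, because it means that passing a "distance matrix" such as DF_dissm.values (silently) produces incorrect results.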
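To see this behavior concretely, here is a minimal sketch on a made-up 3x3 distance matrix (D is a hypothetical example, not the asker's data). It suggests that passing the square matrix directly gives the same linkage as clustering the rows of D as if they were points, which is not the same as clustering on the distances stored in D.

```python
import warnings

import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist, squareform

# Hypothetical distance matrix for three points on a line at 0, 1 and 5.
D = np.array([[0.0, 1.0, 5.0],
              [1.0, 0.0, 4.0],
              [5.0, 4.0, 0.0]])

# 2D input: each row of D is treated as an observation in 3-D space.
# SciPy may warn that D looks like a distance matrix; silence it for the demo.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    Z_square = linkage(D, method="average")

# The same thing made explicit: Euclidean distances between the rows of D.
Z_rows = linkage(pdist(D), method="average")

# Intended behavior: cluster on the distances stored in D.
Z_condensed = linkage(squareform(D), method="average")

print(np.allclose(Z_square, Z_rows))      # both cluster the rows of D
print(Z_square[0, 2], Z_condensed[0, 2])  # heights of the first merge
```

In this sketch the first merge happens at height sqrt(3) when the rows are treated as observations, but at height 1.0 (the actual distance between the first two points) when the condensed form is used.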
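So the upshot of this is that none of
Z_square = linkage((DF_dissm.values),method="average")
Z_triu = linkage(np.triu(DF_dissm.values),method="average")
Z_tril = linkage(np.tril(DF_dissm.values),method="average")
produces the desired result: all three inputs are 2D arrays, so each is treated as a set of observations rather than as distances. Instead, use one of the following: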
- take the condensed upper triangle of the square distance matrix arr:
h, w = arr.shape
Z = linkage(arr[np.triu_indices(h, 1)], method="average")
- or convert arr with squareform:
from scipy.spatial import distance as ssd
Z = linkage(ssd.squareform(arr), method="average")
- or apply spatial.distance.pdist to the original points:
Z = linkage(ssd.pdist(points), method="average")
- or pass the 2D array points itself:
Z = linkage(points, method="average")
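The first two options should yield the same condensed vector. A small sketch, using a made-up 4x4 matrix (arr here is a hypothetical example), compares the np.triu_indices route with squareform and checks the round trip back to the square form:

```python
import numpy as np
from scipy.spatial.distance import squareform

# Hypothetical symmetric distance matrix with a zero diagonal.
arr = np.array([[0.0, 1.0, 5.0, 2.0],
                [1.0, 0.0, 4.0, 3.0],
                [5.0, 4.0, 0.0, 6.0],
                [2.0, 3.0, 6.0, 0.0]])

n = arr.shape[0]
via_indices = arr[np.triu_indices(n, 1)]  # strict upper triangle, row by row
via_squareform = squareform(arr)          # square -> condensed

print(via_indices)                                      # n * (n - 1) / 2 values
print(np.array_equal(via_indices, via_squareform))
print(np.array_equal(squareform(via_squareform), arr))  # condensed -> square
```

One practical difference is that squareform validates its input by default (symmetry and a zero diagonal), whereas plain indexing does not.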
import matplotlib.pyplot as plt
import numpy as np
from scipy.cluster import hierarchy as hier
from scipy.spatial import distance as ssd
np.random.seed(2016)
points = np.random.random((10, 2))
arr = ssd.cdist(points, points)
fig, ax = plt.subplots(nrows=4)
ax[0].set_title("condensed upper triangular")
Z = hier.linkage(arr[np.triu_indices(arr.shape[0], 1)], method="average")
hier.dendrogram(Z, ax=ax[0])
ax[1].set_title("squareform")
Z = hier.linkage(ssd.squareform(arr), method="average")
hier.dendrogram(Z, ax=ax[1])
ax[2].set_title("pdist")
Z = hier.linkage(ssd.pdist(points), method="average")
hier.dendrogram(Z, ax=ax[2])
ax[3].set_title("sequence of observations")
Z = hier.linkage(points, method="average")
hier.dendrogram(Z, ax=ax[3])
plt.show()
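Applied to the data in the question, the fix might look like the sketch below. The similarity matrix is rebuilt by hand from the values printed in the question (the names sim, dissim and condensed are placeholders of mine, standing in for DF_sim.values and DF_dissm.values).

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

# Rebuild the 10x10 similarity matrix printed in the question.
sim = np.eye(10)
sim[0, 2] = sim[2, 0] = 0.395833
sim[0, 3] = sim[3, 0] = 0.083333
sim[2, 3] = sim[3, 2] = 0.883792

dissim = 1.0 - sim              # symmetric, zero diagonal
condensed = squareform(dissim)  # flat vector of length 10 * 9 / 2 = 45
Z = linkage(condensed, method="average")

# Each row of Z is: cluster a, cluster b, merge height, size of new cluster.
print(Z[:2])
```

Here the first merge joins observations 2 and 3 at height 1 - 0.883792, the most similar pair in the matrix, and observation 0 then joins that cluster at the average of its two dissimilarities to 2 and 3.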