How to get the gap statistic for hierarchical average-linkage clustering

I am doing a hierarchical cluster analysis based on average linkage. In base R, I use:

dist_mat <- dist(cdata, method = "euclidean")
hclust_avg <- hclust(dist_mat, method = "average")

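I want to compute the gap statistic to determine the optimal number of clusters. I use the 'cluster' library and its clusGap function. Since I can neither pass my hclust solution to clusGap nor specify average-linkage hierarchical clustering in it, I use these lines: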

cluster_fun <- function(x, k) list(cluster = cutree(hclust(dist(x, method = "euclidean"), method="average"), k = k))
gap_stat <- clusGap(cdata, FUN=cluster_fun, K.max=10, B=50)
print(gap_stat)

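But here I can't inspect the cluster solution. So my question is: can I be sure that the gap statistic is computed from the same solution as hclust_avg?

Is there a better way to do this?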

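Yes, it should be the same. Inside clusGap, your cluster_fun is called for every k you provide, and the pooled within-cluster sum of squares around the cluster means is then calculated, as described in the paper. This is the bit of code inside clusGap that calls your custom function: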

W.k <- function(X, kk) {
        clus <- if (kk > 1) 
            FUNcluster(X, kk, ...)$cluster
        else rep.int(1L, nrow(X))
        0.5 * sum(vapply(split(ii, clus), function(I) {
            xs <- X[I, , drop = FALSE]
            sum(dist(xs)^d.power/nrow(xs))
        }, 0))
    }

From there, the gap statistic is calculated.
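For reference, here is a sketch of what the W.k snippet above computes and how the gap is built from it, following the standard definition of the gap statistic. The notation is introduced here and is not from the thread: C_r is the r-th cluster with n_r points, d is the distance returned by dist(), p stands for d.power, and W*_kb is the same dispersion computed on the b-th of B reference datasets.

```latex
W_k = \frac{1}{2} \sum_{r=1}^{k} \frac{1}{n_r}
      \sum_{\substack{i,\, i' \in C_r \\ i < i'}} d_{ii'}^{\,p},
\qquad
\mathrm{Gap}(k) = \frac{1}{B} \sum_{b=1}^{B} \log W_{kb}^{*} - \log W_k
```

Any constant factor in front of W_k cancels in the difference of logs, so it does not affect the gap.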

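You could write some custom code to calculate the gap statistic, but for the sake of reproducibility etc., it might be easier to just use this?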
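If you do want to go the custom route, below is a minimal sketch of what it could look like in Python with SciPy. It is not a port of clusGap: the function names (pooled_dispersion, log_dispersions, gap_statistic) are made up for illustration, the dispersion mirrors the W.k snippet above, and the reference data are simply drawn uniformly over the bounding box of the features, which is an assumption and may differ from what clusGap does.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist


def pooled_dispersion(X, labels, d_power=1):
    """Mirror of W.k: 0.5 * sum over clusters of sum(pairwise dist ** d_power) / n_r."""
    total = 0.0
    for lab in np.unique(labels):
        xs = X[labels == lab]
        if len(xs) > 1:  # a singleton cluster has no pairs and contributes 0
            total += np.sum(pdist(xs) ** d_power) / len(xs)
    return 0.5 * total


def log_dispersions(X, k_max, d_power=1):
    """log(W_k) for k = 1..k_max, cutting one average-linkage tree at each k."""
    Z = linkage(X, method="average", metric="euclidean")
    out = np.empty(k_max)
    for k in range(1, k_max + 1):
        labels = fcluster(Z, t=k, criterion="maxclust")
        out[k - 1] = np.log(pooled_dispersion(X, labels, d_power))
    return out


def gap_statistic(X, k_max=10, n_refs=50, seed=0):
    """Return (gap, se) arrays for k = 1..k_max (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    lo, hi = X.min(axis=0), X.max(axis=0)
    log_w = log_dispersions(X, k_max)
    ref_log_w = np.empty((n_refs, k_max))
    for b in range(n_refs):
        # Assumption: reference set is uniform over the data's bounding box
        ref = rng.uniform(lo, hi, size=X.shape)
        ref_log_w[b] = log_dispersions(ref, k_max)
    gap = ref_log_w.mean(axis=0) - log_w
    # Simulation error: standard deviation (NumPy default, ddof=0) times sqrt(1 + 1/B)
    se = ref_log_w.std(axis=0) * np.sqrt(1 + 1 / n_refs)
    return gap, se
```

Calling gap_statistic(data, k_max=10, n_refs=50) returns one gap value and one standard error per k. The tree is built once per dataset and cut at each k, so the labels for a given k are exactly the ones you would get by cutting that same tree yourself.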

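Thanks for the answer. I must say this is a good enough solution, but you can also try the code given below. Note that it is Python and uses K-means rather than hierarchical clustering.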

# Gap statistic for K-means
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

def optimalK(data, nrefs=3, maxClusters=15):
    """
    Calculates KMeans optimal K using the gap statistic
    Params:
        data: ndarray of shape (n_samples, n_features)
        nrefs: number of sample reference datasets to create
        maxClusters: exclusive upper bound; k runs from 1 to maxClusters - 1
    Returns: (optimalK, resultsdf)
    """
    data = np.asarray(data, dtype=float)
    lo, hi = data.min(axis=0), data.max(axis=0)
    ks = range(1, maxClusters)
    gaps = np.zeros(len(ks))
    for gap_index, k in enumerate(ks):
        # Holder for reference dispersion results
        refDisps = np.zeros(nrefs)
        # For each reference, generate a random sample and run k-means on it
        for i in range(nrefs):
            # Reference set drawn uniformly over the range of each feature
            # (the original used np.random.random_sample, i.e. [0, 1) regardless of the data)
            randomReference = np.random.uniform(lo, hi, size=data.shape)
            km = KMeans(k)
            km.fit(randomReference)
            refDisps[i] = km.inertia_
        # Fit clusters to the original data and get its dispersion
        km = KMeans(k)
        km.fit(data)
        origDisp = km.inertia_
        # Gap statistic: mean of the log reference dispersions minus the log data dispersion
        gaps[gap_index] = np.mean(np.log(refDisps)) - np.log(origDisp)
    # DataFrame.append was removed in pandas 2.0, so build the frame in one go
    resultsdf = pd.DataFrame({'clusterCount': list(ks), 'gap': gaps})
    return (gaps.argmax() + 1, resultsdf)

# cluster_df: your data, shape (n_samples, n_features)
score_g, df = optimalK(cluster_df, nrefs=5, maxClusters=30)
plt.plot(df['clusterCount'], df['gap'], linestyle='--', marker='o', color='b')
plt.xlabel('K')
plt.ylabel('Gap Statistic')
plt.title('Gap Statistic vs. K')
plt.show()
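One note on picking k: the code above takes the k with the largest gap (gaps.argmax() + 1). The rule proposed with the gap statistic by Tibshirani et al. is slightly different: take the smallest k whose gap is at least the next gap minus that gap's standard error. A minimal sketch of that rule follows; the function name first_se_k and the se input (one standard error per k, which the K-means code above does not compute) are assumptions made for illustration.

```python
import numpy as np


def first_se_k(gap, se):
    """Smallest k (1-based) with gap[k] >= gap[k+1] - se[k+1].

    Falls back to the largest k tested if the condition never holds.
    """
    gap = np.asarray(gap, dtype=float)
    se = np.asarray(se, dtype=float)
    for i in range(len(gap) - 1):
        if gap[i] >= gap[i + 1] - se[i + 1]:
            return i + 1
    return len(gap)
```

Compared with a plain argmax, this tends to prefer the smaller k when later gains in the gap are within simulation noise.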