How to compute the Wemmert-Gancarski Index in Python?

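Question:

I am trying to compute the Wemmert-Gancarski index for a given clustering solution in Python.

However, I cannot compute the denominator of the $R(M)$ part of the index (1.2.26), because I can't seem to find a way to compute the minimum distance between an observation and the centroids of the other clusters.

$R(M)$ is the quotient of the distance from a point $M$ to the centroid of the cluster it belongs to and the smallest distance from that point to the centroids of all the other clusters.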
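
Written out, and assuming the common notation in which $G^{\{k\}}$ denotes the barycenter of cluster $C_k$ (these symbols are an assumption and do not appear in the post), the quotient for a point $M \in C_k$ would be:

$$
R(M) = \frac{\lVert M - G^{\{k\}} \rVert}{\min_{k' \neq k} \lVert M - G^{\{k'\}} \rVert}
$$

A value below 1 means that $M$ lies closer to its own barycenter than to any other.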

My attempt:

import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import euclidean_distances, pairwise_distances
from scipy.spatial.distance import pdist, cdist, squareform, euclidean
from sklearn.datasets import load_iris

iris = load_iris()
XIris = iris.data  # all four features
yIris = iris.target
k = len(np.unique(yIris))  # number of clusters (3 for Iris)

inter = []
intra = []
centroidsIn = np.zeros(shape=(k, XIris.shape[1]))
centroidsOut = np.zeros(shape=(k, XIris.shape[1]))
quotient = []
nk = np.bincount(yIris)

for i in range(k):
    inCluster = XIris[yIris == i]
    outCluster = XIris[yIris != i]
    centroidsIn[i] = np.mean(inCluster, axis=0)
    centroidsOut[i] = np.mean(outCluster, axis=0)
    intra.append(cdist(inCluster, centroidsIn, 'euclidean'))
    inter.append(cdist(inCluster, centroidsOut, 'euclidean'))

quotient = np.divide(intra, inter)
print(quotient)

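The true WG index for a k=3 clustering solution of Fisher's Iris dataset is 0.666.

Any hints would be greatly appreciated.

Update:

I managed to work out the whole index, including the original problem above.

It turns out that I did not need to find the denominator on its own; I only needed to compute the matrix of distances from every point to every cluster centroid.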

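Then the smallest distance in each row goes into the numerator and the second smallest into the denominator (this holds as long as each point's nearest centroid is the centroid of its own cluster).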
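
A minimal sketch of that step on toy data, assuming integer labels 0..k-1 (the arrays and variable names here are illustrative and not taken from the post). It extracts the two smallest distances per row with `np.partition` and also checks where this shortcut agrees with the distance to the point's own centroid:

```python
import numpy as np
from scipy.spatial.distance import cdist

# Toy data: two points per cluster; labels are assumed to be integers 0..k-1.
X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
labels = np.array([0, 0, 1, 1])

# One centroid per cluster, in label order.
centroids = np.array([X[labels == c].mean(axis=0) for c in np.unique(labels)])

# dists[i, j] is the distance from point i to centroid j.
dists = cdist(X, centroids, metric="euclidean")

# np.partition places the smallest entry of each row in column 0
# and the second smallest in column 1.
two_smallest = np.partition(dists, 1, axis=1)[:, :2]
nearest, second_nearest = two_smallest[:, 0], two_smallest[:, 1]

# The shortcut matches the definition only where the nearest centroid
# is the centroid of the point's own cluster.
own = dists[np.arange(len(X)), labels]
shortcut_is_exact = np.isclose(own, nearest)

# Quotient R(M) for every point.
R = nearest / second_nearest
print(R, shortcut_is_exact)
```

Any row where `shortcut_is_exact` is False belongs to a point that sits closer to another cluster's centroid, and for such a point the quotient from the definition would be greater than 1.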

After that, computing the WG index is straightforward.
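
For reference, the remaining steps as they are usually written, with $n_k$ the size of cluster $C_k$, $N$ the total number of points and $K$ the number of clusters (this notation is assumed here rather than taken from the post):

$$
J_k = \max\left\{0,\; 1 - \frac{1}{n_k} \sum_{M_i \in C_k} R(M_i)\right\},
\qquad
\mathrm{WG} = \frac{1}{N} \sum_{k=1}^{K} n_k \, J_k
$$

A cluster whose mean quotient exceeds 1 therefore contributes nothing, and the index is the size-weighted mean of the per-cluster scores.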

Supporting code:

import numpy as np
import pandas as pd
from scipy.spatial.distance import cdist


def wemmert_gancarski_index(X, labels, n_clusters=None, min_nc=None):
    """
    The Wemmert-Gancarski Index, a measure of compactness.

    The W-G index is built using the quotients of distances between the points and the barycenters
    of all of the clusters.

    If the mean of the quotient within a cluster is greater than :math:`1`, that cluster
    contributes :math:`0`; the index is the mean of the cluster scores weighted by cluster size.

    **Maximum value** indicates the optimal number of clusters.

    Parameters
    ----------
    X : array-like or dataframe, with **shape:** *(n_samples, n_features)*

        An array / dataframe of observations used to compute the W-G index.

    labels : array-like, with **shape:** *(n_samples,)*

        An array / list of labels represented by integers.

    n_clusters : int, optional

        The number of clusters to compute the index for.

    Returns
    -------
    The Wemmert-Gancarski Index.
    """
    # Checking for valid inputs:
    def check(labels, n_clusters=None, min_nc=None):
        if n_clusters is None and isinstance(labels, (np.ndarray, pd.DataFrame)) and min_nc is None:
            return labels
        elif isinstance(labels, list) and n_clusters is not None and min_nc is not None:
            # get_labels is a method of the author's class that is not shown in this post.
            use_labels = self.get_labels(labels, n_clusters=n_clusters, min_nc=min_nc, need="Single")
            return np.asarray(use_labels)
        else:
            raise ValueError("Please provide either an array of labels (without the other arguments) "
                             "or (a list of labels, K, min_nc)")

    use_labels = np.asarray(check(labels, n_clusters=n_clusters, min_nc=min_nc)).ravel()
    X = np.asarray(X)

    # Calculate the distance between each point and each cluster's centroid:
    def dist_from_centroid(X, labels):
        # One centroid per cluster, in sorted label order. This stands in for the
        # author's centers2 helper, which is not shown in this post.
        centroids = np.array([X[labels == c].mean(axis=0) for c in np.unique(labels)])
        # Get the distance from each point to each centroid:
        return cdist(X, centroids, metric='euclidean')

    dists = dist_from_centroid(X, use_labels)

    # Smallest distance -> numerator, second smallest -> denominator.
    # This assumes every point is closest to the centroid of its own cluster.
    two_smallest = np.sort(dists, axis=1)[:, :2]
    intra, inter = two_smallest[:, 0], two_smallest[:, 1]

    # Compute the quotient of the two distances for every point:
    RM = np.divide(intra, inter)

    # Mean quotient per cluster, grouped by label rather than by row order:
    nk = np.bincount(use_labels)
    present = nk > 0
    mean_RM = np.bincount(use_labels, weights=RM)[present] / nk[present]

    # Clusters whose mean quotient exceeds 1 contribute 0:
    Jk = np.maximum(0, 1 - mean_RM)

    # Mean of the cluster scores, weighted by cluster size:
    WG = np.sum(nk[present] * Jk) / len(use_labels)

    return WG
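
The function above pairs the smallest and second-smallest distances, which agrees with the definition of $R(M)$ only when each point's nearest centroid is its own. As a hedged alternative, here is a minimal sketch that follows the definition directly by taking the distance to the point's own centroid and the minimum over the other centroids. The name `wg_index_sketch` and the toy data are illustrative and not part of the original post; the sketch assumes at least two clusters and that no point coincides with another cluster's centroid.

```python
import numpy as np
from scipy.spatial.distance import cdist


def wg_index_sketch(X, labels):
    """Wemmert-Gancarski index computed from the definition (illustrative sketch)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels).ravel()

    # Map arbitrary labels to 0..k-1 so they can index the centroid columns.
    classes, idx = np.unique(labels, return_inverse=True)
    centroids = np.array([X[idx == c].mean(axis=0) for c in range(len(classes))])
    dists = cdist(X, centroids, metric="euclidean")

    rows = np.arange(len(X))
    own = dists[rows, idx]              # numerator: distance to own centroid
    others = dists.copy()
    others[rows, idx] = np.inf          # mask out the own centroid
    nearest_other = others.min(axis=1)  # denominator: closest other centroid
    R = own / nearest_other

    n_k = np.bincount(idx)                       # cluster sizes
    mean_R = np.bincount(idx, weights=R) / n_k   # mean quotient per cluster
    J = np.maximum(0.0, 1.0 - mean_R)            # clusters with mean > 1 score 0
    return float(np.sum(n_k * J) / len(X))


# Toy usage: two well-separated clusters of two points each.
X_toy = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
print(wg_index_sketch(X_toy, [0, 0, 1, 1]))
```

Calling `wg_index_sketch(XIris, yIris)` with the Iris arrays from the question gives a value that can be compared against the 0.666 figure quoted there.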

As it turned out, the answer only required looking at the quotient from a different angle; everything else fell into place.