How to calculate the Silhouette coefficient for k-medoids clustering using the pyclustering library?

I would like to try the k-medoids clustering method (PAM) on the seeds dataset: https://archive.ics.uci.edu/ml/datasets/seeds

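I don't know whether there are libraries other than pyclustering for this purpose. In any case, how can I use this library to calculate the Silhouette coefficient for a clustering? It does not provide a method for this the way sklearn does for k-means.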

From the documentation, you can use sklearn.metrics.silhouette_score(X, labels, metric='euclidean', sample_size=None, random_state=None, **kwds). This function returns the mean Silhouette Coefficient over all samples. To obtain the value for each sample, use silhouette_samples. I also recommend looking at this vignette, which contains a good example you can test.
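One practical detail when combining the two libraries: pyclustering's get_clusters() returns a list of index lists (one list per cluster), while silhouette_score expects one label per sample. The following is a minimal sketch of that conversion. The helper names clusters_to_labels and mean_silhouette are made up for illustration and belong to neither library, and scikit-learn is imported lazily so the conversion helper can be used on its own.

```python
def clusters_to_labels(clusters, n_samples):
    """Convert pyclustering-style clusters ([[i, j, ...], ...]) to flat labels."""
    labels = [-1] * n_samples  # -1 marks a point that belongs to no cluster
    for cluster_id, members in enumerate(clusters):
        for index in members:
            labels[index] = cluster_id
    return labels


def mean_silhouette(data, clusters, metric="euclidean"):
    """Mean Silhouette coefficient over all samples, computed by scikit-learn."""
    from sklearn.metrics import silhouette_score  # requires scikit-learn

    labels = clusters_to_labels(clusters, len(data))
    return silhouette_score(data, labels, metric=metric)
```

With clusters taken from a k-medoids run, mean_silhouette(sample, clusters) should then give a single number comparable to what sklearn reports for k-means.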

Since version 0.8.2 this is also possible with pyclustering itself. Here is the example from the documentation:

from pyclustering.cluster.center_initializer import kmeans_plusplus_initializer
from pyclustering.cluster.kmeans import kmeans
from pyclustering.cluster.silhouette import silhouette

from pyclustering.samples.definitions import SIMPLE_SAMPLES
from pyclustering.utils import read_sample

# Read data 'SampleSimple3' from Simple Sample collection.
sample = read_sample(SIMPLE_SAMPLES.SAMPLE_SIMPLE3)

# Prepare initial centers
centers = kmeans_plusplus_initializer(sample, 4).initialize()

# Perform cluster analysis
kmeans_instance = kmeans(sample, centers)
kmeans_instance.process()
clusters = kmeans_instance.get_clusters()

# Calculate Silhouette score
score = silhouette(sample, clusters).process().get_score()

For PAM, you need to change the last part:

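...
from pyclustering.cluster.kmedoids import kmedoids

# Prepare initial medoids as indices into the sample
medoids = kmeans_plusplus_initializer(sample, 4).initialize(return_index=True)
kmedoids_instance = kmedoids(sample, medoids)
clusters = kmedoids_instance.process().get_clusters()

# Calculate Silhouette score
score = silhouette(sample, clusters).process().get_score()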
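
Note that, as far as I can tell from the pyclustering documentation, get_score() returns one Silhouette value per point rather than a single number. If that holds for your version, you need to average the values to get something comparable to sklearn's silhouette_score. A small sketch under that assumption (mean_point_score is a hypothetical helper name, not part of pyclustering):

```python
from statistics import fmean


def mean_point_score(point_scores):
    """Average per-point Silhouette values into one overall score."""
    if len(point_scores) == 0:
        raise ValueError("no Silhouette values to average")
    return fmean(point_scores)
```

For example, mean_point_score(score) would summarize the list computed above.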