How to calculate the Silhouette coefficient for k-medoids clustering using the pyclustering library?
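I would like to try the k-medoids clustering method (PAM) on the seeds dataset: https://archive.ics.uci.edu/ml/datasets/seeds
I don't know whether there is any library other than pyclustering for this purpose. In any case, how can I calculate the Silhouette coefficient for a clustering with this library? It does not provide a method for this the way sklearn does for k-means.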
According to the documentation, you can use sklearn.metrics.silhouette_score(X, labels, metric='euclidean', sample_size=None, random_state=None, **kwds). This function returns the mean Silhouette Coefficient over all samples. To obtain the values for each sample, use silhouette_samples. I also recommend looking at this vignette; it contains a good example for you to test.
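To make concrete what these two functions report, here is a minimal pure-Python sketch of the per-sample Silhouette value s(i) = (b - a) / max(a, b), where a is the mean distance from a point to the other members of its own cluster and b is the smallest mean distance to any other cluster. The function name `silhouette_samples_sketch` is made up for illustration; it is not sklearn's implementation, it assumes Euclidean distance and at least two clusters, and it is far slower than the library version.

```python
from math import dist  # Euclidean distance, Python 3.8+


def silhouette_samples_sketch(points, labels):
    """Return one Silhouette value per point (illustrative helper, not a library API)."""
    cluster_ids = sorted(set(labels))
    scores = []
    for i, p in enumerate(points):
        # Mean distance from point i to the members of each cluster (excluding itself).
        mean_dist = {}
        for c in cluster_ids:
            ds = [dist(p, q) for j, q in enumerate(points)
                  if labels[j] == c and j != i]
            mean_dist[c] = sum(ds) / len(ds) if ds else None

        a = mean_dist[labels[i]]
        if a is None:
            # Point is alone in its cluster: score 0 by the usual convention.
            scores.append(0.0)
            continue

        # Nearest other cluster, measured by mean distance.
        b = min(d for c, d in mean_dist.items() if c != labels[i])
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return scores


# Two tight, well-separated groups of two points each.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
labels = [0, 0, 1, 1]
scores = silhouette_samples_sketch(points, labels)
mean_score = sum(scores) / len(scores)
```

The list `scores` plays the role of what silhouette_samples returns, and `mean_score` the role of what silhouette_score returns; values close to 1 indicate well-separated clusters.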
Since version 0.8.2 this is also possible with pyclustering; here is the example from the documentation:
from pyclustering.cluster.center_initializer import kmeans_plusplus_initializer
from pyclustering.cluster.kmeans import kmeans
from pyclustering.cluster.silhouette import silhouette
from pyclustering.samples.definitions import SIMPLE_SAMPLES
from pyclustering.utils import read_sample
# Read data 'SampleSimple3' from Simple Sample collection.
sample = read_sample(SIMPLE_SAMPLES.SAMPLE_SIMPLE3)
# Prepare initial centers
centers = kmeans_plusplus_initializer(sample, 4).initialize()
# Perform cluster analysis
kmeans_instance = kmeans(sample, centers)
kmeans_instance.process()
clusters = kmeans_instance.get_clusters()
# Calculate Silhouette scores (get_score() returns one value per point)
score = silhouette(sample, clusters).process().get_score()
For PAM, you need to change the last part:
...
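from pyclustering.cluster.kmedoids import kmedoids
# Pick initial medoids as indices into the sample, not as coordinates
medoids = kmeans_plusplus_initializer(sample, 4).initialize(return_index=True)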
kmedoids_instance = kmedoids(sample, medoids)
clusters = kmedoids_instance.process().get_clusters()
score = silhouette(sample, clusters).process().get_score()
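If you would rather score a pyclustering result with sklearn, note that get_clusters() returns a list of clusters, each a list of point indices, whereas sklearn expects one label per sample. A minimal sketch of the conversion follows; `clusters_to_labels` is a hypothetical helper name, and the hard-coded `clusters` value only stands in for a real get_clusters() result.

```python
def clusters_to_labels(clusters, n_samples):
    """Turn [[indices of cluster 0], [indices of cluster 1], ...] into a flat label list.

    Illustrative helper, not part of pyclustering or sklearn.
    Points that appear in no cluster keep the label -1.
    """
    labels = [-1] * n_samples
    for cluster_id, members in enumerate(clusters):
        for index in members:
            labels[index] = cluster_id
    return labels


# Stand-in for kmedoids_instance.get_clusters() on a 6-point sample.
clusters = [[0, 2, 5], [1, 3, 4]]
labels = clusters_to_labels(clusters, n_samples=6)
```

With real data, the resulting list can be passed as the labels argument, for example sklearn.metrics.silhouette_score(sample, labels), to get a single mean score.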