sklearn agglomerative clustering: dynamically updating the number of clusters
The documentation for sklearn.cluster.AgglomerativeClustering mentions that,

when varying the number of clusters and using caching, it may be advantageous to compute the full tree.

This seems to imply that one could compute the full tree once and then quickly update the desired number of clusters as needed, without recomputing the tree (using the cache). However, the procedure for changing the number of clusters appears to be undocumented. I would like to do this but am not sure how to proceed.
Update: to clarify, the fit method does not take the number of clusters as an input:
http://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html#sklearn.cluster.AgglomerativeClustering.fit
You set up a cache directory with the parameter memory='mycachedir', and then, if you set compute_full_tree=True, rerunning fit with different values of n_clusters will reuse the cached tree instead of recomputing it every time. An example of how to do this with sklearn's grid search API:
from sklearn.cluster import AgglomerativeClustering
from sklearn.model_selection import GridSearchCV  # sklearn.grid_search was removed in 0.20

# Cache the computed tree in 'mycachedir' so that refits with a different
# n_clusters reuse it instead of rebuilding it.
ac = AgglomerativeClustering(memory='mycachedir',
                             compute_full_tree=True)
classifier = GridSearchCV(ac,
                          {'n_clusters': range(2, 6)},
                          scoring='adjusted_rand_score',
                          n_jobs=-1, verbose=2)
classifier.fit(X, y)  # X = data matrix, y = ground-truth labels for the score
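If you have no ground-truth labels to score against (adjusted_rand_score is a supervised metric, and AgglomerativeClustering has no predict method for GridSearchCV's scorer to call), the cached tree can also be reused with a plain loop. A minimal sketch, assuming X is your data matrix:

import numpy as np
from sklearn.cluster import AgglomerativeClustering

# The first fit computes and caches the full tree; subsequent fits with a
# different n_clusters only re-cut the cached tree.
ac = AgglomerativeClustering(memory='mycachedir', compute_full_tree=True)
for k in range(2, 6):
    labels = ac.set_params(n_clusters=k).fit_predict(X)
    print(k, np.bincount(labels))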
I know this is an old question, but the solution below might help:
# scores = input matrix
from scipy.cluster.hierarchy import linkage
from scipy.cluster.hierarchy import cut_tree
from sklearn.metrics import silhouette_score
from sklearn.metrics.pairwise import euclidean_distances

# Build the full merge tree once; precompute pairwise distances for scoring
linkage_mat = linkage(scores, method="ward")
euc_scores = euclidean_distances(scores)
n_l = 2
n_h = scores.shape[0]
silh_score = -2

# Selecting the best number of clusters based on the silhouette score
for i in range(n_l, n_h):
    local_labels = list(cut_tree(linkage_mat, n_clusters=i).flatten())
    sc = silhouette_score(
        euc_scores,
        metric="precomputed",
        labels=local_labels,
        random_state=42)
    if silh_score < sc:
        silh_score = sc
        labels = local_labels

n_clusters = len(set(labels))
print(f"Optimal number of clusters: {n_clusters}")
print(f"Best silhouette score: {silh_score}")
# ...
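Note that this scipy route gives exactly the behavior asked about in the question: linkage builds the full merge tree once, and cut_tree then extracts a labeling for any n_clusters from that same tree, so the loop over candidate cluster counts never recomputes the hierarchy.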