Clustering with DBSCAN: How to train a model if you don't set the number of clusters in advance?
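I am using sklearn's built-in iris dataset for clustering. In KMeans I set the number of clusters in advance, but that is not the case for DBSCAN. How do I train a model without setting the number of clusters beforehand?
I tried: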
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
#%matplotlib inline
from sklearn.cluster import DBSCAN,MeanShift
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split,KFold,cross_val_score
from sklearn.metrics import accuracy_score,confusion_matrix
iris = load_iris()
X = iris.data
y = iris.target
dbscan = DBSCAN(eps=0.3,min_samples=10)
dbscan.fit(X,y)
I'm stuck!
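One of the advantages of DBSCAN over KMeans is that you do not need to specify the number of clusters as a hyperparameter. The most important parameter in DBSCAN is epsilon (eps), which directly affects the final number of clusters.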
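To see how epsilon drives the cluster count, here is a minimal sketch that refits DBSCAN on the standardized iris data for a few eps values and counts the resulting clusters and noise points. The helper name count_clusters and the particular eps grid are illustrative choices, not something taken from the original post.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler


def count_clusters(labels):
    """Return (n_clusters, n_noise); DBSCAN marks noise with the label -1."""
    labels = np.asarray(labels)
    n_clusters = len(set(labels.tolist()) - {-1})
    n_noise = int(np.sum(labels == -1))
    return n_clusters, n_noise


X, _ = load_iris(return_X_y=True)
X = StandardScaler().fit_transform(X)

# Hypothetical grid of eps values, chosen only for illustration.
results = {}
for eps in (0.3, 0.5, 0.8, 1.2):
    labels = DBSCAN(eps=eps, min_samples=5).fit(X).labels_
    results[eps] = count_clusters(labels)
    print("eps=%.1f -> clusters: %d, noise points: %d" % ((eps,) + results[eps]))
```

The exact counts depend on the scaling and on min_samples, so treat the printed numbers as something to inspect rather than as fixed properties of the dataset.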
DBSCAN is a clustering algorithm and, as such, it does not use the labels y. It is true that you can call its fit method as .fit(X, y) but, according to the docs:
y: Ignored
Not used, present here for API consistency by convention.
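A quick way to convince yourself that y is ignored is to fit twice, once with and once without the labels, and compare the resulting cluster assignments. This is only a sketch; the variable names are made up for the illustration, and the eps and min_samples values are simply the library defaults written out.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

# Same parameters in both runs; the only difference is whether y is passed.
labels_without_y = DBSCAN(eps=0.5, min_samples=5).fit(X).labels_
labels_with_y = DBSCAN(eps=0.5, min_samples=5).fit(X, y).labels_

same = np.array_equal(labels_without_y, labels_with_y)
print("Same cluster labels with and without y:", same)
```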
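Another characteristic of DBSCAN is that, in contrast to algorithms such as KMeans, it does not take the number of clusters as an input; instead, it estimates that number by itself.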
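The contrast can be sketched side by side: KMeans is told how many clusters to produce, while for DBSCAN the count is read off the fitted labels afterwards. The parameter values below (n_clusters=3, n_init=10, random_state=0, eps=0.5, min_samples=5) are assumptions picked for the illustration.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import load_iris

X, _ = load_iris(return_X_y=True)

# KMeans: the number of clusters is an input to the algorithm.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
n_kmeans = len(set(kmeans.labels_.tolist()))

# DBSCAN: there is no n_clusters parameter; the count is derived from the
# fitted labels, leaving out the noise label -1.
dbscan = DBSCAN(eps=0.5, min_samples=5).fit(X)
n_dbscan = len(set(dbscan.labels_.tolist()) - {-1})

print("KMeans clusters (requested):", n_kmeans)
print("DBSCAN clusters (estimated):", n_dbscan)
print("DBSCAN has an n_clusters parameter:", "n_clusters" in dbscan.get_params())
```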
Having clarified that, let's adapt the documentation demo to the iris data:
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn import metrics
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
X, labels_true = load_iris(return_X_y=True)
X = StandardScaler().fit_transform(X)
# Compute DBSCAN
db = DBSCAN(eps=0.5,min_samples=5) # default parameter values
db.fit(X)
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_
# Number of clusters in labels, ignoring noise if present.
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
n_noise_ = list(labels).count(-1)
print('Estimated number of clusters: %d' % n_clusters_)
print('Estimated number of noise points: %d' % n_noise_)
print("Homogeneity: %0.3f" % metrics.homogeneity_score(labels_true, labels))
print("Completeness: %0.3f" % metrics.completeness_score(labels_true, labels))
print("V-measure: %0.3f" % metrics.v_measure_score(labels_true, labels))
print("Adjusted Rand Index: %0.3f"
      % metrics.adjusted_rand_score(labels_true, labels))
print("Adjusted Mutual Information: %0.3f"
      % metrics.adjusted_mutual_info_score(labels_true, labels))
print("Silhouette Coefficient: %0.3f"
      % metrics.silhouette_score(X, labels))
Result:
Estimated number of clusters: 2
Estimated number of noise points: 17
Homogeneity: 0.560
Completeness: 0.657
V-measure: 0.604
Adjusted Rand Index: 0.521
Adjusted Mutual Information: 0.599
Silhouette Coefficient: 0.486
Let's plot them:
# Plot result
import matplotlib.pyplot as plt
# Black removed and is used for noise instead.
unique_labels = set(labels)
colors = [plt.cm.Spectral(each)
          for each in np.linspace(0, 1, len(unique_labels))]
for k, col in zip(unique_labels, colors):
    if k == -1:
        # Black used for noise.
        col = [0, 0, 0, 1]
    class_member_mask = (labels == k)
    # Core samples of this cluster: large markers.
    xy = X[class_member_mask & core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),
             markeredgecolor='k', markersize=14)
    # Non-core (border or noise) samples: small markers.
    xy = X[class_member_mask & ~core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),
             markeredgecolor='k', markersize=6)
plt.title('Estimated number of clusters: %d' % n_clusters_)
plt.show()
That's it.
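As with all clustering algorithms, the usual notions of supervised learning, such as train/test splits, predicting with unseen data, cross-validation, etc., do not apply here. Such unsupervised methods can be useful in initial exploratory data analysis (EDA), to give us a general idea of our data, but, as you may have noticed already, the results of such an analysis are not necessarily useful for supervised problems: here, despite the existence of 3 labels in our iris dataset, the algorithm uncovered only 2 clusters.
... which of course may change, depending on the model parameters. Experiment...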
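One concrete consequence of the point about predicting on unseen data: in scikit-learn, DBSCAN offers fit and fit_predict but no separate predict method, so cluster labels exist only for the samples that were passed to fit. The sketch below checks this; the variable names are illustrative and the parameters are the ones used in the demo above.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)
X = StandardScaler().fit_transform(X)

db = DBSCAN(eps=0.5, min_samples=5)

# There is no predict() for new samples; fit_predict() labels the fitted data.
has_predict = hasattr(db, "predict")
labels = db.fit_predict(X)

# fit_predict returns the same labels that fit stores in labels_,
# one label per input sample.
matches_fit = np.array_equal(labels, db.labels_)
one_label_per_sample = labels.shape == (X.shape[0],)

print("DBSCAN has predict():", has_predict)
print("fit_predict matches labels_:", matches_fit)
```

If labels for new points are needed, that is a separate modeling decision (for example, training a classifier on the cluster labels), not something DBSCAN provides directly.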