Python unsupervised learning: DBSCAN scikit-learn application example

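I have the following list. I want to perform unsupervised learning on it and use what is learned to predict a value for each item in a test list.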

#Format [real_runtime, processors, requested_time, score, more_to_be_added]
#some entries from the list

Training dataset

Xsrc = [['354', '2048', '3600', '53.0521472395'], 
      ['605', '2048', '600', '54.8768871369'], 
      ['128', '2048', '600', '51.0'], 
      ['136', '2048', '900', '51.0000000563'], 
      ['19218', '480', '21600', '51.0'], 
      ['15884', '2048', '18000', '51.0'], 
      ['118', '2048', '1500', '51.0'], 
      ['103', '2048', '2100', '51.0000002839'], 
      ['18542', '480', '21600', '51.0000000001'], 
      ['13272', '2048', '18000', '51.0000000001']]

Test dataset

Using the clusters, I want to predict real_runtime (marked as -1 below) for a new list:

Xtest = [['-1', '2048', '1500', '51.0000000161'],
         ['-1', '2048', '10800', '51.0000000002'],
         ['-1', '512', '21600', '-1'],
         ['-1', '512', '2700', '51.0000000004'],
         ['-1', '1024', '21600', '51.1042617556']]

Code: formatting the list, building the clusters with scikit-learn in Python, and plotting them

from sklearn.feature_selection import VarianceThreshold
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn import metrics
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt

##Training dataset
Xsrc = [['354', '2048', '3600', '53.0521472395'], 
      ['605', '2048', '600', '54.8768871369'], 
      ['128', '2048', '600', '51.0'], 
      ['136', '2048', '900', '51.0000000563'], 
      ['19218', '480', '21600', '51.0'], 
      ['15884', '2048', '18000', '51.0'], 
      ['118', '2048', '1500', '51.0'], 
      ['103', '2048', '2100', '51.0000002839'], 
      ['18542', '480', '21600', '51.0000000001'], 
      ['13272', '2048', '18000', '51.0000000001']]

print("Xsrc:", Xsrc)

##Test data set
Xtest= [['1224', '2048', '1500', '51.0000000161'],
       ['7867', '2048', '10800', '51.0000000002'],
       ['21594', '512', '21600', '-1'], 
       ['1760', '512', '2700', '51.0000000004'],
       ['115', '1024', '21600', '51.1042617556']]


##Clustering 
X = StandardScaler().fit_transform(np.array(Xsrc, dtype=float)) #convert the string entries to floats before scaling
db = DBSCAN(min_samples=2).fit(X) #no clustering parameter, such as default eps
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
clusters = [X[labels == i] for i in range(n_clusters_)]

print('Estimated number of clusters: %d' % n_clusters_)
print("Silhouette Coefficient: %0.3f" % metrics.silhouette_score(X, labels))


##Plotting the dataset
unique_labels = set(labels)
colors = plt.cm.Spectral(np.linspace(0, 1, len(unique_labels)))
for k, col in zip(unique_labels, colors):
    if k == -1:
        # Black used for noise.
        col = 'k'

    class_member_mask = (labels == k)

    xy = X[class_member_mask & core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=col,
             markeredgecolor='k', markersize=20)

    xy = X[class_member_mask & ~core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=col,
             markeredgecolor='k', markersize=10)


plt.title('Estimated number of clusters: %d' % n_clusters_)
plt.show()

Any ideas on how to use the clustering to predict the values?

Clustering is not prediction

"Predicting" a cluster label is of little use, because the label was just assigned "randomly" by the clustering algorithm.

Even worse: most algorithms cannot incorporate new data.
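As a small illustration of that last point, here is a minimal sketch, assuming a reasonably recent scikit-learn. DBSCAN exposes no predict method, so new points cannot be assigned to existing clusters without refitting. KMeans is used only as a contrast (my choice, it is not part of the question) because it does offer one.

```python
from sklearn.cluster import DBSCAN, KMeans

# DBSCAN only offers fit / fit_predict on the data it is given;
# there is no method for labelling unseen points afterwards.
dbscan_can_predict = hasattr(DBSCAN(min_samples=2), "predict")

# Centroid-based KMeans can assign new points to the nearest center.
kmeans_can_predict = hasattr(KMeans(n_clusters=2), "predict")

print("DBSCAN has predict:", dbscan_can_predict)
print("KMeans has predict:", kmeans_can_predict)
```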

You really should use clustering to explore your data, and to learn what is in there and what is not. Don't rely on the clustering being "good".

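Sometimes people have had success quantizing the data set into k centers and then using only this "compressed" data set for classification/prediction (usually based on the nearest neighbor only). I have also seen the idea of training one regression per cluster for prediction, and using nearest neighbors to choose which regressor to apply (i.e., if the data fits a cluster well, use that cluster's regression model). But I don't remember any major success stories...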
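The quantization idea could look roughly like the sketch below. It is only an illustration under my own assumptions, not a recommended model: KMeans with 3 centers stands in for the quantizer (the choice of k is arbitrary), clustering uses only the columns known at prediction time (processors, requested_time, score), and each center is given the mean real_runtime of its training members. The test row whose score is -1 is left out because it has no valid score.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Training rows from the question: real_runtime, processors, requested_time, score
train = np.array([
    [354, 2048, 3600, 53.0521472395],
    [605, 2048, 600, 54.8768871369],
    [128, 2048, 600, 51.0],
    [136, 2048, 900, 51.0000000563],
    [19218, 480, 21600, 51.0],
    [15884, 2048, 18000, 51.0],
    [118, 2048, 1500, 51.0],
    [103, 2048, 2100, 51.0000002839],
    [18542, 480, 21600, 51.0000000001],
    [13272, 2048, 18000, 51.0000000001],
])
runtime = train[:, 0]    # target we want to predict
features = train[:, 1:]  # columns that are known for new jobs

# Quantize the scaled feature space into k centers (k=3 is an assumption).
scaler = StandardScaler().fit(features)
km = KMeans(n_clusters=3, n_init=10, random_state=0)
km.fit(scaler.transform(features))

# "Compressed" data set: one mean runtime per center.
center_runtime = np.array(
    [runtime[km.labels_ == c].mean() for c in range(km.n_clusters)]
)

# Test rows from the question, without the unknown real_runtime column.
test_features = np.array([
    [2048, 1500, 51.0000000161],
    [2048, 10800, 51.0000000002],
    [512, 2700, 51.0000000004],
    [1024, 21600, 51.1042617556],
])

# Predict by looking up the runtime of the nearest center.
predicted = center_runtime[km.predict(scaler.transform(test_features))]
print(predicted)
```

With only ten training rows the estimates are very coarse. The per-cluster regression variant would replace the per-center mean with a small regressor fitted on each cluster's members.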