How to get meaningful results of kmeans in scikit-learn

I have a dataset that looks like this:

{'dns_query_count': '11', 'http_hostnames_count': '7', 'dest_port_count': '3', 'ip_count': '11', 'signature_count': '0', 'src_ip': '10.0.64.42', 'http_user_agent_count': '2'}

This has already been converted from CSV to a dict.

Then I use DictVectorizer to transform it:

from sklearn.feature_extraction import DictVectorizer
vec = DictVectorizer()
d = vec.fit_transform(data).toarray()

Then I try to run KMeans on it:

from sklearn.cluster import KMeans
k = KMeans(n_clusters=2).fit(d)

My question is: how do I find out which row of my data belongs to which cluster?

I would like to get something like this:

{'dns_query_count': '11', 'http_hostnames_count': '7', 'dest_port_count': '3', 'ip_count': '11', 'signature_count': '0', 'src_ip': '10.0.64.42', 'http_user_agent_count': '2', 'cluster': '1'}

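Can anyone give me a step-by-step example of how to go from the raw data I showed to the same data with information about which cluster each row belongs to?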

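For example, I ran Weka on this dataset and it showed me what I want: I can click a data point on the plot and read exactly which data points belong to which cluster. How can I get similar results with sklearn?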

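This shows how to retrieve the cluster ID for each row, as well as the cluster centers. I also measure the distance from each row to each centroid, so you can check that the rows are assigned to the correct cluster.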

In [1]:

import pandas as pd
from sklearn.cluster import KMeans
from numpy.random import random
from scipy.spatial.distance import euclidean

# I'm going to generate some random data so you can just copy this and see it work

random_data = []

for i in range(10):
    random_data.append({'dns_query_count': random(),
                        'http_hostnames_count': random(),
                        'dest_port_count': random(),
                        'ip_count': random(),
                        'signature_count': random(),
                        'src_ip': random(),
                        'http_user_agent_count': random()})

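df = pd.DataFrame(random_data)

# remember the feature columns (and their order) before adding any result columns
feature_cols = list(df.columns)

km = KMeans(n_clusters=2).fit(df[feature_cols])

# km.labels_[i] is the cluster id assigned to row i
df['cluster_id'] = km.labels_

# get the cluster centers and compute the distance from each point to the center
# this will show that all points are assigned to the correct cluster

def distance_to_centroid(row, centroid):
    # select the features in the same column order the model was fit on;
    # the centroid coordinates follow that order, so a different order
    # would compare mismatched features
    row = row[feature_cols]
    return euclidean(row, centroid)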

# to get the cluster centers use km.cluster_centers_

df['distance_to_center0'] = df.apply(
    lambda r: distance_to_centroid(r, km.cluster_centers_[0]), axis=1)

df['distance_to_center1'] = df.apply(
    lambda r: distance_to_centroid(r, km.cluster_centers_[1]), axis=1)

df.head()

Out [1]:
   dest_port_count  dns_query_count  http_hostnames_count  \
0         0.516920         0.135925              0.090209   
1         0.528907         0.898578              0.752862   
2         0.426108         0.604251              0.524905   
3         0.373985         0.606492              0.503487   
4         0.319943         0.970707              0.707207   

   http_user_agent_count  ip_count  signature_count    src_ip  cluster_id  \
0               0.987878  0.808556         0.860859  0.642014           0   
1               0.417033  0.130365         0.067021  0.322509           1   
2               0.528679  0.216118         0.041491  0.522445           1   
3               0.780292  0.130404         0.048353  0.911599           1   
4               0.156117  0.719902         0.484865  0.752840           1   

   distance_to_center0  distance_to_center1  
0             0.846099             1.124509  
1             1.175765             0.760310  
2             0.970046             0.615725  
3             1.054555             0.946233  
4             0.640906             1.020849  
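
As a cross-check on the per-row distance calculation, KMeans also has a transform method that returns the distance from every row to every centroid in one call. The sketch below is not part of the original answer: the fixed seed, the n_init and random_state arguments, and the reduced column list are assumptions made so that it runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Assumption: random data with a fixed seed, standing in for the real counts
rng = np.random.default_rng(0)
cols = ['dns_query_count', 'http_hostnames_count', 'dest_port_count',
        'ip_count', 'signature_count', 'http_user_agent_count']
df = pd.DataFrame(rng.random((10, len(cols))), columns=cols)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(df[cols])

# transform() gives an (n_samples, n_clusters) array of distances to each centroid
distances = km.transform(df[cols])

df['cluster_id'] = km.labels_
df['distance_to_center0'] = distances[:, 0]
df['distance_to_center1'] = distances[:, 1]

# index of the nearest centroid for each row, to compare against the labels
nearest = distances.argmin(axis=1)
print((nearest == km.labels_).all())
```

Comparing nearest with km.labels_ is a quick way to confirm that each row carries the ID of its closest centroid, without having to keep the column order in sync by hand.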

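See also KMeans.fit_predict, which computes the cluster centers and returns the cluster index of each row in a single call:
http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html#sklearn.cluster.KMeans.fit_predict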
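
To go from the question's list of dicts to the same dicts with a cluster field, something along these lines may work. It is a sketch under assumptions: only the first row comes from the question and the other three are made up for illustration, src_ip is treated as an identifier rather than a feature, and the counts are converted to numbers first, because DictVectorizer one-hot encodes string values instead of reading them as numbers.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction import DictVectorizer

# First row is from the question; the remaining rows are hypothetical
data = [
    {'dns_query_count': '11', 'http_hostnames_count': '7', 'dest_port_count': '3',
     'ip_count': '11', 'signature_count': '0', 'src_ip': '10.0.64.42',
     'http_user_agent_count': '2'},
    {'dns_query_count': '12', 'http_hostnames_count': '8', 'dest_port_count': '3',
     'ip_count': '10', 'signature_count': '0', 'src_ip': '10.0.64.43',
     'http_user_agent_count': '2'},
    {'dns_query_count': '200', 'http_hostnames_count': '150', 'dest_port_count': '40',
     'ip_count': '180', 'signature_count': '5', 'src_ip': '10.0.64.44',
     'http_user_agent_count': '30'},
    {'dns_query_count': '210', 'http_hostnames_count': '140', 'dest_port_count': '45',
     'ip_count': '190', 'signature_count': '6', 'src_ip': '10.0.64.45',
     'http_user_agent_count': '28'},
]

# Assumption: drop src_ip and convert the string counts to floats,
# so that each count becomes one numeric feature
features = [{k: float(v) for k, v in row.items() if k != 'src_ip'}
            for row in data]

vec = DictVectorizer(sparse=False)
X = vec.fit_transform(features)

# fit_predict returns one cluster index per row, in the same order as X
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# attach the cluster index to a copy of each original row
result = [dict(row, cluster=int(label)) for row, label in zip(data, labels)]

for row in result:
    print(row)
```

Row order is preserved through fit_transform and fit_predict, which is why zipping the labels back onto the original rows is enough. Which group ends up as cluster 0 or 1 is arbitrary.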