编辑:K 表示聚类并找到最接近质心的点
Edited: K means clustering and finding points closest to the centroid
我正在尝试根据以下列中的信息将 k 均值应用于聚类演员
Actors Movies TvGuest Awards Shorts Special LiveShows
Robert De Niro 111 2 6 0 0 0
Jack Nicholson 70 2 4 0 5 0
Marlon Brando 64 2 5 0 0 28
Denzel Washington 25 2 3 24 0 0
Katharine Hepburn 90 1 2 0 0 0
Humphrey Bogart 105 2 1 0 0 52
Meryl Streep 27 2 2 5 0 0
Daniel Day-Lewis 90 2 1 0 71 22
Sidney Poitier 63 2 3 0 0 0
Clark Gable 34 2 4 0 3 0
Ingrid Bergman 22 2 2 3 0 4
Tom Hanks 82 11 6 21 11 22
#began by scaling my data
X = StandardScaler().fit_transform(data)
#used an elbow plot to find optimal k value
sum_of_squared_distances = []
K = range(1,15)
for k in K:
k_means = KMeans(n_clusters=k)
model = k_means.fit(X)
sum_of_squared_distances.append(k_means.inertia_)
plt.plot(K, sum_of_squared_distances, 'bx-')
plt.show()
#found yhat for the calculated k value
kmeans = KMeans(n_clusters=3)
model = kmeans.fit(X)
yhat = kmeans.predict(X)
无法计算演员创建散点图。
编辑:
如果质心也使用
绘制,有没有办法找到哪些演员最接近质心
centers = kmeans.cluster_centers_(这里的kmeans参考下面Eric的解法)
plt.scatter(centers[:,0],centers[:,1],color='purple',marker='*',label='centroid')
K 表示聚类 Pandas - 散点图
#!/usr/bin/python3
# -*- coding: utf-8 -*-
import pandas as pd
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
df = pd.DataFrame(columns=['Actors', 'Movies', 'TvGuest', "Awards", "Shorts"])
df.loc[0] = ["Robert De Niro", 111, 2, 6, 0]
df.loc[1] = ["Jack Nicholson", 70, 2, 4, 0]
df.loc[2] = ["Marlon Brando", 64, 4, 5, 0]
df.loc[3] = ["Denzel Washington", 25, 2, 3, 24]
df.loc[4] = ["Katharine Hepburn", 90, 1, 2, 0]
df.loc[5] = ["Humphrey Bogart", 105, 2, 1, 0]
df.loc[6] = ["Meryl Streep", 27, 3, 2, 5]
df.loc[7] = ["Daniel Day-Lewis", 90, 2, 1, 0]
df.loc[8] = ["Sidney Poitier", 63, 2, 3, 0]
df.loc[9] = ["Clark Gable", 34, 2, 4, 0]
df.loc[10] = ["Ingrid Bergman", 22, 5, 2, 3]
kmeans = KMeans(n_clusters=4)
y = kmeans.fit_predict(df[['Movies', 'TvGuest', 'Awards']])
df['Cluster'] = y
plt.scatter(df.Movies, df.TvGuest, c=df.Cluster, alpha = 0.6)
plt.title('K-means Clustering 2 dimensions and 4 clusters')
plt.show()
显示:
注意二维散点图上显示的数据点是 Movies
和 TvGuest
,但是 Kmeans 拟合给出了 3 个变量:Movies
、TvGuest
、Awards
。想象一下,屏幕上有一个额外的维度,用于计算集群的成员资格。
源链接:
https://datasciencelab.wordpress.com/2013/12/12/clustering-with-k-means-in-python/
https://towardsdatascience.com/visualizing-clusters-with-pythons-matplolib-35ae03d87489
我正在尝试根据以下列中的信息将 k 均值应用于聚类演员
Actors Movies TvGuest Awards Shorts Special LiveShows
Robert De Niro 111 2 6 0 0 0
Jack Nicholson 70 2 4 0 5 0
Marlon Brando 64 2 5 0 0 28
Denzel Washington 25 2 3 24 0 0
Katharine Hepburn 90 1 2 0 0 0
Humphrey Bogart 105 2 1 0 0 52
Meryl Streep 27 2 2 5 0 0
Daniel Day-Lewis 90 2 1 0 71 22
Sidney Poitier 63 2 3 0 0 0
Clark Gable 34 2 4 0 3 0
Ingrid Bergman 22 2 2 3 0 4
Tom Hanks 82 11 6 21 11 22
#began by scaling my data
X = StandardScaler().fit_transform(data)
#used an elbow plot to find optimal k value
sum_of_squared_distances = []
K = range(1,15)
for k in K:
k_means = KMeans(n_clusters=k)
model = k_means.fit(X)
sum_of_squared_distances.append(k_means.inertia_)
plt.plot(K, sum_of_squared_distances, 'bx-')
plt.show()
#found yhat for the calculated k value
kmeans = KMeans(n_clusters=3)
model = kmeans.fit(X)
yhat = kmeans.predict(X)
无法计算演员创建散点图。
编辑: 如果质心也使用
绘制,有没有办法找到哪些演员最接近质心centers = kmeans.cluster_centers_(这里的kmeans参考下面Eric的解法)
plt.scatter(centers[:,0],centers[:,1],color='purple',marker='*',label='centroid')
K 表示聚类 Pandas - 散点图
#!/usr/bin/python3
# -*- coding: utf-8 -*-
import pandas as pd
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
df = pd.DataFrame(columns=['Actors', 'Movies', 'TvGuest', "Awards", "Shorts"])
df.loc[0] = ["Robert De Niro", 111, 2, 6, 0]
df.loc[1] = ["Jack Nicholson", 70, 2, 4, 0]
df.loc[2] = ["Marlon Brando", 64, 4, 5, 0]
df.loc[3] = ["Denzel Washington", 25, 2, 3, 24]
df.loc[4] = ["Katharine Hepburn", 90, 1, 2, 0]
df.loc[5] = ["Humphrey Bogart", 105, 2, 1, 0]
df.loc[6] = ["Meryl Streep", 27, 3, 2, 5]
df.loc[7] = ["Daniel Day-Lewis", 90, 2, 1, 0]
df.loc[8] = ["Sidney Poitier", 63, 2, 3, 0]
df.loc[9] = ["Clark Gable", 34, 2, 4, 0]
df.loc[10] = ["Ingrid Bergman", 22, 5, 2, 3]
kmeans = KMeans(n_clusters=4)
y = kmeans.fit_predict(df[['Movies', 'TvGuest', 'Awards']])
df['Cluster'] = y
plt.scatter(df.Movies, df.TvGuest, c=df.Cluster, alpha = 0.6)
plt.title('K-means Clustering 2 dimensions and 4 clusters')
plt.show()
显示:
注意二维散点图上显示的数据点是 Movies
和 TvGuest
,但是 Kmeans 拟合给出了 3 个变量:Movies
、TvGuest
、Awards
。想象一下,屏幕上有一个额外的维度,用于计算集群的成员资格。
源链接:
https://datasciencelab.wordpress.com/2013/12/12/clustering-with-k-means-in-python/
https://towardsdatascience.com/visualizing-clusters-with-pythons-matplolib-35ae03d87489