KMeans with groupby on a DataFrame and getting the cluster labels in Python
I am working with a DataFrame like this:
import pandas as pd
df = pd.DataFrame({'ID': ['12345','55689','56964','49649','89645','0001',
                          '033','03330','064963','306193','03661','1666'],
                   'Culture': ['A','A','A','A','A','A','B','B','B','B','B','B'],
                   'H': [30,42,25,32,12,10,4,6,5,10,24,21],
                   'S': [10,76,100,23,65,94,67,24,67,54,87,81],
                   'mean': [23,78,95,52,60,76,68,92,34,76,34,12]})
First I selected only one group:
df_1 = df.loc[df['Culture'] == 'A']
and ran kmeans on it like this:
m = df_1.loc[:, ['H', 'mean', 'mean']].to_numpy()
# The model has to be fitted before predict() can be called
km = KMeans(n_clusters=3, init='random', max_iter=100, n_init=1, verbose=1).fit(m)
kmeans_predict = km.predict(m)
# array([0, 2, 1, 1, 0, 0], dtype=int32)
clusters = {}
n = 0
# list_x1 is defined elsewhere in my code (not shown here)
for item in kmeans_predict:
    if item in clusters:
        clusters[item].append(list_x1[n])
    else:
        clusters[item] = [list_x1[n]]
    n += 1
After some more code I end up with something like this:
ID Culture S mean Cluster
12345 A 10 23 0
55689 A 76 78 2
56964 A 100 95 1
49649 A 23 52 1
89645 A 65 60 0
00001 A 94 92 0
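My goal is to run kmeans on every group in this DataFrame, but I don't want to do it one group at a time (the groups are the Culture values, and there are more than 75 of them). I tried something like this: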
def cluster(X):
    k_means = KMeans(n_clusters=3).fit(m).groupby('CUL')
    X['cluster'] = k_means.labels_
    return X
df = cities_e.groupby('CUL').apply(cluster)
I am trying to do all of this clustering within each 'Culture' group and get the predicted cluster for every row in the DataFrame.
You can simply wrap your code in a function and use groupby.apply. However, to keep the row index, return a Series rather than an array:
import pandas as pd
from sklearn.cluster import KMeans
def get_cluster(df_1):
    m = df_1.loc[:, ['H', 'mean', 'mean']].to_numpy()
    km = KMeans(n_clusters=3, init='random', max_iter=100, n_init=1, verbose=0).fit(m)
    kmeans_predict = km.predict(m)
    # Keep the group's index so the labels align with the original rows
    return pd.Series(kmeans_predict, index=df_1.index)
# droplevel(0) removes the 'Culture' level that apply adds to the index
df['Cluster'] = df.groupby('Culture').apply(get_cluster).droplevel(0)
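Output (the cluster numbers can differ between runs, because init='random' is used without a fixed random_state):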
ID Culture H S mean Cluster
0 12345 A 30 10 23 2
1 55689 A 42 76 78 0
2 56964 A 25 100 95 1
3 49649 A 32 23 52 2
4 89645 A 12 65 60 2
5 0001 A 10 94 76 1
6 033 B 4 67 68 1
7 03330 B 6 24 92 0
8 064963 B 5 67 34 2
9 306193 B 10 54 76 0
10 03661 B 24 87 34 2
11 1666 B 21 81 12 2
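With more than 75 groups, two practical issues may come up: labels that change from run to run, and groups that have fewer rows than clusters (KMeans raises a ValueError in that case). The following is only a sketch of one way to handle both. The helper name get_cluster_safe, the placeholder label -1, the seed, and the feature columns H, S and mean are assumptions for illustration, not part of the question or the answer above. It concatenates the per-group Series instead of calling groupby.apply, which should give the same alignment by index.

```python
import pandas as pd
from sklearn.cluster import KMeans

N_CLUSTERS = 3  # same number of clusters as in the answer above


def get_cluster_safe(group, features=("H", "S", "mean"), seed=0):
    """Cluster one group and return labels aligned to the group's index.

    The feature columns and the seed are assumptions for this sketch.
    """
    # KMeans needs at least as many rows as clusters; smaller groups
    # get the placeholder label -1 here (a hypothetical convention).
    if len(group) < N_CLUSTERS:
        return pd.Series(-1, index=group.index, dtype="int64")
    km = KMeans(n_clusters=N_CLUSTERS, n_init=10, random_state=seed)
    labels = km.fit_predict(group[list(features)].to_numpy())
    return pd.Series(labels, index=group.index, dtype="int64")


df = pd.DataFrame({
    "ID": ["12345", "55689", "56964", "49649", "89645", "0001",
           "033", "03330", "064963", "306193", "03661", "1666"],
    "Culture": list("AAAAAABBBBBB"),
    "H": [30, 42, 25, 32, 12, 10, 4, 6, 5, 10, 24, 21],
    "S": [10, 76, 100, 23, 65, 94, 67, 24, 67, 54, 87, 81],
    "mean": [23, 78, 95, 52, 60, 76, 68, 92, 34, 76, 34, 12],
})

# One Series per group; assignment aligns them on the original row index.
parts = [get_cluster_safe(g) for _, g in df.groupby("Culture")]
df["Cluster"] = pd.concat(parts)
print(df)
```

Fixing random_state makes repeated runs on the same data comparable, but the label numbers themselves still carry no meaning across groups: cluster 0 in A is unrelated to cluster 0 in B.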
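If you want the cluster numbers to be different across cultures, you can assign a group number to each culture and then use it to offset the cluster numbers: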
def get_cluster(df_1):
    m = df_1.loc[:, ['H', 'mean', 'mean']].to_numpy()
    km = KMeans(n_clusters=3, init='random', max_iter=100, n_init=1, verbose=0).fit(m)
    # Shift the labels by 3 per group: A gets 0-2, B gets 3-5, and so on
    kmeans_predict = km.predict(m) + 3 * df_1['Culture_id'].iat[0]
    return pd.Series(kmeans_predict, index=df_1.index)
g = df.groupby('Culture')
df['Culture_id'] = g.ngroup()
df['Cluster'] = g.apply(get_cluster).droplevel(0)
df = df.drop(columns=['Culture_id'])
Output:
ID Culture H S mean Cluster
0 12345 A 30 10 23 0
1 55689 A 42 76 78 1
2 56964 A 25 100 95 1
3 49649 A 32 23 52 0
4 89645 A 12 65 60 2
5 0001 A 10 94 76 2
6 033 B 4 67 68 3
7 03330 B 6 24 92 5
8 064963 B 5 67 34 4
9 306193 B 10 54 76 3
10 03661 B 24 87 34 4
11 1666 B 21 81 12 4
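The same offset can also be applied after the fact, without a temporary Culture_id column, once the per-group labels are already in the DataFrame. A small sketch, assuming the per-group labels from the first output table above; the column names Cluster_global and Cluster_name are made up for illustration:

```python
import pandas as pd

N_CLUSTERS = 3  # clusters per group, as in the answer above

# Per-group labels copied from the first output table above.
df = pd.DataFrame({
    "Culture": list("AAAAAABBBBBB"),
    "Cluster": [2, 0, 1, 2, 2, 1, 1, 0, 2, 0, 2, 2],
})

# ngroup() numbers the groups in sorted key order: A -> 0, B -> 1.
group_no = df.groupby("Culture").ngroup()

# Numeric label that is unique across groups: A uses 0-2, B uses 3-5.
df["Cluster_global"] = group_no * N_CLUSTERS + df["Cluster"]

# Alternative: a readable string label such as "A_2" or "B_0".
df["Cluster_name"] = df["Culture"] + "_" + df["Cluster"].astype(str)

print(df)
```

The string variant may be easier to read with many groups, since the group is visible in the label itself.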