Find mean of each cluster and assign best cluster in pandas dataframe

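I want to cluster the dataframe below on column X3, then find the mean of X3 for each cluster, and assign 3 to the highest mean, 2 to the middle mean, and 1 to the lowest mean. The dataframe is below: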

import pandas as pd
df = pd.DataFrame({'Month': [1, 1, 1, 1, 1, 1, 3, 3, 3, 3, 3, 3, 3],
                   'X1': [10, 15, 24, 32, 8, 6, 10, 23, 24, 56, 45, 10, 56],
                   'X2': [12, 90, 20, 40, 10, 15, 30, 40, 60, 42, 2, 4, 10],
                   'X3': [34, 65, 34, 87, 100, 65, 78, 67, 34, 98, 96, 46, 76]})

I clustered on column X3 as follows:

from sklearn.cluster import KMeans

def cluster(X, n_clusters):
    k_means = KMeans(n_clusters=n_clusters).fit(X.values.reshape(-1, 1))
    return k_means.labels_

cols = pd.Index(["X3"])
df[cols + "_cluster_id"] = df.groupby("Month")[cols].transform(cluster, n_clusters=3)

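Now I want to find the mean of X3 for each cluster and month, sort the means, and assign 3 to the largest mean, 2 to the middle one, and 1 to the lowest. Below is what I tried, but it does not work. How can I fix this? Thanks.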

mapping = {1: 'weak', 2: 'average', 3: 'good'}
cols=df.columns[3]
df['product_rank'] = df.groupby(['Month','X3_cluster_id'])[cols].transform('mean').rank(method='dense').astype(int)
df['product_category'] = df['product_rank'].map(mapping)

When assigning the ranks, make sure you group by month.

Full code:

import pandas as pd
from sklearn.cluster import KMeans

df = pd.DataFrame({'Month': [1, 1, 1, 1, 1, 1, 3, 3, 3, 3, 3, 3, 3],
                   'X1': [10, 15, 24, 32, 8, 6, 10, 23, 24, 56, 45, 10, 56],
                   'X2': [12, 90, 20, 40, 10, 15, 30, 40, 60, 42, 2, 4, 10],
                   'X3': [34, 65, 34, 87, 100, 65, 78, 67, 34, 98, 96, 46, 76]})

def cluster(X, n_clusters):
    # Fit k-means on a single column and return one label per row
    k_means = KMeans(n_clusters=n_clusters).fit(X.values.reshape(-1, 1))
    return k_means.labels_

cols = pd.Index(["X3"])
df[cols + "_cluster_id"] = df.groupby("Month")[cols].transform(cluster, n_clusters=3)
mapping = {1: 'weak', 2: 'average', 3: 'good'}
# Mean of X3 per (Month, cluster), broadcast back to every row
df['mean_X3'] = df.groupby(["Month","X3_cluster_id"])["X3"].transform("mean")
# Rank the cluster means within each month, then map the rank to a category
df["product_category"] = df.groupby("Month")['mean_X3'].rank(method='dense').astype(int).map(mapping)
print(df)

    Month  X1  X2   X3  X3_cluster_id    mean_X3 product_category
0       1  10  12   34              1  34.000000             weak
1       1  15  90   65              2  65.000000          average
2       1  24  20   34              1  34.000000             weak
3       1  32  40   87              0  93.500000             good
4       1   8  10  100              0  93.500000             good
5       1   6  15   65              2  65.000000          average
6       3  10  30   78              1  73.666667          average
7       3  23  40   67              1  73.666667          average
8       3  24  60   34              0  40.000000             weak
9       3  56  42   98              2  97.000000             good
10      3  45   2   96              2  97.000000             good
11      3  10   4   46              0  40.000000             weak
12      3  56  10   76              1  73.666667          average
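
To see why the grouping matters, here is a minimal sketch on a toy frame. The values and the column names `global_rank` and `rank_in_month` are made up for illustration; only `rank(method='dense')` and the per-month `groupby` come from the code above.

```python
import pandas as pd

# Toy cluster means for two months (hypothetical values).
toy = pd.DataFrame({
    "Month":   [1, 1, 1, 3, 3, 3],
    "mean_X3": [34.0, 65.0, 93.5, 40.0, 73.5, 97.0],
})

# Ranking the whole column at once mixes the two months together.
toy["global_rank"] = toy["mean_X3"].rank(method="dense").astype(int)

# Ranking inside each month restarts the count at 1 for every month.
toy["rank_in_month"] = (
    toy.groupby("Month")["mean_X3"].rank(method="dense").astype(int)
)

print(toy)
```

With the global rank, six distinct means produce ranks 1 to 6, so a mapping that only knows 1, 2 and 3 leaves some rows unmapped. The per-month rank stays within 1 to 3.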

When you apply k-means, the means are already computed, so I suggest doing a single fit and returning the labels, means, and ranks within each groupby:

import numpy as np

def cluster(X, n_clusters):
    k_means = KMeans(n_clusters=n_clusters).fit(X)
    centers = k_means.cluster_centers_.ravel()
    # argsort of argsort turns the centers into ranks (1 = lowest mean)
    ranks = np.argsort(np.argsort(centers)) + 1
    res = pd.DataFrame({'cluster': range(k_means.n_clusters),
                        'means': centers,
                        'ranks': ranks}).loc[k_means.labels_, :]
    res.index = X.index
    return res

Then all you have to do is apply the function above to get the ranks and means in one go:

mapping = {1: 'weak', 2: 'average', 3: 'good'}
# group_keys=False keeps the original row index so the result can be merged back
res = df.groupby("Month", group_keys=False)[['X3']].apply(cluster, n_clusters=3)

    cluster      means  ranks
0         1  34.000000      1
1         2  65.000000      2
2         1  34.000000      1
3         0  93.500000      3
4         0  93.500000      3
5         2  65.000000      2
6         0  73.666667      2
7         0  73.666667      2
8         1  40.000000      1
9         2  97.000000      3
10        2  97.000000      3
11        1  40.000000      1
12        0  73.666667      2
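
One detail worth checking when ranking the centers: a single `np.argsort` returns the labels in sorted order, not the rank of each label. The sketch below uses hypothetical centers (not the output of a real fit) to show the difference.

```python
import numpy as np

# Hypothetical cluster centers, indexed by cluster label 0, 1, 2.
centers = np.array([93.5, 34.0, 65.0])

# argsort answers "which label comes first, second, third?"
order = np.argsort(centers)      # labels from smallest to largest center

# argsort of that order answers "which position does each label take?"
ranks = np.argsort(order) + 1    # 1 = lowest center, 3 = highest

print(order.tolist(), ranks.tolist())
```

Here label 0 has the largest center and gets rank 3, while `order[0]` is 1 because label 1 has the smallest center. The two arrays only coincide for some orderings, so the mistake is easy to miss.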

You can then apply the map and left-join the result onto the full dataframe:

res['product_category'] = res['ranks'].map(mapping)
df.merge(res, how='left', left_index=True, right_index=True)

    Month  X1  X2   X3  cluster      means  ranks product_category
0       1  10  12   34        1  34.000000      1             weak
1       1  15  90   65        0  65.000000      2          average
2       1  24  20   34        1  34.000000      1             weak
3       1  32  40   87        2  93.500000      3             good
4       1   8  10  100        2  93.500000      3             good
5       1   6  15   65        0  65.000000      2          average
6       3  10  30   78        0  73.666667      2          average
7       3  23  40   67        0  73.666667      2          average
8       3  24  60   34        1  40.000000      1             weak
9       3  56  42   98        2  97.000000      3             good
10      3  45   2   96        2  97.000000      3             good
11      3  10   4   46        1  40.000000      1             weak
12      3  56  10   76        0  73.666667      2          average
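
The cluster numbers that KMeans assigns are arbitrary and can differ from one run to the next, so the category should depend only on the cluster means. A rough sketch of that check, with hard-coded labels standing in for a real fit (the `demo` and `swapped` frames and their label values are assumptions for illustration):

```python
import pandas as pd

# Hard-coded labels stand in for a KMeans fit (illustrative assumption).
demo = pd.DataFrame({
    "Month":  [1, 1, 1, 1, 1, 1],
    "X3":     [34, 65, 34, 87, 100, 65],
    "labels": [1, 2, 1, 0, 0, 2],
})
mapping = {1: "weak", 2: "average", 3: "good"}

demo["means"] = demo.groupby(["Month", "labels"])["X3"].transform("mean")
demo["ranks"] = demo.groupby("Month")["means"].rank(method="dense").astype(int)
demo["product_category"] = demo["ranks"].map(mapping)

# Renumber the clusters; the ranks must come out the same.
swapped = demo.assign(labels=demo["labels"].map({0: 2, 1: 0, 2: 1}))
swapped["means"] = swapped.groupby(["Month", "labels"])["X3"].transform("mean")
swapped["ranks"] = swapped.groupby("Month")["means"].rank(method="dense").astype(int)

print(demo[["X3", "labels", "means", "product_category"]])
```

Because the rank is taken from `means` rather than from `labels`, renumbering the clusters leaves `ranks` and `product_category` unchanged.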