Apply kmeans on each group in a pandas DataFrame and save the clusters in a new column of the same DataFrame
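I have a DataFrame that contains some embeddings in column D. I want to group the data by column A first and then apply kmeans to each group. Each group may contain NaN values, so in the apply function I set the number of clusters to the number of non-NaN values in column D divided by 2 (n_clusters = int(not_na_mask.sum()/2)).
In the apply function I return df['cluster'].values.tolist(). I printed this value and it is correct for each group, but after running the whole script, df_test['clusters'] contains only NaN in all rows.
Sample DataFrame: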
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

df_test = pd.DataFrame({'A': ['aa', 'bb', 'aa', 'bb', 'aa', 'bb', 'aa', 'cc', 'aa', 'aa', 'bb', 'bb', 'bb', 'cc', 'bb', 'aa', 'cc', 'aa'],
                        'B': [1, 2, np.nan, 4, 6, np.nan, 7, 8, np.nan, 1, 4, 3, 4, 7, 5, 7, 9, np.nan],
                        'D': [[2, 0, 1, 5, 4, 0], np.nan, [4, 7, 0, 1, 0, 2], [1., 1, 1, 2, 0, 5], np.nan, [1, 6, 3, 2, 1, 9], [4, 2, 1, 0, 0, 0], [3, 5, 6, 8, 8, 0], np.nan,
                              np.nan, [2, 5, 1, 7, 4, 0], [4, 2, 0, 4, 0, 0], [1., 0, 1, 8, 0, 9], [1, 0, 7, 2, 1, 0], np.nan, [1, 1, 5, 0, 8, 0], [4, 1, 6, 1, 1, 0], np.nan]})
df_test:
A B D
0 aa 1.0 [2, 0, 1, 5, 4, 0]
1 bb 2.0 NaN
2 aa NaN [4, 7, 0, 1, 0, 2]
3 bb 4.0 [1.0, 1, 1, 2, 0, 5]
4 aa 6.0 NaN
5 bb NaN [1, 6, 3, 2, 1, 9]
6 aa 7.0 [4, 2, 1, 0, 0, 0]
7 cc 8.0 [3, 5, 6, 8, 8, 0]
8 aa NaN NaN
9 aa 1.0 NaN
10 bb 4.0 [2, 5, 1, 7, 4, 0]
11 bb 3.0 [4, 2, 0, 4, 0, 0]
12 bb 4.0 [1.0, 0, 1, 8, 0, 9]
13 cc 7.0 [1, 0, 7, 2, 1, 0]
14 bb 5.0 NaN
15 aa 7.0 [1, 1, 5, 0, 8, 0]
16 cc 9.0 [4, 1, 6, 1, 1, 0]
17 aa NaN NaN
My method for computing kmeans:
def apply_kmeans_on_each_category(df):
    not_na_mask = df['D'].notna()
    embedding = df[not_na_mask]['D']
    n_clusters = int(not_na_mask.sum()/2)
    if n_clusters > 1:
        df['cluster'] = np.nan
        kmeans = KMeans(n_clusters=n_clusters, random_state=0).fit(embedding.tolist())
        df.loc[not_na_mask, 'cluster'] = kmeans.labels_
        return df['cluster'].values.tolist()
    else:
        return [np.nan] * len(df)

df_test['clusters'] = df_test.groupby('A').apply(apply_kmeans_on_each_category)
Result:
df_test['clusters']:
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
8 NaN
9 NaN
10 NaN
11 NaN
12 NaN
13 NaN
14 NaN
15 NaN
16 NaN
17 NaN
Name: clusters, dtype: object
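The all-NaN column above is consistent with how pandas aligns on the index during column assignment: when the applied function returns a plain list, groupby(...).apply yields one list per group, indexed by the group key ('aa', 'bb', 'cc'), and none of those labels match the row index 0..17 of df_test. A minimal sketch of that effect, assuming a small stand-in frame and a hypothetical helper per_group_list in place of the clustering step (it selects column D before apply to keep the example short):

```python
import numpy as np
import pandas as pd

# Small stand-in frame; the values are illustrative only.
df = pd.DataFrame({"A": ["aa", "bb", "aa", "bb"], "D": [1.0, 2.0, 3.0, 4.0]})

def per_group_list(group):
    # Hypothetical stand-in for the clustering step:
    # returns one value per row of the group, as a plain Python list.
    return [0.0] * len(group)

# One list per group, indexed by the group key rather than by the row labels.
per_group = df.groupby("A")["D"].apply(per_group_list)
print(per_group.index.tolist())

# Assignment aligns on the index; 'aa'/'bb' never match 0..3, so every row is NaN.
df["clusters"] = per_group
print(df["clusters"].isna().all())
```

Returning a Series that carries the group's own row index, or using transform, avoids the mismatch.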
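I made a few minor changes. The main change is to use transform instead of apply. Also, there is no need to pass the whole grouped DataFrame; you can pass column D directly, since it is the only column you use: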
def apply_kmeans_on_each_category(s):
    # s is column D of a single group (a Series), not the whole DataFrame
    not_na_mask = s.notna()
    embedding = s.loc[not_na_mask]
    n_clusters = int(not_na_mask.sum()/2)
    # start with NaN everywhere, keeping the group's original row index
    op = pd.Series(np.nan, index=s.index)
    if n_clusters > 1:
        kmeans = KMeans(n_clusters=n_clusters, random_state=0).fit(embedding.tolist())
        op.loc[not_na_mask] = kmeans.labels_.tolist()
    return op

df_test['clusters'] = df_test.groupby('A')['D'].transform(apply_kmeans_on_each_category)
Output:
A B D clusters
0 aa 1.0 [2, 0, 1, 5, 4, 0] 0.0
1 bb 2.0 NaN NaN
2 aa NaN [4, 7, 0, 1, 0, 2] 1.0
3 bb 4.0 [1.0, 1, 1, 2, 0, 5] 0.0
4 aa 6.0 NaN NaN
5 bb NaN [1, 6, 3, 2, 1, 9] 0.0
6 aa 7.0 [4, 2, 1, 0, 0, 0] 1.0
7 cc 8.0 [3, 5, 6, 8, 8, 0] NaN
8 aa NaN NaN NaN
9 aa 1.0 NaN NaN
10 bb 4.0 [2, 5, 1, 7, 4, 0] 1.0
11 bb 3.0 [4, 2, 0, 4, 0, 0] 1.0
12 bb 4.0 [1.0, 0, 1, 8, 0, 9] 0.0
13 cc 7.0 [1, 0, 7, 2, 1, 0] NaN
14 bb 5.0 NaN NaN
15 aa 7.0 [1, 1, 5, 0, 8, 0] 0.0
16 cc 9.0 [4, 1, 6, 1, 1, 0] NaN
17 aa NaN NaN NaN
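For comparison, the same per-group clustering can also be written as an explicit loop that writes labels back by row index, which makes the alignment step visible. This is only a sketch under assumptions: cluster_within_groups, its key and col parameters, and the demo frame are hypothetical names that are not part of the question, and n_init=10 is set explicitly by my choice so the number of initializations does not depend on the scikit-learn version.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

def cluster_within_groups(df, key="A", col="D"):
    """Return a float Series of per-group KMeans labels aligned to df.index."""
    labels = pd.Series(np.nan, index=df.index, dtype=float)
    valid = df[df[col].notna()]  # keep only rows that have an embedding
    for _, group in valid.groupby(key):
        n_clusters = len(group) // 2  # same rule as in the question
        if n_clusters > 1:
            X = np.vstack(group[col].to_list())  # shape: (rows, embedding size)
            km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
            # Write labels back using the original row labels of this group.
            labels.loc[group.index] = km.fit_predict(X)
    return labels

# Illustrative data: 'aa' has four embeddings (two clusters), 'cc' has three (skipped).
demo = pd.DataFrame({
    "A": ["aa", "aa", "aa", "aa", "aa", "cc", "cc", "cc"],
    "D": [[0, 0], [0, 1], np.nan, [10, 10], [10, 11], [1, 2], [3, 4], [5, 6]],
})
demo["clusters"] = cluster_within_groups(demo)
print(demo)
```

Rows without an embedding, and groups too small to give more than one cluster, keep NaN, as in the transform version. Cluster numbers are only meaningful within a group.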