Reduce the values of a dictionary included as a column in a pandas DataFrame
I have the following Python code, which creates a DataFrame with the parameter combinations for a specified clustering algorithm.
The function is called as follows:
```python
fixed_params = {"random_state": 1234}
param_grid = {"n_clusters": range(2, 4), "max_iter": [200, 300]}
dataset = myGridSearch(df, fixed_params, param_grid, "KMeans")
print(dataset)
```
The function returns the following pandas DataFrame:
| params | num_cluster | silhouette |
| ---------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------ | ---------- |
| {'algorithm': 'auto', 'copy_x': True, 'init': 'k-means++', 'max_iter': 200, 'n_clusters': 2, 'n_init': 10, 'random_state': 1234, 'tol': 0.0001, 'verbose': 0} | 2 | 0.854996 |
| {'algorithm': 'auto', 'copy_x': True, 'init': 'k-means++', 'max_iter': 300, 'n_clusters': 2, 'n_init': 10, 'random_state': 1234, 'tol': 0.0001, 'verbose': 0} | 2 | 0.854996 |
| {'algorithm': 'auto', 'copy_x': True, 'init': 'k-means++', 'max_iter': 200, 'n_clusters': 3, 'n_init': 10, 'random_state': 1234, 'tol': 0.0001, 'verbose': 0} | 3 | 0.742472 |
| {'algorithm': 'auto', 'copy_x': True, 'init': 'k-means++', 'max_iter': 300, 'n_clusters': 3, 'n_init': 10, 'random_state': 1234, 'tol': 0.0001, 'verbose': 0} | 3 | 0.742472 |
Once I have this DataFrame, I would like the 'params' column to contain only the parameters that are actually varied, i.e. the ones stored in param_grid. The resulting DataFrame would look like this:
| params | num_cluster | silhouette |
| ------------------------------------- | ------------ | ---------- |
| {'max_iter': 200, 'n_clusters': 2} | 2 | 0.854996 |
| {'max_iter': 300, 'n_clusters': 2} | 2 | 0.854996 |
| {'max_iter': 200, 'n_clusters': 3} | 3 | 0.742472 |
| {'max_iter': 300, 'n_clusters': 3} | 3 | 0.742472 |
If you need me to share the code of the myGridSearch function, let me know in the comments.
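In broad strokes it does the following; this is only a simplified sketch, not the exact implementation, assuming scikit-learn's KMeans, ParameterGrid and silhouette_score:

```python
# Simplified sketch of what myGridSearch roughly does (hypothetical,
# not the exact code), assuming the "KMeans" algorithm.
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.model_selection import ParameterGrid

def myGridSearch(data, fixed_params, param_grid, algorithm="KMeans"):
    rows = []
    for params in ParameterGrid(param_grid):       # every combination of the grid
        model = KMeans(**fixed_params, **params)   # assumes algorithm == "KMeans"
        labels = model.fit_predict(data)
        rows.append({
            "params": model.get_params(),          # full parameter dict: fixed + defaults + varied
            "num_cluster": params["n_clusters"],
            "silhouette": silhouette_score(data, labels),
        })
    return pd.DataFrame(rows)
```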
IIUC, you can use pandas.json_normalize to create one column per key of "params", then keep only the columns whose value actually varies using nunique and boolean indexing, and finally convert back with to_dict:
```python
import pandas as pd

# expand each "params" dict into one column per parameter
df2 = pd.json_normalize(dataset['params'])

# keep the columns that actually vary, then convert each row back to a dict
dataset['params'] = pd.Series(df2.loc[:, df2.nunique().gt(1)]
                              .to_dict(orient='index'))
```
Output:
```
                               params  num_cluster  silhouette
0  {'max_iter': 200, 'n_clusters': 2}            2    0.854996
1  {'max_iter': 300, 'n_clusters': 2}            2    0.854996
2  {'max_iter': 200, 'n_clusters': 3}            3    0.742472
3  {'max_iter': 300, 'n_clusters': 3}            3    0.742472
```
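A side note on the last step: to_dict(orient='index') returns a plain dict keyed by the row index, and wrapping it in pd.Series turns that mapping into a Series aligned with dataset's index, so the assignment matches row by row. The same step split in two, for illustration:

```python
# dict of per-row dicts keyed by the original index, e.g.
# {0: {'max_iter': 200, 'n_clusters': 2}, 1: {'max_iter': 300, 'n_clusters': 2}, ...}
row_dicts = df2.loc[:, df2.nunique().gt(1)].to_dict(orient='index')

# pd.Series aligns the per-row dicts on the index before assignment
dataset['params'] = pd.Series(row_dicts)
```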
Intermediate:
```
df2.nunique()

algorithm       1
copy_x          1
init            1
max_iter        2
n_clusters      2
n_init          1
random_state    1
tol             1
verbose         1
dtype: int64
```
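Alternatively, if param_grid is still in scope, you can filter each dict directly against its keys instead of detecting the varying columns; a short sketch, assuming the same dataset and param_grid as in the question:

```python
# keep only the keys that were varied in param_grid
varied = set(param_grid)  # {'n_clusters', 'max_iter'}
dataset['params'] = dataset['params'].apply(
    lambda d: {k: v for k, v in d.items() if k in varied}
)
```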