Reduce the values of a dictionary included as a column in a pandas DataFrame

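I have the following Python code, which builds a DataFrame from the parameter combinations of a specified clustering algorithm.

The function is called as follows: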

fixed_params = {"random_state": 1234} 
param_grid = {"n_clusters": range(2,4), "max_iter": [200, 300]}

dataset = myGridSearch(df, fixed_params, param_grid, "KMeans")
print(dataset)

The function returns the following pandas DataFrame:

| params                                                                                                                                                           | num_cluster  | silhouette |
| ---------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------ | ---------- |
| {'algorithm': 'auto', 'copy_x': True, 'init': 'k-means++', 'max_iter': 200, 'n_clusters': 2, 'n_init': 10, 'random_state': 1234, 'tol': 0.0001, 'verbose': 0}    | 2            | 0.854996   | 
| {'algorithm': 'auto', 'copy_x': True, 'init': 'k-means++', 'max_iter': 300, 'n_clusters': 2, 'n_init': 10, 'random_state': 1234, 'tol': 0.0001, 'verbose': 0}    | 2            | 0.854996   | 
| {'algorithm': 'auto', 'copy_x': True, 'init': 'k-means++', 'max_iter': 200, 'n_clusters': 3, 'n_init': 10, 'random_state': 1234, 'tol': 0.0001, 'verbose': 0}    | 3            | 0.742472   | 
| {'algorithm': 'auto', 'copy_x': True, 'init': 'k-means++', 'max_iter': 300, 'n_clusters': 3, 'n_init': 10, 'random_state': 1234, 'tol': 0.0001, 'verbose': 0}    | 3            | 0.742472   | 

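Once I have this DataFrame, I would like the 'params' column to contain only the parameters that actually vary, i.e. the ones stored in param_grid. The resulting DataFrame should look like this: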

| params                                | num_cluster  | silhouette |
| ------------------------------------- | ------------ | ---------- |
| {'max_iter': 200, 'n_clusters': 2}    | 2            | 0.854996   | 
| {'max_iter': 300, 'n_clusters': 2}    | 2            | 0.854996   | 
| {'max_iter': 200, 'n_clusters': 3}    | 3            | 0.742472   | 
| {'max_iter': 300, 'n_clusters': 3}    | 3            | 0.742472   | 

If you need me to post the code of the myGridSearch function, let me know in the comments.

IIUC, you can use pandas.json_normalize to expand "params" into one column per parameter, then keep only the columns that have more than one unique value using nunique and boolean indexing, and finally convert back to dictionaries with to_dict:

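import pandas as pd

# Expand each params dict into one column per parameter
df2 = pd.json_normalize(dataset['params'])
# Keep only the columns whose values vary, then rebuild one dict per row
dataset['params'] = pd.Series(df2.loc[:, df2.nunique().gt(1)]
                                 .to_dict(orient='index'))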

Output:

                               params  num_cluster  silhouette
0  {'max_iter': 200, 'n_clusters': 2}            2    0.854996
1  {'max_iter': 300, 'n_clusters': 2}            2    0.854996
2  {'max_iter': 200, 'n_clusters': 3}            3    0.742472
3  {'max_iter': 300, 'n_clusters': 3}            3    0.742472
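
One caveat with the snippet above: depending on the pandas version, the rebuilt Series may be labelled 0..n-1 rather than with the index of `dataset`, so the assignment is only guaranteed to line up when `dataset` has the default RangeIndex. Below is a minimal sketch of an index-safe variant. The helper name `keep_varying_params` and the shortened sample rows (with a custom index 10..13) are made up for illustration, since `myGridSearch` itself is not shown.

```python
import pandas as pd


def keep_varying_params(dataset: pd.DataFrame, column: str = "params") -> pd.DataFrame:
    """Return a copy in which each dict in `column` keeps only the keys
    whose values differ across rows."""
    # Passing a plain list gives a frame with a fresh 0..n-1 index
    expanded = pd.json_normalize(dataset[column].tolist())
    # Boolean mask over columns: True where more than one distinct value occurs
    varying = expanded.loc[:, expanded.nunique() > 1]
    out = dataset.copy()
    # Re-attach the original index explicitly so rows cannot be misaligned
    out[column] = pd.Series(varying.to_dict(orient="records"), index=dataset.index)
    return out


# Hypothetical sample mimicking the table above, with fewer fixed parameters
dataset = pd.DataFrame(
    {
        "params": [
            {"init": "k-means++", "max_iter": 200, "n_clusters": 2, "random_state": 1234},
            {"init": "k-means++", "max_iter": 300, "n_clusters": 2, "random_state": 1234},
            {"init": "k-means++", "max_iter": 200, "n_clusters": 3, "random_state": 1234},
            {"init": "k-means++", "max_iter": 300, "n_clusters": 3, "random_state": 1234},
        ],
        "num_cluster": [2, 2, 3, 3],
        "silhouette": [0.854996, 0.854996, 0.742472, 0.742472],
    },
    index=[10, 11, 12, 13],  # deliberately not a RangeIndex
)

result = keep_varying_params(dataset)
print(result)
```

The helper returns a copy, so the original `dataset` keeps its full parameter dicts.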

Intermediate step:

df2.nunique()

algorithm       1
copy_x          1
init            1
max_iter        2
n_clusters      2
n_init          1
random_state    1
tol             1
verbose         1
dtype: int64
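
Since the question says the varying parameters are exactly the keys of `param_grid`, another option is to filter each dict by those keys directly, without comparing rows. This is only a sketch of that alternative, not the method shown above; `sample` is a shortened, made-up stand-in for the `myGridSearch` output.

```python
import pandas as pd

# Same grid as in the question
param_grid = {"n_clusters": range(2, 4), "max_iter": [200, 300]}

# Hypothetical stand-in for the DataFrame returned by myGridSearch
sample = pd.DataFrame(
    {
        "params": [
            {"init": "k-means++", "max_iter": 200, "n_clusters": 2, "random_state": 1234},
            {"init": "k-means++", "max_iter": 300, "n_clusters": 2, "random_state": 1234},
            {"init": "k-means++", "max_iter": 200, "n_clusters": 3, "random_state": 1234},
            {"init": "k-means++", "max_iter": 300, "n_clusters": 3, "random_state": 1234},
        ],
        "num_cluster": [2, 2, 3, 3],
        "silhouette": [0.854996, 0.854996, 0.742472, 0.742472],
    }
)

# For each row, keep only the entries whose key appears in param_grid
sample["params"] = sample["params"].apply(
    lambda params: {key: params[key] for key in param_grid if key in params}
)
print(sample)
```

Unlike the nunique approach, this keeps a grid parameter even when its grid lists a single candidate value, and it works on a one-row DataFrame, where no column can have more than one unique value.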