Why doesn't GridSearchCV have best_estimator_ even after fitting?

Question

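I am learning multiclass classification with scikit-learn. My goal is to write code that includes all the metrics needed to evaluate the classification. This is my code: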

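from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, make_scorer, precision_score, recall_score, f1_score
from sklearn.model_selection import GridSearchCV
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline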

param_grid = [
    {'estimator__randomforestclassifier__n_estimators': [3, 10], 'estimator__randomforestclassifier__max_features': [2]},
#    {'estimator__randomforestclassifier__bootstrap': [False], 'estimator__randomforestclassifier__n_estimators': [3, 10], 'estimator__randomforestclassifier__max_features': [2, 3, 4]}
]

rf_classifier = OneVsRestClassifier(
    make_pipeline(RandomForestClassifier(random_state=42))
)

scoring = {'accuracy': make_scorer(accuracy_score),
           'precision_macro': make_scorer(precision_score, average = 'macro'),
           'recall_macro': make_scorer(recall_score, average = 'macro'),
           'f1_macro': make_scorer(f1_score, average = 'macro'),
           'precision_micro': make_scorer(precision_score, average = 'micro'),
           'recall_micro': make_scorer(recall_score, average = 'micro'),
           'f1_micro': make_scorer(f1_score, average = 'micro'),
           'f1_weighted': make_scorer(f1_score, average = 'weighted')}

grid_search = GridSearchCV(rf_classifier, param_grid=param_grid, cv=2, 
scoring=scoring, refit=False)
grid_search.fit(X_train_prepared, y_train)

However, when I try to get the best estimator, I receive the following error message:

print(grid_search.best_params_)
print(grid_search.best_estimator_)

AttributeError: 'GridSearchCV' object has no attribute 'best_params_'

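Question: How is it possible that I don't get a best estimator even after fitting the model? I noticed that if I set refit="some_of_the_metrics", I do get an estimator, but I don't understand why I should use it, since it selects the estimator by optimizing a single metric rather than all of them. So how can I get the best estimator with respect to all scores? And what is the point of refit?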

Note: I tried reading the documentation, but it still doesn't make sense to me.

Answer 1

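The point of refit is that the model is fitted once more, using the best parameter set found during the search and the entire dataset. The search for the best parameters uses cross-validation, which means the dataset is always split into a training set and a validation set, i.e. the entire dataset is never used for training at that stage.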

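When you define multiple metrics, you have to tell scikit-learn how it should determine what "best" means for you. For convenience, you can simply name any one of your scorers as the deciding one. In that case, the parameter set that maximizes that metric is used for the refit.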
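As a minimal sketch of the two cases described above, the following compares refit=False with refit set to the name of one scorer. The synthetic data from make_classification, the plain RandomForestClassifier (without the OneVsRestClassifier wrapper), and the variable names are assumptions made only so the example runs on its own; they are not taken from the question.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic 3-class data, used only so that the sketch is runnable (assumption).
X, y = make_classification(
    n_samples=200, n_features=8, n_informative=5, n_classes=3, random_state=0
)

param_grid = {"n_estimators": [3, 10], "max_features": [2]}
scoring = {"accuracy": "accuracy", "f1_macro": "f1_macro"}

# Several metrics and refit=False: scikit-learn has no rule for picking a
# winner, so the best_* attributes are not set.
search_no_refit = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid=param_grid, cv=2, scoring=scoring, refit=False,
)
search_no_refit.fit(X, y)
print(hasattr(search_no_refit, "best_params_"))

# refit="f1_macro": the parameter set with the highest mean f1_macro is
# selected and a fresh model is fitted on all of X, y.
search_refit = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid=param_grid, cv=2, scoring=scoring, refit="f1_macro",
)
search_refit.fit(X, y)
print(search_refit.best_params_)
print(search_refit.best_estimator_)
```

All metrics are still computed and stored in cv_results_ in both cases; refit only decides which one is used to pick the final model.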

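If you want something more sophisticated, such as taking the parameter set that yields the highest mean over all scorers, you have to pass a function to refit. Given all the computed metrics, that function must return the index of the best parameter set. This parameter set is then used to refit the model.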

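The metrics are passed as a dictionary with strings as keys and NumPy arrays as values. Each of these arrays has as many entries as there are parameter sets that were evaluated. You will find a lot of things in there. The most relevant entries are probably mean_test_*scorer-name*. For each tested parameter set, these arrays contain the mean of the scorer-name scorer computed across the cv splits.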
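To make the layout of that dictionary concrete, here is a small sketch that fits a multi-metric search and prints the mean_test_* entries of cv_results_. The synthetic data and the two scorer names are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic 3-class data, used only so that the sketch is runnable (assumption).
X, y = make_classification(
    n_samples=200, n_features=8, n_informative=5, n_classes=3, random_state=0
)

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={"n_estimators": [3, 10], "max_features": [2]},
    cv=2,
    scoring={"accuracy": "accuracy", "f1_macro": "f1_macro"},
    refit=False,  # cv_results_ is filled in even without a refit
)
search.fit(X, y)

# One "mean_test_<scorer-name>" array per scorer, one entry per parameter set.
mean_keys = sorted(k for k in search.cv_results_ if k.startswith("mean_test_"))
for key in mean_keys:
    print(key, search.cv_results_[key])

# The parameter sets in the same order as the array entries.
print(search.cv_results_["params"])
```

The index a refit function returns refers to a position in these arrays, which is also the position in cv_results_["params"].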

In code, to get the index of the parameter set that yields the highest mean over all scorers, you can do the following:


import numpy as np


def find_best_index(eval_results: dict[str, np.ndarray]) -> int:
    # Collect the mean test score arrays of all scorers into an
    # (n-scorers x n-parameter-sets) array
    means_of_splits = np.array(
        [values for name, values in eval_results.items() if name.startswith('mean_test')]
    )
    # Average over the scorers: one value per parameter set
    mean_of_all_scores = np.mean(means_of_splits, axis=0)
    # The index of the maximum value corresponds to the best parameter set
    return int(np.argmax(mean_of_all_scores))


grid_search = GridSearchCV(
    rf_classifier, param_grid=param_grid, cv=2, scoring=scoring, refit=find_best_index
)
grid_search.fit(X_train_prepared, y_train)
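Once a search with a callable refit has been fitted, the attributes asked about in the question become available. The sketch below illustrates this on synthetic data; the data, the plain RandomForestClassifier, and the helper name highest_mean_score_index are assumptions for illustration, with the helper following the same idea of averaging all mean_test_* arrays.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic 3-class data, used only so that the sketch is runnable (assumption).
X, y = make_classification(
    n_samples=200, n_features=8, n_informative=5, n_classes=3, random_state=0
)


def highest_mean_score_index(cv_results):
    """Return the index of the parameter set with the highest mean over all scorers."""
    per_scorer_means = np.array(
        [v for k, v in cv_results.items() if k.startswith("mean_test_")]
    )
    return int(np.argmax(per_scorer_means.mean(axis=0)))


search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={"n_estimators": [3, 10], "max_features": [2]},
    cv=2,
    scoring={"accuracy": "accuracy", "f1_macro": "f1_macro"},
    refit=highest_mean_score_index,
)
search.fit(X, y)

print(search.best_index_)                     # index returned by the callable
print(search.best_params_)                    # parameter set at that index
print(search.best_estimator_.predict(X[:5]))  # model refitted on all of X, y
# best_score_ is not set for a callable refit, because no single metric
# defines "the" score.
print(hasattr(search, "best_score_"))
```

Averaging raw scores treats every metric as equally important and assumes they live on a comparable scale, which holds for the accuracy, precision, recall and F1 variants used here since they all lie in [0, 1].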
