GridSearchCV giving score from the best estimator different from the one indicated in refit parameter

Question

I am using GridSearchCV for hyperparameter optimization:

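from sklearn.metrics import accuracy_score, balanced_accuracy_score, make_scorer, matthews_corrcoef
from sklearn.model_selection import GridSearchCV

# Each key names a metric; refit='mcc' below picks the one used to select the best model.
scoring_functions = {'mcc': make_scorer(matthews_corrcoef), 'accuracy': make_scorer(accuracy_score), 'balanced_accuracy': make_scorer(balanced_accuracy_score)}

grid_search = GridSearchCV(pipeline, param_grid=grid, scoring=scoring_functions, n_jobs=-1, cv=splitter, refit='mcc')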

I set the refit parameter to 'mcc', so I expect GridSearchCV to select the best model by maximizing this metric. Then I compute some scores:

best_model = grid_search.best_estimator_  # the refit winner of the search
metrics = {}
preds = best_model.predict(test_df)
metrics['accuracy'] = round(accuracy_score(test_labels, preds), 3)
metrics['balanced_accuracy'] = round(balanced_accuracy_score(test_labels, preds), 3)
metrics['mcc'] = round(matthews_corrcoef(test_labels, preds), 3)

I get these results:

"accuracy": 0.891, "balanced_accuracy": 0.723, "mcc": 0.871

Now, if I instead get the model's score on the same test set directly (without computing the predictions first), like this:

best_model = grid_search.best_estimator_
score = best_model.score(test_df, test_labels)

the score I get is:

"score": 0.891

As you can see, this is the accuracy, not the MCC score. The documentation for the score method says:

Returns the score on the given data, if the estimator has been refit.

This uses the score defined by scoring where provided, and the best_estimator_.score method otherwise.

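Am I misunderstanding this? I thought that if the model is refit as I specified with the refit parameter of GridSearchCV, the result of score should come from the scoring function used for the refit. Am I missing something?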

Answer 1

When you access the best_estimator_ attribute, you drop down to the underlying base model, which ignores all the configuration you set on the GridSearchCV object:

best_model = grid_search.best_estimator_
score = best_model.score(test_df, test_labels)

You should use grid_search.score() instead, and in general interact with that object. For example, when predicting, use grid_search.predict().
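As a minimal sketch of the difference, the example below compares the two score calls on held-out data. The synthetic dataset, the StandardScaler plus LogisticRegression pipeline, and the clf__C grid are assumptions made for illustration; they are not the asker's actual pipeline, grid, or splitter.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, balanced_accuracy_score,
                             make_scorer, matthews_corrcoef)
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: synthetic, imbalanced data stands in for the asker's data.
X, y = make_classification(n_samples=400, n_features=10,
                           weights=[0.8, 0.2], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

# Assumption: a simple illustrative pipeline and parameter grid.
pipeline = Pipeline([("scale", StandardScaler()),
                     ("clf", LogisticRegression(max_iter=1000))])
scoring_functions = {
    "mcc": make_scorer(matthews_corrcoef),
    "accuracy": make_scorer(accuracy_score),
    "balanced_accuracy": make_scorer(balanced_accuracy_score),
}
grid_search = GridSearchCV(pipeline, param_grid={"clf__C": [0.1, 1.0, 10.0]},
                           scoring=scoring_functions, cv=3, refit="mcc")
grid_search.fit(X_train, y_train)

# Reference values computed by hand from the predictions.
preds = grid_search.predict(X_test)
mcc_from_preds = matthews_corrcoef(y_test, preds)
acc_from_preds = accuracy_score(y_test, preds)

# The search object scores with the scorer named in refit (MCC here).
search_score = grid_search.score(X_test, y_test)
# The bare estimator falls back to its own default metric (accuracy).
estimator_score = grid_search.best_estimator_.score(X_test, y_test)

print("grid_search.score     :", search_score, "| MCC     :", mcc_from_preds)
print("best_estimator_.score :", estimator_score, "| accuracy:", acc_from_preds)
```

In this sketch grid_search.score() matches the hand-computed MCC, while best_estimator_.score() matches the hand-computed accuracy.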

These methods have the same signatures as those of a standard estimator (fit, predict, score, etc.).

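You can use the underlying model, but it will not necessarily inherit the configuration you set on the grid search object itself. In particular, a classifier's own score method returns mean accuracy, which is why you got 0.891 rather than the MCC.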
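If you do want to work with the underlying model, one option is to apply the scorers stored on the fitted search object to it explicitly. The sketch below relies on the scorer_ attribute, which for multi-metric scoring holds a dict mapping each scoring key to a callable of the form scorer(estimator, X, y). The synthetic data, the DecisionTreeClassifier, and the max_depth grid are illustrative assumptions, not part of the original question.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, make_scorer, matthews_corrcoef
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: synthetic data and a small tree stand in for the real setup.
X, y = make_classification(n_samples=300, n_features=8, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=1)

scoring_functions = {"mcc": make_scorer(matthews_corrcoef),
                     "accuracy": make_scorer(accuracy_score)}
grid_search = GridSearchCV(DecisionTreeClassifier(random_state=1),
                           param_grid={"max_depth": [2, 4, 8]},
                           scoring=scoring_functions, cv=3, refit="mcc")
grid_search.fit(X_train, y_train)

best_model = grid_search.best_estimator_

# Apply every stored scorer to the bare estimator on the test set.
test_scores = {name: scorer(best_model, X_test, y_test)
               for name, scorer in grid_search.scorer_.items()}
print(test_scores)
```

This keeps the metric choice explicit instead of depending on whichever default the estimator's own score method implements.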


scoring

python-3.x

scikit-learn

gridsearchcv