Understanding sklearn GridSearchCV's best_score_ and best_estimator_

Question

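In the code below, I am trying to understand the relationship between best_estimator_ and best_score_. I thought I should be able to get best_score_ (or at least something very close to it) by scoring the predictions of best_estimator_, like so: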

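import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

# solver='liblinear' is needed on scikit-learn >= 0.22, where the default
# solver (lbfgs) does not support the L1 penalty
classifier = GridSearchCV(LogisticRegression(penalty='l1', solver='liblinear'),
                          {'C': 10**(np.linspace(1, 6, num=11))},
                          scoring='neg_log_loss')

classifier.fit(X_train, y_train)

# Score the refitted best model on the training data
y_pred = classifier.best_estimator_.predict(X_train)
print(f'{log_loss(y_train, y_pred)}')
print(f'{classifier.best_score_}')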

But I get the following output (the numbers do not vary much between runs):

7.841241697018637
-0.5470694752031108

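I understand that best_score_ is computed as the mean over the cross-validation iterations, but surely that should be a good approximation of (perhaps even an unbiased estimator of?) the metric computed on the whole set. I don't understand why the two values are so different, so I assume I have made an implementation error.

How can I compute classifier.best_score_ myself?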

Answer 1

log_loss is mostly defined for the output of predict_proba(). I assume GridSearchCV calls predict_proba() internally and then computes the score.

Change predict() to predict_proba() and you will see similar results.

# Use class probabilities, not hard labels, and score the same data as y_train
y_pred = classifier.best_estimator_.predict_proba(X_train)

print(log_loss(y_train, y_pred))
print(classifier.best_score_)

On the iris dataset, I get the following output:

0.165794760809
-0.185370083771

Apart from the sign, these look close.
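To see why the predict() version in the question ends up so much larger, it may help to look at what log_loss does with hard labels. The sketch below uses made-up toy values (y_true and proba_pos are illustrative and not taken from the question's data). Hard 0/1 predictions are treated as fully confident probabilities, so every misclassified sample contributes a very large penalty once the values are clipped away from exactly 0 and 1.

```python
import numpy as np
from sklearn.metrics import log_loss

# Toy binary problem (illustrative values only)
y_true = np.array([0, 0, 1, 1])

# Hypothetical predicted probabilities of the positive class,
# i.e. what predict_proba(...)[:, 1] would return
proba_pos = np.array([0.2, 0.6, 0.7, 0.9])

# What predict() would return: hard labels from a 0.5 threshold.
# The second sample is misclassified.
hard_labels = (proba_pos >= 0.5).astype(float)

loss_proba = log_loss(y_true, proba_pos)   # moderate penalty for the 0.6
loss_hard = log_loss(y_true, hard_labels)  # huge penalty for the one mistake

print(f'log loss from probabilities: {loss_proba}')
print(f'log loss from hard labels:   {loss_hard}')
```

With a single mistake out of four samples, the hard-label loss is already far larger than the probability-based one, which is the same pattern as in the question's output.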

Update:

It looks like this is the case: when you pass 'neg_log_loss' as a string to GridSearchCV, this is how it is initialized as a scorer to be passed on to the _fit_and_score() method of GridSearchCV():

log_loss_scorer = make_scorer(log_loss, greater_is_better=False,
                              needs_proba=True)

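As you can see, needs_proba is True, which means predict_proba() will be used for scoring. greater_is_better=False makes the scorer negate the loss so that higher is always better, which is why best_score_ is negative. (In recent scikit-learn releases, needs_proba has been replaced by response_method='predict_proba' in make_scorer.)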
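As for computing best_score_ yourself, here is a minimal sketch on the iris dataset. It assumes a fixed StratifiedKFold splitter and a small illustrative parameter grid (both are my choices, not from the question), and it uses get_scorer to obtain the 'neg_log_loss' scorer by name. The idea is that best_score_ is the mean held-out score of the best parameter setting, so it can be read from cv_results_ or reproduced by rerunning cross-validation with best_params_ on the same splits.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import get_scorer, log_loss
from sklearn.model_selection import (GridSearchCV, StratifiedKFold,
                                     cross_val_score)

X, y = load_iris(return_X_y=True)

# Fixed, unshuffled splitter so the search and the manual run share folds
cv = StratifiedKFold(n_splits=5)
param_grid = {'C': [0.1, 1.0, 10.0]}  # illustrative grid

search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid,
                      scoring='neg_log_loss', cv=cv)
search.fit(X, y)

# 1) best_score_ is the mean test score of the best candidate
from_results = search.cv_results_['mean_test_score'][search.best_index_]

# 2) Rerun cross-validation with the best parameters on the same folds
best_model = LogisticRegression(max_iter=1000, **search.best_params_)
fold_scores = cross_val_score(best_model, X, y,
                              scoring='neg_log_loss', cv=cv)
manual = fold_scores.mean()

# 3) The scorer itself: negated log loss of predict_proba().
#    This is a training-set score, so it is close to best_score_
#    but not expected to be equal to it.
scorer = get_scorer('neg_log_loss')
train_score = scorer(search.best_estimator_, X, y)
check = -log_loss(y, search.best_estimator_.predict_proba(X))

print(search.best_score_, from_results, manual)
print(train_score, check)
```

The first three printed values should agree, since they are all the mean over the same held-out folds. The training-set score on the last line is a different quantity: best_estimator_ is refitted on all of the data and then scored on that same data, so it is usually somewhat more optimistic than best_score_.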

Tags: python, statistics, machine-learning, scikit-learn, cross-validation