Why does sklearn cross_validate() refit?

Question

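I understand why a tool like GridSearchCV refits. It explores a range of hyperparameter values and, after comparing scores, refits the estimator on the whole dataset using the best parameters found.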

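That makes sense, but my question is about cross_validate, where only a single set of hyperparameters is used. My understanding is that its purpose is to show how well the model generalizes across the different folds of a train/test split. Why would it refit here?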

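I understand why n fits happen on the n folds of the data. But according to the documentation a refit also happens, as described under the error_score parameter: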

error_score : 'raise' or numeric

Value to assign to the score if an error occurs in estimator fitting. If set to 'raise', the error is raised. If a numeric value is given, FitFailedWarning is raised. This parameter does not affect the refit step, which will always raise the error.

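So on top of the n fits there is one additional fit, and I don't understand why. cross_validate has no predict method, so even if it somehow distinguished between the models and picked a 'best' one (although they all have exactly the same hyperparameters), there would be no need to refit.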

To demonstrate this, I created an MLPRegressor model that I know will have exploding gradients on my dataset:

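from sklearn.neural_network import MLPRegressor

# Deep ReLU network trained with plain SGD; on this dataset the gradients explode.
DL = MLPRegressor(
    hidden_layer_sizes=(200, 200, 200), activation='relu', max_iter=16,
    solver='sgd', learning_rate='invscaling', power_t=0.9)
DL.fit(df_training[predictor_cols], df_training[target_col])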

The model fits without an error (which shows there are no NaN or inf values in my dataset), but it does give a warning:

RuntimeWarning: overflow encountered in matmul

This indicates exploding gradients, and as a result every prediction comes out as NaN.
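One way to confirm that a fitted network has actually diverged, rather than inferring it from the warning, is to inspect its weights. The following is a minimal sketch: has_nonfinite_weights is a hypothetical helper (not part of scikit-learn), the data is synthetic, and it assumes the overflow leaves NaN or inf values in coefs_ or intercepts_.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor


def has_nonfinite_weights(model):
    """Hypothetical helper: True if any fitted weight or bias is NaN or inf."""
    arrays = list(model.coefs_) + list(model.intercepts_)
    return any(not np.isfinite(a).all() for a in arrays)


# Synthetic, well-scaled data (an assumption for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = X.sum(axis=1)

# A small network with default settings, which should train without diverging.
model = MLPRegressor(hidden_layer_sizes=(8,), max_iter=50, random_state=0)
model.fit(X, y)

print("non-finite weights:", has_nonfinite_weights(model))
```

Running the same check on a model that emitted the overflow warning would show whether its parameters are still usable for prediction.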

From my understanding of the cross_validate documentation, if I pass the following (with error_score=1):

from sklearn.model_selection import cross_validate

DL = MLPRegressor(
    hidden_layer_sizes=(200, 200, 200), activation='relu', max_iter=16,
    solver='sgd', learning_rate='invscaling', power_t=0.9)

DL_CV = cross_validate(
    DL, df_training[predictor_cols], y=df_training[target_col],
    cv=None, n_jobs=1, pre_dispatch=5, return_train_score=False,
    return_estimator=True, error_score=1)

I should get a FitFailedWarning but no error. Instead, training does not complete and the following error is raised:

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

I therefore concluded that the error comes from the refit, but I don't see what the purpose of the refit would be...

Answer 1

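cross_validate does not refit; you can verify this from the source code. The documentation is incorrect, probably copied from the GridSearchCV documentation. You should open an issue or submit a pull request; if you would rather not, I can.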
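Besides reading the source, the number of fits can be checked empirically. Here is a minimal sketch, assuming synthetic data and a hypothetical CountingLinearRegression subclass (not part of scikit-learn) that counts calls to fit(). If cross_validate performed an extra refit, the counter would exceed the number of folds.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate


class CountingLinearRegression(LinearRegression):
    """Hypothetical helper: a LinearRegression that counts calls to fit()."""

    fit_calls = 0  # class attribute, so the count is shared by every clone

    def fit(self, X, y, sample_weight=None):
        type(self).fit_calls += 1
        return super().fit(X, y, sample_weight=sample_weight)


# Synthetic regression data (an assumption for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=60)

# n_jobs=None keeps everything in one process so the shared counter is reliable.
results = cross_validate(CountingLinearRegression(), X, y, cv=5, n_jobs=None)

n_folds = len(results["test_score"])
print(f"folds: {n_folds}, fit() calls: {CountingLinearRegression.fit_calls}")
```

The counter lives on the class rather than the instance because cross_validate clones the estimator for each fold, so an instance attribute would be lost.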

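I'm not sure where your final error comes from, though. Perhaps it is raised while scoring a successfully fitted model rather than during fitting? If the original fit only raises a warning, then no exception occurs during fitting, so by default it is not caught as a fit failure.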
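To see which stage raises, the folds can be run by hand with fitting and scoring wrapped separately. This is a rough sketch, not how cross_validate is implemented: locate_failures and NaNPredictor are hypothetical names, the data is synthetic, and NaNPredictor merely imitates a model that fits without error but predicts NaN.

```python
import numpy as np
from sklearn.base import BaseEstimator, clone
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold


class NaNPredictor(BaseEstimator):
    """Hypothetical stand-in for a diverged model: fit succeeds, predictions are NaN."""

    def fit(self, X, y):
        return self

    def predict(self, X):
        return np.full(X.shape[0], np.nan)


def locate_failures(estimator, X, y, n_splits=5):
    """Return 'fit', 'score' or 'ok' for each fold, depending on where an error occurs."""
    stages = []
    for train_idx, test_idx in KFold(n_splits=n_splits).split(X):
        model = clone(estimator)
        try:
            model.fit(X[train_idx], y[train_idx])
        except Exception:
            stages.append("fit")
            continue
        try:
            # r2_score validates its inputs and rejects NaN predictions.
            r2_score(y[test_idx], model.predict(X[test_idx]))
        except ValueError:
            stages.append("score")
            continue
        stages.append("ok")
    return stages


# Synthetic data (an assumption for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X.sum(axis=1)

print(locate_failures(NaNPredictor(), X, y))
print(locate_failures(LinearRegression(), X, y))
```

If the MLPRegressor from the question reports "score" for its folds, the ValueError comes from evaluating NaN predictions, not from any refit.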


Tags: python, machine-learning, scikit-learn, cross-validation