Using a Validation Set in GridSearchCV/RandomizedSearchCV or Not?

Question

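As far as I know, cross-validation (in GridSearchCV/RandomizedSearchCV) splits the data into folds, and each fold serves as the validation set once. However, the sklearn documentation recommends: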

Model selection by evaluating various parameter settings can be seen as a way to use the labeled data to “train” the parameters of the grid. When evaluating the resulting model it is important to do it on held-out samples that were not seen during the grid search process: it is recommended to split the data into a development set (to be fed to the GridSearchCV instance) and an evaluation set to compute performance metrics. This can be done by using the train_test_split utility function.

So we might use train_test_split to split the original data into training and validation data:

from sklearn.model_selection import train_test_split
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25)

and then use X_train, y_train in GridSearchCV/RandomizedSearchCV, passing X_val, y_val through fit_params as the eval_set.

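Is this really useful?

We split the original data twice (once in SearchCV and once with train_test_split) --> is that necessary?

Less data is used in SearchCV (X_train instead of X) --> lower training accuracy?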

Answer 1

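The evaluation set mentioned in the documentation here is what is usually called the test set. So you should use train_test_split to split your data into a training set and a test set.

Doing this train_test_split is useful because you can then check the model's results on a test set containing data it has never seen.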

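The training set is used during GridSearchCV to find the best parameters for your model. As mentioned in the documentation, you can use the cv parameter to train the model on n-1 folds and validate it on the remaining fold.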
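To make the fold mechanics concrete, here is a minimal sketch using KFold from scikit-learn. The ten-sample toy array is an assumption for illustration only. Note that for classifiers an integer cv uses StratifiedKFold rather than plain KFold, so the exact fold membership may differ from what is shown here.

```python
import numpy as np
from sklearn.model_selection import KFold

# Toy data (illustrative assumption): 10 samples, 1 feature.
X_train = np.arange(10).reshape(-1, 1)

kf = KFold(n_splits=5)  # no shuffling, so folds are contiguous blocks
seen_as_validation = []
for fold, (train_idx, val_idx) in enumerate(kf.split(X_train)):
    # 4 folds (8 samples) train the model, 1 fold (2 samples) validates it.
    print(f"fold {fold}: train={train_idx.tolist()} val={val_idx.tolist()}")
    seen_as_validation.extend(val_idx.tolist())

# Every sample lands in a validation fold exactly once.
print(sorted(seen_as_validation) == list(range(10)))
```

Each sample is used for validation once and for training in the other four rounds, so no separate fixed validation split is needed inside the search.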

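I recommend using cross-validation during GridSearchCV rather than a fixed validation set, because it gives you a better idea of how the model performs on unseen data.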
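The whole workflow could look roughly like the sketch below. The iris dataset, the SVC estimator, the parameter grid, and the names X_dev/X_test are all assumptions chosen for illustration, not something taken from the question.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Hold out an evaluation (test) set that the search never sees.
X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# 5-fold cross-validation on the development set picks the parameters.
param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X_dev, y_dev)

print("best params:", search.best_params_)
print("mean CV accuracy:", search.best_score_)
# refit=True (the default) retrains the best model on all of X_dev,
# so score() measures that refitted model on unseen data.
print("held-out accuracy:", search.score(X_test, y_test))
```

Because refit=True is the default, the final model is retrained on the entire development set once the search finishes. Only the held-out 25% is kept away from the final fit, which limits the cost of the second split raised in the question.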


machine-learning

scikit-learn

cross-validation

grid-search