Difference between doing cross-validation and validation_data/validation_split in Keras

First, I split the dataset into training and test sets, for example:

X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.4, random_state=999)

Then I use GridSearchCV with cross-validation to find the best-performing model:

validator = GridSearchCV(estimator=clf, param_grid=param_grid, scoring="accuracy", cv=cv)

By doing this, I get the following behavior:

A model is trained using k-1 of the folds as training data; the resulting model is validated on the remaining part of the data (scikit-learn.org)
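To make the quoted description concrete, here is a minimal pure-Python sketch of how k folds rotate the held-out part. The helper name `k_fold_indices` is hypothetical and only illustrates the idea; it uses contiguous, unshuffled folds and is not scikit-learn's implementation.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs for k contiguous, unshuffled folds."""
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        val_idx = list(range(start, stop))
        # Everything outside the current fold is used for training.
        train_idx = [i for i in range(n_samples) if i < start or i >= stop]
        yield train_idx, val_idx
        start = stop


folds = list(k_fold_indices(n_samples=6, k=3))
for train_idx, val_idx in folds:
    print("train:", train_idx, "validate:", val_idx)
```

Each sample is held out exactly once, so the validation part is different in every step.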

However, when I read the documentation for the Keras fit function, it introduces two more terms:

validation_split: Float between 0 and 1. Fraction of the training data to be used as validation data. The model will set apart this fraction of the training data, will not train on it, and will evaluate the loss and any model metrics on this data at the end of each epoch. The validation data is selected from the last samples in the x and y data provided, before shuffling.

validation_data: tuple (x_val, y_val) or tuple (x_val, y_val, val_sample_weights) on which to evaluate the loss and any model metrics at the end of each epoch. The model will not be trained on this data. validation_data will override validation_split.

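As far as I understand, validation_split (which is overridden by validation_data) is used as a fixed validation set that does not change, whereas the held-out set in cross-validation changes at every cross-validation step.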
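As a rough illustration of the documented validation_split behavior ("selected from the last samples ... before shuffling"), the following pure-Python sketch splits off the trailing fraction once. The function name `split_like_validation_split` is hypothetical, and the rounding rule (floor) is an assumption rather than something the quoted docs state.

```python
import math


def split_like_validation_split(samples, validation_split):
    """Return (train_part, val_part), taking the validation part from the end."""
    # Assumption: the split index is rounded down.
    split_at = int(math.floor(len(samples) * (1.0 - validation_split)))
    return samples[:split_at], samples[split_at:]


samples = list(range(10))
train_part, val_part = split_like_validation_split(samples, 0.2)
print("train:", train_part)
print("validate:", val_part)
```

The split is computed once, so the same trailing samples serve as validation data in every epoch, unlike the rotating held-out fold in cross-validation.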

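Validation is performed to make sure that the model does not overfit the dataset and that it will generalize to new data. Since the parameter grid search already performs validation, there is no need for the Keras model itself to run a validation step during training. So, to answer your questions: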

is it necessary to use validation_split or validation_data since I already do cross validation?

No, it is not necessary, as I mentioned above.

if it is not necessary, then should I set validation_split and validation_data to 0 and None, respectively?

No, you do not need to set them explicitly, because Keras does no validation by default (i.e. the fit() method already defaults to validation_split=0.0 and validation_data=None).

If I do so, what will happen during the training, would Keras just simply ignore the validation step?

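Yes, Keras will not perform any validation while training the model. Note, however, that, as I mentioned above, the grid search procedure will still perform validation to better estimate the performance of a model with a given set of parameters.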
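For reference, here is a minimal sketch of the overall workflow in which the grid search does all the validation and the test set is touched only once at the end. It assumes scikit-learn is installed. The estimator (`KNeighborsClassifier`), the parameter grid, and `cv=3` are stand-ins chosen for illustration; in the question, `clf` would be a wrapped Keras model fitted without validation_split or validation_data.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.4, random_state=999
)

# Stand-in estimator and grid (assumptions, not the asker's actual model).
clf = KNeighborsClassifier()
param_grid = {"n_neighbors": [1, 3, 5]}

# Cross-validation happens inside the grid search, on the training set only.
validator = GridSearchCV(estimator=clf, param_grid=param_grid, scoring="accuracy", cv=3)
validator.fit(X_train, y_train)

# The held-out test set is used once, to score the refitted best model.
test_accuracy = validator.score(X_test, y_test)
print("best params:", validator.best_params_)
print("test accuracy:", test_accuracy)
```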