Difference between doing cross-validation and validation_data/validation_split in Keras
First, I split the dataset into training and test sets, for example:
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.4, random_state=999)
Then I use GridSearchCV with cross-validation to find the best-performing model:
validator = GridSearchCV(estimator=clf, param_grid=param_grid, scoring="accuracy", cv=cv)
By doing this, I get:
A model is trained using k-1 of the folds as training data; the resulting
model is validated on the remaining part of the data (scikit-learn.org)
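To make the quoted behaviour concrete, here is a minimal pure-Python sketch of how k contiguous, unshuffled folds rotate the held-out part. The helper name k_fold_indices is hypothetical and is not part of scikit-learn; the library's own splitters may shuffle or stratify depending on the cv argument.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs for k contiguous, unshuffled folds."""
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        # Everything outside the current fold is used for training.
        train_idx = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train_idx, val_idx
        start += size


folds = list(k_fold_indices(n_samples=10, k=5))
for train_idx, val_idx in folds:
    print("train:", train_idx, "validate:", val_idx)
```

Each sample lands in the held-out fold exactly once, so the validation part is different at every step.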
However, when reading about the Keras fit function, the documentation introduces two more terms:
validation_split: Float between 0 and 1. Fraction of the training data
to be used as validation data. The model will set apart this fraction
of the training data, will not train on it, and will evaluate the loss
and any model metrics on this data at the end of each epoch. The
validation data is selected from the last samples in the x and y data
provided, before shuffling.
validation_data: tuple (x_val, y_val) or tuple (x_val, y_val,
val_sample_weights) on which to evaluate the loss and any model
metrics at the end of each epoch. The model will not be trained on
this data. validation_data will override validation_split.
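The two descriptions above can be mimicked with plain list slicing. This is only a sketch of the documented behaviour, not Keras code: the function name split_like_fit is hypothetical, and the rounding of the split point is an assumption.

```python
def split_like_fit(x, y, validation_split=0.0, validation_data=None):
    """Return (x_train, y_train, x_val, y_val) following the quoted descriptions."""
    if validation_data is not None:
        # validation_data overrides validation_split: all of x and y is used for training.
        x_val, y_val = validation_data[0], validation_data[1]
        return x, y, x_val, y_val
    if 0.0 < validation_split < 1.0:
        # Assumed rounding: the first int(n * (1 - split)) samples are kept for training,
        # and the last samples (taken before any shuffling) become the validation set.
        split_at = int(len(x) * (1.0 - validation_split))
        return x[:split_at], y[:split_at], x[split_at:], y[split_at:]
    # Defaults (0.0 and None): no validation set at all.
    return x, y, [], []


x = list(range(8))
y = [v % 2 for v in x]

tail_split = split_like_fit(x, y, validation_split=0.25)
override = split_like_fit(x, y, validation_split=0.25, validation_data=([100], [1]))
no_validation = split_like_fit(x, y)
print(tail_split)
print(override)
print(no_validation)
```

Unlike the rotating folds of cross-validation, the tail slice is the same for every epoch of a given fit call.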
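As I understand it, validation_split (which is overridden by validation_data) is used as a fixed validation dataset, whereas the held-out set in cross-validation changes at each cross-validation step.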
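- Question 1: Is it necessary to use validation_split or validation_data, given that I already do cross-validation?
- Question 2: If it is not necessary, should I set validation_split and validation_data to 0 and None, respectively?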
grid_result = validator.fit(train_images, train_labels, validation_data=None, validation_split=0)
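- Question 3: If I do so, what happens during training? Will Keras simply ignore the validation step?
- Question 4: Does validation_split belong to the k-1 folds or to the hold-out fold, or is it treated as a "test set" (as in cross-validation) that is never used to train the model?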
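Validation is performed to make sure the model does not overfit the dataset and will generalize to new data. Since the parameter grid search already performs validation, there is no need for the Keras model itself to run a validation step during training. So, to answer your questions: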
is it necessary to use validation_split or validation_data since I already do cross validation?
No, as I mentioned above.
if it is not necessary, then should I set validation_split and validation_data to 0 and None, respectively?
No, you do not need to set them explicitly, because by default Keras performs no validation (i.e. the fit() method defaults to validation_split=0.0, validation_data=None).
If I do so, what will happen during the training, would Keras just simply ignore the validation step?
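Yes, Keras will not perform validation while training the model. Note, however, that as I mentioned above, the grid search procedure does perform validation in order to better estimate the performance of the model with a specific set of parameters.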
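As a rough illustration of that division of labour, the sketch below runs a tiny hand-written grid search with cross-validation. It deliberately uses a toy 1-D nearest-neighbour classifier as a stand-in for the Keras model, and the helpers knn_predict and cross_val_accuracy are hypothetical, not GridSearchCV itself. The point is that the inner model never sees a validation split; the outer loop does all the scoring on the held-out fold.

```python
from collections import Counter


def knn_predict(train_x, train_y, x, n_neighbors):
    """Majority vote among the n_neighbors training points closest to x."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))[:n_neighbors]
    return Counter(train_y[i] for i in nearest).most_common(1)[0][0]


def cross_val_accuracy(xs, ys, n_neighbors, k=3):
    """Mean held-out accuracy over k interleaved folds."""
    scores = []
    for fold in range(k):
        val_idx = [i for i in range(len(xs)) if i % k == fold]
        train_idx = [i for i in range(len(xs)) if i % k != fold]
        train_x = [xs[i] for i in train_idx]
        train_y = [ys[i] for i in train_idx]
        # The model only ever sees the k-1 training folds; no inner validation split.
        correct = sum(
            knn_predict(train_x, train_y, xs[i], n_neighbors) == ys[i] for i in val_idx
        )
        scores.append(correct / len(val_idx))
    return sum(scores) / k


# Two well-separated toy clusters, labelled 0 and 1.
xs = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0, 2.0, 2.2, 2.4, 2.6, 2.8, 3.0]
ys = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]

# The "parameter grid": one cross-validated score per candidate value.
results = {n: cross_val_accuracy(xs, ys, n_neighbors=n) for n in (1, 3, 5)}
best_n = max(results, key=results.get)
print(results, "best n_neighbors:", best_n)
```

With a real Keras model wrapped for scikit-learn, the same structure applies: the cross-validation splitter supplies the held-out data for each parameter setting, so fit() can be left at its defaults.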