Does GridSearchCV shuffle the data before creating the folds?

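I'm using sklearn's GridSearchCV to tune hyperparameters, and I'd like to know whether the dataset I pass in gets shuffled before the folds are created. I want it not to be shuffled, but I can't find anything about this in the documentation. Something like train_test_split has a boolean shuffle parameter for this.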

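By default, GridSearchCV uses a plain StratifiedKFold or KFold cross-validator (StratifiedKFold when the estimator is a classifier and the target is binary or multiclass, KFold otherwise). Both default to shuffle=False, so the folds are built from the data in the order you pass it in. The documentation for GridSearchCV's cv parameter gives some additional information.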
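A quick way to check this yourself is sketched below. It assumes GridSearchCV resolves an integer cv through sklearn.model_selection.check_cv (which is my understanding of how it works internally), and the tiny X and y arrays are made up purely for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold, check_cv

# Toy data, invented for this illustration: 10 ordered samples, 2 classes.
X = np.arange(10).reshape(-1, 1)
y = np.array([0] * 5 + [1] * 5)

# Resolve cv=5 the way a classifier / non-classifier estimator would.
cv_clf = check_cv(5, y, classifier=True)
cv_reg = check_cv(5, y, classifier=False)
print(type(cv_clf).__name__, cv_clf.shuffle)
print(type(cv_reg).__name__, cv_reg.shuffle)

# With shuffle=False (the default), KFold test folds are contiguous
# blocks taken in the original row order.
test_folds = [test.tolist() for _, test in KFold(n_splits=5).split(X)]
print(test_folds)  # [[0, 1], [2, 3], [4, 5], [6, 7], [8, 9]]
```

If the shuffle attribute prints False for both splitters, the row order you pass to fit is the order used to build the folds.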

From the documentation:

3.1.3. A note on shuffling

If the data ordering is not arbitrary (e.g. samples with the same class label are contiguous), shuffling it first may be essential to get a meaningful cross-validation result. However, the opposite may be true if the samples are not independently and identically distributed. For example, if samples correspond to news articles, and are ordered by their time of publication, then shuffling the data will likely lead to a model that is overfit and an inflated validation score: it will be tested on samples that are artificially similar (close in time) to training samples.

Some cross validation iterators, such as KFold, have an inbuilt option to shuffle the data indices before splitting them. Note that:

This consumes less memory than shuffling the data directly.

By default no shuffling occurs, including for the (stratified) K fold cross-validation performed by specifying cv=some_integer to cross_val_score, grid search, etc. Keep in mind that train_test_split still returns a random split.

The random_state parameter defaults to None, meaning that the shuffling will be different every time KFold(..., shuffle=True) is iterated. However, GridSearchCV will use the same shuffling for each set of parameters validated by a single call to its fit method.

To get identical results for each split, set random_state to an integer.
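If you would rather not rely on the defaults, you can pass a splitter object to cv explicitly. The following is a minimal sketch, assuming a synthetic dataset from make_classification and a LogisticRegression with an arbitrary grid over C; none of these choices come from the question, they are only there to make the example runnable.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, KFold

# Synthetic data, used only for illustration.
X, y = make_classification(n_samples=60, n_features=5, random_state=0)

# Explicitly unshuffled folds: contiguous blocks in the original row order.
ordered_cv = KFold(n_splits=3, shuffle=False)
ordered_folds = [test.tolist() for _, test in ordered_cv.split(X)]

# Shuffled folds made reproducible with an integer random_state.
seeded_cv = KFold(n_splits=3, shuffle=True, random_state=0)
first = [test.tolist() for _, test in seeded_cv.split(X)]
second = [test.tolist() for _, test in seeded_cv.split(X)]
print(first == second)

# Hand the unshuffled splitter to GridSearchCV instead of an integer.
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.1, 1.0, 10.0]},  # arbitrary example grid
    cv=ordered_cv,
)
search.fit(X, y)
print(search.best_params_)
```

Passing KFold(shuffle=False) spells out the no-shuffle behaviour in your own code, so it stays visible to readers and does not depend on what an integer cv resolves to.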