What does KFold in Python exactly do?

Question

I am going through this tutorial: https://www.dataquest.io/mission/74/getting-started-with-kaggle

I have reached part 9, making predictions. There is some data in a dataframe called titanic, which is then split into folds like this:

# Generate cross validation folds for the titanic dataset. It returns the row indices corresponding to train and test.
# We set random_state to ensure we get the same splits every time we run this.
kf = KFold(titanic.shape[0], n_folds=3, random_state=1)

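I am not sure what exactly this does or what kind of object kf is. I tried reading the documentation, but it did not help much. Also, there are three folds (n_folds=3), so why does the following line only access train and test (and how do I know they are called train and test)?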

for train, test in kf:

Answer 1

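KFold provides train/test indices to split the data into training and test sets. It splits the dataset into k consecutive folds (without shuffling by default). Each fold is then used once as the validation set, while the k - 1 remaining folds form the training set (source).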

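Say you have some data with indices from 1 to 10. With n_folds=k, in the i-th iteration (i <= k) you get the i-th fold as the test indices, and the remaining k-1 folds (everything except the i-th fold) together as the training indices.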

An example

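import numpy as np
# sklearn >= 0.18; older releases used sklearn.cross_validation
from sklearn.model_selection import KFold

x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])
kf = KFold(n_splits=3)

# split() yields one (train indices, test indices) pair per fold
for fold, (train_index, test_index) in enumerate(kf.split(x), start=1):
    print(f"Fold {fold}:", train_index, test_index)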

Output

Fold 1: [ 4 5 6 7 8 9 10 11] [0 1 2 3]

Fold 2: [ 0 1 2 3 8 9 10 11] [4 5 6 7]

Fold 3: [0 1 2 3 4 5 6 7] [ 8 9 10 11]
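
To connect this to the question: kf is just a splitter object, and train_index / test_index are ordinary loop variable names for the two NumPy index arrays produced on each iteration, so you can call them whatever you like. A minimal sketch of using them to slice the data, assuming the current sklearn.model_selection API (the array x mirrors the example above):

```python
import numpy as np
from sklearn.model_selection import KFold

x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])
kf = KFold(n_splits=3)

folds = []
for train_index, test_index in kf.split(x):
    # Each index array is a NumPy integer array; use it to pick the rows.
    x_train, x_test = x[train_index], x[test_index]
    folds.append((x_train, x_test))
    print("train:", x_train, "test:", x_test)
```

There are three folds, but each iteration only ever hands back two things: the indices of the one held-out fold and the indices of everything else.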

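Import update for sklearn 0.20:

KFold now lives in the sklearn.model_selection module (available since 0.18); the old sklearn.cross_validation module was removed in version 0.20. To import KFold in sklearn 0.20+, use from sklearn.model_selection import KFold. The constructor takes n_splits instead of n_folds, and you iterate over kf.split(X) rather than over kf itself. KFold current documentation source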
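
As a sketch of what the tutorial line from the question might look like with the current API: the dataset size is no longer passed to the constructor, n_folds became n_splits, and random_state only has an effect together with shuffle=True (newer releases reject a random_state without shuffling). The small titanic frame below is a made-up stand-in, not the tutorial's data.

```python
import pandas as pd
from sklearn.model_selection import KFold

# Hypothetical stand-in for the tutorial's `titanic` DataFrame.
titanic = pd.DataFrame({
    "Age": [22, 38, 26, 35, 35, 54, 2, 27, 14],
    "Survived": [0, 1, 1, 1, 0, 0, 0, 1, 1],
})

# shuffle=True is required for random_state to matter.
kf = KFold(n_splits=3, shuffle=True, random_state=1)

splits = []
for train, test in kf.split(titanic):
    # `train` and `test` are positional row indices, so use .iloc.
    train_rows = titanic.iloc[train]
    test_rows = titanic.iloc[test]
    splits.append((train, test))
    print(len(train_rows), len(test_rows))
```

Drop both shuffle and random_state to keep the contiguous, unshuffled folds that the original line produced.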

Answer 2

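Sharing the theory I have learned about KFold so far.

KFold is a model validation technique that does not use your pre-trained model. Instead, it takes only the hyperparameters, trains a new model on k-1 folds of the data, and tests that same model on the k-th fold.

The K different models are used only for validation.

It returns K different scores (accuracy percentages), one for each held-out test fold. We generally take their average to assess the model.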
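
A minimal sketch of that idea, assuming a synthetic dataset from make_classification and a LogisticRegression as the model under test (both are illustrative choices, not from the original answer). clone creates a fresh, unfitted copy with the same hyperparameters for each fold:

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

# Synthetic data, used here only for illustration.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
template = LogisticRegression(max_iter=1000)  # only its hyperparameters are reused

kf = KFold(n_splits=5)
scores = []
for train_index, test_index in kf.split(X):
    model = clone(template)                    # new, unfitted model
    model.fit(X[train_index], y[train_index])  # train on k-1 folds
    # accuracy on the held-out fold
    scores.append(model.score(X[test_index], y[test_index]))

mean_score = np.mean(scores)
print(scores, mean_score)
```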

We repeat this process for every model we want to analyse. The algorithm in brief:

1. Split the data into training and test parts.
2. Train the different models, e.g. SVM, RF, LR, on this training data.

   2.a Take the whole data set and divide it into K folds.
   2.b Create a new model with the hyperparameters obtained from the earlier training.
   2.c Fit the newly created model on K-1 folds.
   2.d Test it on the K-th fold.
   2.e Take the average score.

3. Compare the average scores and select the best model among SVM, RF and LR.
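
The comparison loop can be sketched with cross_val_score, which performs steps 2.a to 2.d for one estimator and returns the K per-fold scores. The dataset and the mostly default hyperparameters below are assumptions for illustration; the original answer does not specify them.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.svm import SVC

# Synthetic data, used here only for illustration.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# The same folds are reused for every model so the scores are comparable.
kf = KFold(n_splits=5, shuffle=True, random_state=0)

models = {
    "SVM": SVC(),
    "RF": RandomForestClassifier(n_estimators=50, random_state=0),
    "LR": LogisticRegression(max_iter=1000),
}

mean_scores = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=kf)  # K accuracy values
    mean_scores[name] = scores.mean()             # step 2.e

best_name = max(mean_scores, key=mean_scores.get)
print(mean_scores, "best:", best_name)
```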

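The reason for doing this is simple: we usually do not have enough data, and if we divide the whole data set into:

Training
Validation
Testing

we may be left with relatively small chunks of data, which may make our model overfit. It is also possible that some of the data is never touched during training, so we never analyse the model's behaviour against that data.

KFold overcomes both of these issues.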
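
A quick check of the second point, as a sketch with an arbitrary sample count: across the K iterations every sample lands in the test fold exactly once and in the training set K-1 times, so no data is left untouched.

```python
import numpy as np
from sklearn.model_selection import KFold

n_samples, k = 20, 4  # arbitrary illustrative values
test_counts = np.zeros(n_samples, dtype=int)
train_counts = np.zeros(n_samples, dtype=int)

for train_index, test_index in KFold(n_splits=k).split(np.zeros(n_samples)):
    train_counts[train_index] += 1
    test_counts[test_index] += 1

print(test_counts)   # how often each sample was held out
print(train_counts)  # how often each sample was trained on
```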

Answer 3

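The procedure has a single parameter called k that refers to the number of groups a given data sample is to be split into. As such, the procedure is often called k-fold cross-validation. When a specific value for k is chosen, it may be used in place of k in the reference to the model, such as k=10 becoming 10-fold cross-validation.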
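
As a small illustration, assuming 25 samples and k=10 (arbitrary values): when the sample count is not divisible by k, scikit-learn's KFold makes the first n_samples % k folds one sample larger than the rest.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(25)        # 25 samples, an arbitrary choice
kf = KFold(n_splits=10)  # 10-fold cross-validation

# Size of the held-out fold in each of the 10 iterations.
fold_sizes = [len(test_index) for _, test_index in kf.split(X)]
print(fold_sizes)
```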

You can refer to this post for more information: https://medium.com/@xzz201920/stratifiedkfold-v-s-kfold-v-s-stratifiedshufflesplit-ffcae5bfdf

python

scikit-learn

cross-validation

kaggle