KFolds Cross Validation vs train_test_split

Question

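I built my first random forest classifier today and I am trying to improve its performance. I have been reading about how important cross-validation is for avoiding overfitting and therefore getting better results. I implemented StratifiedKFold with sklearn; surprisingly, though, this approach turned out to be less accurate. I have read numerous posts suggesting that cross-validation is much more efficient than train_test_split.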

Estimator:

rf = RandomForestClassifier(n_estimators=100, random_state=42)

K-Fold:

ss = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
for train_index, test_index in ss.split(features, labels):
    train_features, test_features = features[train_index], features[test_index]
    train_labels, test_labels = labels[train_index], labels[test_index]

TTS:

train_feature, test_feature, train_label, test_label = \
    train_test_split(features, labels, train_size=0.8, test_size=0.2, random_state=42)

The results are as follows:

CV:

AUROC:  0.74
Accuracy Score:  74.74 %.
Specificity:  0.69
Precision:  0.75
Sensitivity:  0.79
Matthews correlation coefficient (MCC):  0.49
F1 Score:  0.77

TTS:

AUROC:  0.76
Accuracy Score:  76.23 %.
Specificity:  0.77
Precision:  0.79
Sensitivity:  0.76
Matthews correlation coefficient (MCC):  0.52
F1 Score:  0.77

Can this really be possible? Or have I set up my models incorrectly?

Also, is this the correct way of using cross-validation?

Answer 1

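Good to see that you have read up on the topic!

The difference arises because the TTS approach introduces bias, since you are not using all of your observations for testing.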

In the validation approach, only a subset of the observations—those that are included in the training set rather than in the validation set—are used to fit the model. Since statistical methods tend to perform worse when trained on fewer observations, this suggests that the validation set error rate may tend to overestimate the test error rate for the model fit on the entire data set.

The results can also vary quite a lot:

the validation estimate of the test error rate can be highly variable, depending on precisely which observations are included in the training set and which observations are included in the validation set
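The variability described in the quote can be checked empirically. The following is only a sketch: it uses a synthetic dataset from make_classification as a stand-in for the asker's data (which is not shown), and the names split_accuracies and seed are illustrative. It repeats a single 80/20 split with different seeds and reports how much the accuracy moves.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real features/labels (assumption, not the asker's data).
features, labels = make_classification(n_samples=500, n_features=20, random_state=0)

split_accuracies = []
for seed in range(10):
    # Only the split changes between iterations; the model settings stay fixed.
    X_tr, X_te, y_tr, y_te = train_test_split(
        features, labels, test_size=0.2, random_state=seed
    )
    rf = RandomForestClassifier(n_estimators=50, random_state=42)
    rf.fit(X_tr, y_tr)
    split_accuracies.append(rf.score(X_te, y_te))

split_accuracies = np.array(split_accuracies)
print(f"min={split_accuracies.min():.3f}  max={split_accuracies.max():.3f}  "
      f"std={split_accuracies.std():.3f}")
```

If the spread between the minimum and maximum is comparable to the gap between your CV and TTS numbers, a single split is not enough to tell the two apart.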

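Cross-validation addresses this problem by using all of the available data for both fitting and testing, which removes most of this bias.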
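As a minimal sketch of what this looks like in practice: the model is refitted on every training fold, scored on the matching held-out fold, and the scores are then averaged. The dataset below is synthetic (an assumption, since the question does not show the data); the estimator and splitter settings are copied from the question. cross_validate is one way to do this in scikit-learn, not necessarily how the question's metrics were computed.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic stand-in for the real features/labels (assumption).
features, labels = make_classification(n_samples=500, n_features=20, random_state=42)

rf = RandomForestClassifier(n_estimators=100, random_state=42)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

# cross_validate clones rf, fits it on each training fold,
# and scores it on the corresponding held-out fold.
scores = cross_validate(rf, features, labels, cv=cv, scoring=["accuracy", "roc_auc"])

acc = scores["test_accuracy"]
auc = scores["test_roc_auc"]
print(f"Accuracy: {acc.mean():.3f} +/- {acc.std():.3f}")
print(f"AUROC:    {auc.mean():.3f} +/- {auc.std():.3f}")
```

Reporting the standard deviation next to the mean makes it easier to judge whether a two-point difference from a single train_test_split is meaningful.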

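Here, the results from your TTS approach carry more bias, which should be kept in mind when analysing them. You may also simply have been lucky with the test/validation set that was sampled.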

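For more on this topic, here is a great, beginner-friendly article that covers it in detail: https://codesachin.wordpress.com/2015/08/30/cross-validation-and-the-bias-variance-tradeoff-for-dummies/

For a more in-depth source, see the "Model Assessment and Selection" chapter here (the source of the quoted content):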

https://web.stanford.edu/~hastie/Papers/ESLII.pdf

Answer 2

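Cross-validation tends to correct for selection bias in your data. So, for example, if you focus on the AUC metric and get a lower AUC score with the TTS approach, it means there is a bias in your TTS.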

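You may want to run an analysis to track down this bias (for example, look more closely at date features to make sure you are not using the future to predict the past, or try to find any kind of leakage in the data that is tied to the business logic).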
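If the data does turn out to have a time dimension (the question does not say so, so this is purely hypothetical), one way to avoid using the future to predict the past is a splitter that keeps every training index earlier than every test index. The sketch below uses scikit-learn's TimeSeriesSplit on a toy array that is assumed to be sorted by date, oldest row first.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Hypothetical data: 20 rows already sorted by date, oldest first.
X = np.arange(20).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=4)
folds = list(tscv.split(X))

for train_idx, test_idx in folds:
    # Each training window ends before its test window begins.
    print(f"train rows 0..{train_idx.max()}  ->  test rows {test_idx.min()}..{test_idx.max()}")
```

A shuffled StratifiedKFold or a random train_test_split would mix earlier and later rows, which is exactly the kind of leakage this check is meant to expose.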

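Overall, the difference in scores does not look that large to me, so I would not worry too much. The code looks fine, and a difference like this in the scores is possible.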

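By the way, you have not described the problem or the data at all, but you used Stratified KFold CV, so I assume you have an imbalanced dataset; if not, ordinary KFold CV could be worth trying. In your TTS you do not have class balancing implemented, whereas Stratified CV takes care of it.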
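To put the single split on the same footing as the stratified folds, train_test_split accepts a stratify argument. The sketch below assumes an imbalanced synthetic dataset (roughly 90/10, not the asker's data) and compares the positive-class share of the test set with and without stratification; the variable names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Imbalanced synthetic stand-in (assumption): about 10% positives.
features, labels = make_classification(
    n_samples=1000, n_features=20, weights=[0.9, 0.1], random_state=42
)

# Plain random split: the class ratio in the test set is left to chance.
_, _, _, y_plain = train_test_split(
    features, labels, test_size=0.2, random_state=42
)

# Stratified split: the class ratio of labels is preserved in both parts.
_, _, _, y_strat = train_test_split(
    features, labels, test_size=0.2, random_state=42, stratify=labels
)

print(f"positive share overall:     {labels.mean():.3f}")
print(f"positive share, plain TTS:  {y_plain.mean():.3f}")
print(f"positive share, stratified: {y_strat.mean():.3f}")
```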
