Splitting a data set for K-fold Cross Validation in Sci-Kit Learn

Question

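I have been assigned a task that requires creating a decision tree classifier and determining its accuracy using a training set and 10-fold cross-validation. I looked at the documentation for cross_val_predict, as I believe this is the function I need.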

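The part I'm having trouble with is splitting the data set. As far as I know, in the usual case the train_test_split() method is used to split the data set into two parts: train and test. From my understanding, for K-fold validation you need to further split the train set into K parts.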

My question is: do I need to split the data set into train and test at the start?

Answer 1

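It depends. My personal opinion is that yes, you should split your data set into a training set and a test set, and then run K-fold cross-validation on the training set. Why? Because it is useful to test your model on unseen examples after you have trained and fine-tuned it.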

But some people just do a cross-validation. Here is the workflow I often use:

from sklearn import model_selection
from sklearn.model_selection import cross_val_score

# Placeholders to supply yourself: X, Y, model, metric, my_param,
# model_with_param, model_with_best_parameters, plot_or_print_metric

# Data partition: hold out a test set that cross-validation never sees
X_train, X_test, Y_train, Y_test = model_selection.train_test_split(X, Y, test_size=0.2, random_state=21)

# Cross-validate several models on the training set to see which gives the best results
print('Start cross val')
cv_score = cross_val_score(model, X_train, Y_train, scoring=metric, cv=5)
# Then inspect the scores you just obtained using the mean, the std, or a plot
print('Mean CV-score : ' + str(cv_score.mean()))

# Then tune the hyperparameters of the best (or top-n best) model using another cross-val
for param in my_param:
    model = model_with_param  # placeholder: the model configured with `param`
    cv_score = cross_val_score(model, X_train, Y_train, scoring=metric, cv=5)
    print('Mean CV-score with param: ' + str(cv_score.mean()))

# With the best parameters found, train the final model on the full training set
model = model_with_best_parameters
model.fit(X_train, Y_train)

# And finally test the tuned model on the held-out test set
Y_pred = model.predict(X_test)
plot_or_print_metric(Y_pred, Y_test)
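
For readers who want something they can run directly, here is a minimal sketch of the same workflow with the placeholders filled in. The iris data set, the choice of `max_depth` as the only tuned hyperparameter, and the candidate values are all assumptions made for illustration; they are not part of the workflow above.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: iris stands in for your own feature matrix X and labels Y.
X, Y = load_iris(return_X_y=True)

# Hold out a test set that cross-validation never sees.
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=21, stratify=Y
)

# Assumption: max_depth is the only hyperparameter being tuned.
candidate_depths = [1, 2, 3, 5, None]
mean_scores = {}
for depth in candidate_depths:
    model = DecisionTreeClassifier(max_depth=depth, random_state=0)
    # 10-fold cross-validation on the training set only
    scores = cross_val_score(model, X_train, Y_train, scoring="accuracy", cv=10)
    mean_scores[depth] = scores.mean()

# Pick the depth with the highest mean cross-validated accuracy.
best_depth = max(mean_scores, key=mean_scores.get)

# Refit on the full training set, then score once on the held-out test set.
final_model = DecisionTreeClassifier(max_depth=best_depth, random_state=0)
final_model.fit(X_train, Y_train)
test_accuracy = accuracy_score(Y_test, final_model.predict(X_test))

print("Best max_depth:", best_depth)
print("Test accuracy:", test_accuracy)
```

The test set is touched exactly once, at the very end, so the reported test accuracy is not influenced by the hyperparameter search.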

Answer 2

Short answer: No

Long answer: if you want to use K-fold validation, you usually do not split into train/test first.

There are many ways to evaluate a model. The simplest is to use a train/test split: fit the model on the train set and evaluate it on the test set.

If you adopt cross-validation instead, the fitting and evaluation are done directly during each fold/iteration.
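
As a rough sketch of what that looks like for the task in the question (a decision tree scored with 10-fold cross-validation), the snippet below runs cross-validation on the whole data set with no prior train/test split. The iris data set is an assumption used only to make the example self-contained.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_predict, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Assumption: iris stands in for your own data.
X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(random_state=0)

# With an integer cv and a classifier, scikit-learn uses stratified K-fold splitting.
fold_scores = cross_val_score(clf, X, y, cv=10, scoring="accuracy")
print("Accuracy per fold:", fold_scores)
print("Mean accuracy:", fold_scores.mean())

# cross_val_predict returns, for every sample, the prediction made
# while that sample was in the held-out fold.
y_pred = cross_val_predict(clf, X, y, cv=10)
pooled_accuracy = accuracy_score(y, y_pred)
print("Accuracy from pooled out-of-fold predictions:", pooled_accuracy)
```

`cross_val_score` gives one score per fold, while `cross_val_predict` gives one prediction per sample. The mean of the fold scores and the pooled accuracy can differ slightly when the folds are not all the same size.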

Which one to choose is up to you, but I would go with K-Folds or LOOCV (leave-one-out cross-validation).
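
For comparison, LOOCV can be sketched with scikit-learn's `LeaveOneOut` splitter, which is equivalent to K-Folds with K equal to the number of samples. The data set and classifier settings here are assumptions for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Assumption: iris stands in for your own data.
X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(random_state=0)

# One fit per sample: each iteration trains on n-1 samples and tests on the remaining one.
loo_scores = cross_val_score(clf, X, y, cv=LeaveOneOut(), scoring="accuracy")

# Each individual score is 0.0 or 1.0, so the mean is the overall accuracy.
print("Number of fits:", len(loo_scores))
print("LOOCV accuracy:", loo_scores.mean())
```

Because LOOCV fits the model once per sample, it gets expensive on large data sets, which is one practical reason to prefer K-Folds with a moderate K.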

The figure summarizes the K-Folds procedure (for K=5):

[Figure: K-Folds cross-validation with K=5]
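
The fold rotation for K=5 can also be seen by printing the indices that scikit-learn's `KFold` produces. This is a small sketch on ten toy samples (an assumption chosen to keep the output short).

```python
import numpy as np
from sklearn.model_selection import KFold

# Assumption: ten toy samples, just to make the index sets easy to read.
X = np.arange(10).reshape(-1, 1)

kf = KFold(n_splits=5)  # no shuffling, so folds are consecutive blocks
folds = []
for fold, (train_idx, test_idx) in enumerate(kf.split(X), start=1):
    folds.append((train_idx, test_idx))
    print(f"Fold {fold}: train={train_idx.tolist()} test={test_idx.tolist()}")
```

Each sample lands in the test part of exactly one fold and in the training part of the other four, which is why no separate train/test split is required for the cross-validated estimate itself.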
