How does test_size relate when used in Python sklearn for 10-fold cross-validation?

Question

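I am trying to implement an ML algorithm in which I want to use a 10-fold cross-validation procedure, but I just want to confirm whether my procedure is correct.

I am doing binary classification, and each class has about 50 samples in each of the 10 folders I created, named fold 1, fold 2, and so on.

My sklearn command is: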

x_train, x_test, y_train, y_test = train_test_split(X, yy, test_size=0.3, random_state=1000)

Am I totally wrong here and this procedure is actually just doing a 30% test and 70% train process? For 10-fold cross-validation, should I use:

from sklearn.model_selection import KFold
kf = KFold(n_splits=2, random_state=42, shuffle=True)

Thanks!

Answer 1

Am I totally wrong here and this procedure is actually just doing a 30% test and 70% train process?

Yes, setting test_size=0.3 gives you a 30% test set and a 70% training set. We know this from reading the documentation:

test_size float or int, default=None

If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the test split
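As a quick check of the documented behavior, here is a minimal sketch on hypothetical toy data (100 samples, roughly matching two classes of about 50 each). The arrays X and y below are stand-ins, not the data from the question.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 samples, 2 features, balanced binary labels.
X = np.arange(200).reshape(100, 2)
y = np.array([0, 1] * 50)

# A float test_size is read as a proportion of the dataset.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1000
)
float_sizes = (len(X_train), len(X_test))

# An int test_size is read as an absolute number of test samples.
X_train2, X_test2, y_train2, y_test2 = train_test_split(
    X, y, test_size=30, random_state=1000
)
int_sizes = (len(X_train2), len(X_test2))

print(float_sizes, int_sizes)
```

Either way this is a single hold-out split: one training set and one test set, with no folds involved.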

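If you repeat this 10 times with different random_state values, some elements will appear in the test set more than once across the 10 repetitions. The purpose of k-fold cross-validation is to create k disjoint groups, each of which is used in turn as the holdout. Your procedure is not cross-validation, because the test sets it produces can never all be disjoint: ten test sets of 30% each add up to 300% of the data, so some samples must be reused (you can prove this with the pigeonhole principle).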
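The overlap can be seen directly by counting how often each sample lands in a test set. This is a small sketch on 100 hypothetical sample indices; the seeds 0 to 9 are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold, train_test_split

indices = np.arange(100)  # hypothetical sample indices

# Ten repeated 70/30 splits with different seeds.
counts_repeated = np.zeros(100, dtype=int)
for seed in range(10):
    _, test_idx = train_test_split(indices, test_size=0.3, random_state=seed)
    counts_repeated[test_idx] += 1

# 10-fold cross-validation: the test folds partition the data.
counts_kfold = np.zeros(100, dtype=int)
kf = KFold(n_splits=10, shuffle=True, random_state=42)
for _, test_idx in kf.split(indices):
    counts_kfold[test_idx] += 1

# 300 test slots are spread over 100 samples, so at least one sample
# is tested 3 or more times (pigeonhole principle).
print(counts_repeated.sum(), counts_repeated.max())

# With KFold every sample is tested exactly once.
print(counts_kfold.min(), counts_kfold.max())
```

Repeated random splits are a legitimate procedure in their own right (sklearn offers ShuffleSplit for that), but they are not k-fold cross-validation.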

kf = KFold(n_splits=2, random_state=42, shuffle=True)

This is not 10-fold CV, because n_splits=2. We know this from reading the documentation: the n_splits parameter is the number of folds. You said you want 10 folds, so set n_splits=10.
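A minimal sketch of 10-fold cross-validation with n_splits=10 might look like the following. The synthetic data from make_classification and the LogisticRegression model are assumptions chosen only for illustration; substitute your own X, yy, and estimator.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Hypothetical stand-in for the real data: 100 samples, binary labels.
X, y = make_classification(n_samples=100, n_features=5, random_state=0)

# Ten folds: each sample is held out exactly once.
kf = KFold(n_splits=10, shuffle=True, random_state=42)
fold_sizes = [(len(train_idx), len(test_idx)) for train_idx, test_idx in kf.split(X)]

# One accuracy score per fold, using the same splitter.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=kf)

print(fold_sizes[0], len(scores), scores.mean())
```

Note that test_size plays no role here: with 10 folds, each test fold is one tenth of the data by construction. For classification, StratifiedKFold is a common alternative that keeps the class ratio similar in every fold.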

training-data

python-3.x

sklearn-pandas