Performing K-fold Cross-Validation: Using Same Training Set vs. Separate Validation Set

Question

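I am building a decision tree using the Python scikit-learn framework. I currently split my training data into two sets, one for training and the other for validation (implemented via K-fold cross-validation).

To cross-validate my model, should I split my data into the two sets outlined above, or simply use the full training set? My main objective is to prevent overfitting. I have seen conflicting answers online about the use and efficacy of these two approaches.

As I understand it, K-fold cross-validation is commonly used when there is not enough data for a separate validation set. I do not have this limitation. Intuitively, I believe that using K-fold cross-validation together with a separate dataset will further reduce overfitting.

Is my supposition correct? Is there a better approach I can use to validate my model?

Split dataset approach: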

x_train, x_test, y_train, y_test = train_test_split(df[features], df["SeriousDlqin2yrs"], test_size=0.2, random_state=13)

dt = DecisionTreeClassifier(min_samples_split=20, random_state=99)
dt.fit(x_train, y_train)

scores = cross_val_score(dt, x_test, y_test, cv=10)

Training dataset approach:

x_train=df[features]
y_train=df["SeriousDlqin2yrs"]

dt = DecisionTreeClassifier(min_samples_split=20, random_state=99)
dt.fit(x_train, y_train)

scores = cross_val_score(dt, x_train, y_train, cv=10)

Answer 1

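OK, it seems you are quite confused about what validation is and what cross_val_score does. First of all, you should not use either of the approaches above. If you are not searching over hyperparameters and only want to answer the question "How good is a decision tree with min_samples_split=20 on my data?", then you should do: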

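from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
dt = DecisionTreeClassifier(min_samples_split=20, random_state=99)
scores = cross_val_score(dt, X, y, cv=10)  # X, y: the full feature matrix and labels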

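with no splitting at all. Why? Because cross_val_score does the splitting itself. It splits X and y into 10 parts and, 10 times over, fits on 9 of the parts (the training folds) and tests on the remaining part. In other words, if you do something like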

x_train=df[features]
y_train=df["SeriousDlqin2yrs"]

dt = DecisionTreeClassifier(min_samples_split=20, random_state=99)
dt.fit(x_train, y_train)  # redundant: cross_val_score refits a clone of dt on every fold

scores = cross_val_score(dt, x_train, y_train, cv=10)

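then the fit call is useless, because cross_val_score will call fit again, 10 times. Furthermore, you do not use a test set at all! In your other snippet (the split dataset approach), cross_val_score both fits and tests on the test set, which is also incorrect.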
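To make the point about splitting concrete, here is a minimal sketch of roughly what cross_val_score does for a classifier when cv=10, written as an explicit loop. The synthetic data from make_classification is only a stand-in for df[features] and df["SeriousDlqin2yrs"], and the variable names are illustrative assumptions rather than anything from the question.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for df[features] and df["SeriousDlqin2yrs"] (assumption).
X, y = make_classification(n_samples=500, n_features=8, random_state=0)

dt = DecisionTreeClassifier(min_samples_split=20, random_state=99)

# For a classifier, an integer cv is treated as stratified K-fold without shuffling.
manual_scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=10).split(X, y):
    model = clone(dt)                      # fresh, unfitted copy for every fold
    model.fit(X[train_idx], y[train_idx])  # fit on the other 9 folds
    manual_scores.append(model.score(X[test_idx], y[test_idx]))  # accuracy on the held-out fold
manual_scores = np.array(manual_scores)

auto_scores = cross_val_score(dt, X, y, cv=10)

# Fitting dt beforehand does not change the scores, because each fold refits a clone.
dt.fit(X, y)
prefit_scores = cross_val_score(dt, X, y, cv=10)

print(manual_scores.mean(), auto_scores.mean(), prefit_scores.mean())
```

All three score arrays should agree, which is why a separate fit call before cross_val_score adds nothing.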

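However, if you are trying to tune a hyperparameter, say this min_samples_split, then you should do the following (assuming your test set is large enough to be representative):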

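import numpy as np
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hold out a test set that is never touched during the parameter search
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=13)

scores = []
for param in [10, 20, 40]:
    dt = DecisionTreeClassifier(min_samples_split=param, random_state=99)
    # mean accuracy over 10 folds of the training set only
    scores.append((cross_val_score(dt, X_train, y_train, cv=10).mean(), param))

best_param = max(scores)[1]
dt = DecisionTreeClassifier(min_samples_split=best_param, random_state=99)
dt.fit(X_train, y_train)  # refit on the whole training set before testing
print(np.mean(dt.predict(X_test) == y_test))  # checking accuracy on the testing set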
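The same search can also be expressed with scikit-learn's GridSearchCV, which runs the cross-validation for each candidate value and then refits the best one on the whole training set. The following is a sketch under the same assumptions as the loop above; the synthetic data is a stand-in for the real dataset, and the 80/20 split is borrowed from the question, not prescribed by the answer.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real features and labels (assumption).
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=13
)

search = GridSearchCV(
    DecisionTreeClassifier(random_state=99),
    param_grid={"min_samples_split": [10, 20, 40]},
    cv=10,  # 10-fold cross-validation on the training set only
)
search.fit(X_train, y_train)  # by default, refits the best candidate on all of X_train

best_param = search.best_params_["min_samples_split"]
test_accuracy = search.score(X_test, y_test)  # accuracy on the held-out test set
print(best_param, test_accuracy)
```

The test set is used exactly once, after the parameter has been chosen, so the reported accuracy is not biased by the search.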

Tags: validation, statistics, machine-learning, scikit-learn, cross-validation