Disadvantages of train-test split

Question

"Train/test split does have its dangers: what if the split we make isn’t random? What if one subset of our data has only people from a certain state, employees with a certain income level but not other income levels, only women or only people at a certain age? (imagine a file ordered by one of these). This will result in overfitting, even though we’re trying to avoid it! This is where cross validation comes in."

The above is what most blogs say, and I don't understand it. I think the disadvantage is underfitting, not overfitting. Suppose that when we split the data, states A and B end up in the training set and we then try to predict state C, which is completely different from the training data. That would lead to underfitting. Can someone explain why most blogs say the train/test split results in overfitting?

Answer 1

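It would be more accurate to talk about selection bias, which is what your question describes.

Selection bias is not really tied to overfitting. It is tied to fitting a biased set, so the model will not be able to generalize or predict correctly.

In other words, whether you "fit" or "overfit" a biased training set, the result is still wrong.

The semantic stress on the "over" prefix is just that: it implies bias.

Imagine you have no selection bias. In that case, when you overfit even a healthy set, you will, by the definition of overfitting, still make the model biased towards your training set.

Here, your starting training set is already biased. So any fitting, even a "correct fitting", will be biased, just as happens with overfitting.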
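
To make the "file ordered by one variable" case concrete, here is a minimal sketch on synthetic data. Everything in it is an assumption chosen for illustration: the quadratic target, the sorted feature x, and the helper name holdout_r2. It compares an unshuffled split, where the test set is simply the last rows of the sorted file, with a shuffled one.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical "file ordered by one variable": x is sorted, so the last
# third of the rows covers a range the first two thirds never see.
x = np.sort(rng.uniform(0.0, 3.0, size=300))
X = x.reshape(-1, 1)
y = x ** 2 + rng.normal(scale=0.1, size=x.shape)


def holdout_r2(shuffle):
    """Fit a linear model on the training part and score it on the held-out part."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=100, shuffle=shuffle, random_state=0
    )
    model = LinearRegression().fit(X_tr, y_tr)
    return r2_score(y_te, model.predict(X_te))


r2_ordered = holdout_r2(shuffle=False)   # test set = last 100 rows (largest x)
r2_shuffled = holdout_r2(shuffle=True)   # test set drawn from the whole range
print(f"ordered split R^2:  {r2_ordered:.2f}")
print(f"shuffled split R^2: {r2_shuffled:.2f}")
```

The same model class is used in both cases. Only the way the rows were selected for training changes, which is why the poor score on the ordered split is better described as selection bias than as over- or underfitting.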

Answer 2

Actually, a train/test split does involve some randomness. See scikit-learn's train_test_split below:

from sklearn.model_selection import train_test_split

train_set, test_set = train_test_split(data, test_size=0.2, random_state=42)

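Here, to build some initial intuition, you can change random_state to different integers and train the model several times to see whether you get comparable test accuracy on each run. If the dataset is small (on the order of 100s of samples), the test accuracy may vary considerably. With a larger dataset (on the order of 10000s), the test accuracy becomes more or less the same across runs, because the training set will include at least some examples of every kind of sample.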
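
The experiment described above could be sketched roughly as follows. The synthetic dataset from make_classification, the logistic regression model, and the helper name accuracy_spread are assumptions made for illustration, not part of the original answer.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


def accuracy_spread(n_samples, seeds=range(20)):
    """Standard deviation of test accuracy over several random_state values."""
    # Synthetic data, assumed here only for illustration.
    X, y = make_classification(n_samples=n_samples, n_features=10, random_state=0)
    scores = []
    for seed in seeds:
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.2, random_state=seed
        )
        model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
        scores.append(model.score(X_te, y_te))
    return np.std(scores)


spread_small = accuracy_spread(100)      # 20 test samples per split
spread_large = accuracy_spread(10_000)   # 2000 test samples per split
print(f"std of test accuracy, n=100:   {spread_small:.3f}")
print(f"std of test accuracy, n=10000: {spread_large:.3f}")
```

If the spread across seeds is large, a single split is probably not telling you much about the model.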

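Of course, cross-validation is meant to minimize the effect of overfitting and make the results more generalizable. But for very large datasets, cross-validation is really expensive.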

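Answer 3

Calling the "train_test_split" function once on a dataset does not necessarily introduce bias. What I mean is that by choosing a value for the function's "random_state" parameter, you get a different pair of training and test sets. Suppose you have a dataset and, after applying train_test_split and training your model, you get a low accuracy score on the test data. If you change the random_state value and retrain, you will get a different accuracy score. So you may be tempted to search for the random_state value that trains your model most accurately. Well, guess what? You have just introduced bias into your model: you have found a training set that trains the model so that it performs best on that particular test set.

When we use something like KFold cross-validation, on the other hand, we break the dataset into five or ten (depending on its size) pairs of training and test sets. Each time we train the model we see a different score, and the average of all those scores is likely a more realistic estimate of how the model performs when trained on the whole dataset. It looks like this: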

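import numpy as np
from sklearn import metrics
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

# X (features) and y (target) are assumed to be an existing pandas DataFrame and Series.
# shuffle and random_state must be passed by keyword in recent scikit-learn versions.
kfold = KFold(n_splits=5, shuffle=True, random_state=1)
R_2 = []
for train_index, test_index in kfold.split(X):
    # kfold.split yields positional indices, so use .iloc rather than .loc
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    model = LinearRegression().fit(X_train, y_train)

    r2 = metrics.r2_score(y_test, model.predict(X_test))
    R_2.append(r2)

# Average R^2 across the five folds
R_2mean = np.mean(R_2)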
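
The earlier point about hunting for the "best" random_state can also be illustrated with a rough sketch. The data here are an assumption made up for the demonstration: features and labels are generated independently, so no model can genuinely do better than about 50% accuracy. Any score well above that comes from picking a lucky split.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score, train_test_split

rng = np.random.default_rng(0)

# Features and labels are independent, so the true accuracy is about 0.5.
X = rng.normal(size=(200, 5))
y = rng.integers(0, 2, size=200)

model = LogisticRegression()

# Try many seeds and keep the best test score (the tempting but biased approach).
scores_by_seed = []
for seed in range(50):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    scores_by_seed.append(model.fit(X_tr, y_tr).score(X_te, y_te))
best_seed_score = max(scores_by_seed)

# Average over 5 folds instead of picking a single favourable split.
cv = KFold(n_splits=5, shuffle=True, random_state=1)
cv_mean = cross_val_score(model, X, y, cv=cv).mean()

print(f"best single-split accuracy: {best_seed_score:.2f}")
print(f"5-fold mean accuracy:       {cv_mean:.2f}")
```

cross_val_score is used only as a shorter way to write the same kind of loop as in the KFold snippet above.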
