R - two types of prediction in cross validation

Question

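When I run cross-validation on my data, it gives me two types of predictions: cvpred and Predicted. What is the difference between the two? I assume cvpred is the cross-validation prediction, but what is the other one?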

Here is some of my code:

library(DAAG)  # cv.lm() comes from the DAAG package
crossvalpredict <- cv.lm(data = total, form.lm = formula(verim ~ X4 + X4.1), m = 5)

Here is the output:

fold 1 
Observations in test set: 5 
            3    11    15    22    23
Predicted   28.02 32.21 26.53  25.1 21.28
cvpred      20.23 40.69 26.57  34.1 26.06
verim       30.00 31.00 28.00  24.0 20.00
CV residual  9.77 -9.69  1.43 -10.1 -6.06

Sum of squares = 330    Mean square = 66    n = 5 

fold 2 
Observations in test set: 5 
            2     7    21    24    25
Predicted    28.4  32.0  26.2 19.95  25.9
cvpred       52.0  81.8  36.3 14.28  90.1
verim        30.0  33.0  24.0 21.00  24.0
CV residual -22.0 -48.8 -12.3  6.72 -66.1

Sum of squares = 7428    Mean square = 1486    n = 5 

fold 3 
Observations in test set: 5 
            6    14   18    19    20
Predicted   34.48 36.93 19.0 27.79 25.13
cvpred      37.66 44.54 16.7 21.15  7.91
verim       33.00 35.00 18.0 31.00 26.00
CV residual -4.66 -9.54  1.3  9.85 18.09

Sum of squares = 539    Mean square = 108    n = 5 

fold 4 
Observations in test set: 5 
            1     4     5       9   13
Predicted   31.91 29.07  32.5 32.7685 28.9
cvpred      30.05 28.44  54.9 32.0465 11.4
verim       32.00 27.00  31.0 32.0000 30.0
CV residual  1.95 -1.44 -23.9 -0.0465 18.6

Sum of squares = 924    Mean square = 185    n = 5 

fold 5 
Observations in test set: 5 
            8    10    12     16    17
Predicted    27.8 30.28  26.0 27.856 35.14
cvpred       50.3 33.92  45.8 31.347 29.43
verim        28.0 30.00  24.0 31.000 38.00
CV residual -22.3 -3.92 -21.8 -0.347  8.57

Sum of squares = 1065    Mean square = 213    n = 5 

Overall (Sum over all 5 folds) 
 ms 
411

Answer 1

You can check this by reading the help page for the function you are using, cv.lm. There you will find this passage:

The input data frame is returned, with additional columns ‘Predicted’ (Predicted values using all observations) and ‘cvpred’ (cross-validation predictions). The cross-validation residual sum of squares (‘ss’) and degrees of freedom (‘df’) are returned as attributes of the data frame.

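This says that Predicted is the vector of predictions obtained using all observations. In other words, these appear to be the predictions made on your "training" data, that is, made "in sample".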
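The printed summary can be tied back to those columns with a little arithmetic. The sketch below uses the rounded numbers from the question's output and assumes two things: that "CV residual" is the observed verim minus cvpred, and that the overall "ms" is the summed sum of squares divided by the total number of held-out observations (25 here). Both assumptions reproduce the printed figures, but they are inferred from the output rather than taken from the package source.

```r
# Values copied from fold 1 of the question's output (already rounded there)
verim  <- c(30.00, 31.00, 28.00, 24.0, 20.00)
cvpred <- c(20.23, 40.69, 26.57, 34.1, 26.06)

# Assumed definition: CV residual = observed value - cross-validation prediction
cv_resid <- verim - cvpred
fold1_ss <- sum(cv_resid^2)
round(fold1_ss)          # matches the "Sum of squares = 330" line of fold 1

# Per-fold sums of squares and test-set sizes as printed for folds 1 to 5
fold_ss <- c(330, 7428, 539, 924, 1065)
fold_n  <- rep(5, 5)

# Assumed definition: overall ms = total sum of squares / total held-out rows
overall_ms <- sum(fold_ss) / sum(fold_n)
round(overall_ms)        # matches the "Overall ms" value of 411
```

Note that only cvpred enters these sums; the Predicted row plays no part in the reported error.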

To check whether Predicted really is the in-sample prediction for your data, fit the same model using lm:

fit <- lm(verim~X4+X4.1, data=total)

and see whether this model's predicted values:

predict(fit)

are the same as those returned by cv.lm.

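When I tried this on the iris dataset in R, the Predicted values returned by cv.lm() were the same as those from predict(lm). So in that case they are in-sample predictions, where the model is fitted and then used on the same observations.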
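To make the distinction concrete without depending on DAAG, here is a minimal sketch that builds both columns by hand in base R on iris. The formula, the seed, and the fold assignment are arbitrary choices for illustration, and the loop is a hand-rolled k-fold, not cv.lm's own implementation, so the fold membership will not match what cv.lm would pick.

```r
# Sketch only: a hand-rolled k-fold loop in base R, not DAAG's implementation
set.seed(1)                                   # arbitrary seed for repeatable folds
form <- Sepal.Length ~ Sepal.Width + Petal.Length
k <- 5
n <- nrow(iris)
fold <- sample(rep(seq_len(k), length.out = n))  # random fold label for each row

# "Predicted": one model fitted to all rows, then used on those same rows
fit_all   <- lm(form, data = iris)
predicted <- predict(fit_all)

# "cvpred": each row is predicted by a model that was fitted without its fold
cvpred <- numeric(n)
for (i in seq_len(k)) {
  test  <- fold == i
  fit_i <- lm(form, data = iris[!test, ])
  cvpred[test] <- predict(fit_i, newdata = iris[test, ])
}

# Side-by-side view of the two kinds of prediction
head(data.frame(observed = iris$Sepal.Length,
                Predicted = predicted,
                cvpred = cvpred))
```

The predicted vector comes from a single fit, while cvpred is stitched together from k different fits, which is why the two columns differ row by row.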

Answer 2

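lm() does not give "better results." I'm not sure in what sense predict() and cv.lm() are the same. predict() returns the expected value of Y for each sample, estimated from the fitted model (the covariates (X) and their corresponding estimated Beta values). Those Beta values and the model error (E) are estimated from the original data. By using predict() on that same data, you get an overly optimistic estimate of model performance. That is why it looks better. With an iterated sample hold-out technique such as cross-validation (CV), you get a better (more realistic) estimate of model performance. The least biased estimate comes from leave-one-out CV, and the estimate with the least uncertainty (in prediction error) comes from 2-fold (K=2) CV.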
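The optimism of in-sample error can be seen with a small base R sketch. The helper cv_mse() below is a hypothetical function written for this illustration (it is not part of DAAG or base R), and the mtcars model and the seed are arbitrary choices. Setting k to the number of rows gives leave-one-out CV.

```r
# Hypothetical helper: mean squared prediction error from k-fold CV of an lm
cv_mse <- function(form, data, k) {
  n    <- nrow(data)
  fold <- sample(rep(seq_len(k), length.out = n))   # random fold labels
  y    <- model.response(model.frame(form, data))   # observed response
  pred <- numeric(n)
  for (i in seq_len(k)) {
    test <- fold == i
    fit  <- lm(form, data = data[!test, , drop = FALSE])
    pred[test] <- predict(fit, newdata = data[test, , drop = FALSE])
  }
  mean((y - pred)^2)
}

set.seed(42)                       # arbitrary seed for repeatable folds
form <- mpg ~ wt + hp              # arbitrary example model on mtcars

# In-sample error: the model is scored on the rows it was fitted to
in_sample_mse <- mean(residuals(lm(form, data = mtcars))^2)

# Held-out error: 5-fold CV and leave-one-out CV (k equal to the row count)
cv5_mse <- cv_mse(form, mtcars, k = 5)
loo_mse <- cv_mse(form, mtcars, k = nrow(mtcars))

c(in_sample = in_sample_mse, cv5 = cv5_mse, loo = loo_mse)
```

For a least-squares fit the leave-one-out figure cannot be smaller than the in-sample one; the 5-fold figure depends on the random split, so it is best read as one draw rather than a fixed number.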
