R - k-fold cross-validation for linear regression with standard error of estimate

Question

I want to perform k-fold cross-validation on a linear regression model in R and apply the one-standard-error rule:

https://stats.stackexchange.com/questions/17904/one-standard-error-rule-for-variable-selection

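Therefore, I need a function that returns the cross-validation estimate of the prediction error together with the standard error of that estimate (or at least the MSE of each fold, so that I can compute the standard error myself). Many packages have functions that compute the cross-validation error (for example, cv.glm in the boot package), but they usually return only the CV estimate of the prediction error, not its standard error or the per-fold MSE.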

I tried the DAAG package, whose function CVlm is supposed to give richer output than cv.glm. However, I cannot get it to work. Here is my code:

library(DAAG)

a <- c(0.0056, 0.0088, 0.0148, 0.0247, 0.0392, 0.0556, 0.0632, 0.0686, 0.0786, 0.0855, 0.0937)
b <- c(6.0813, 9.5011, 15.5194, 23.9409, 32.8492, 40.8399, 43.8760, 45.5270, 46.7668, 46.1587, 43.4524)
dataset <- data.frame(x = a, y = b)
CV.list <- CVlm(df = dataset, form.lm = formula(y ~ poly(x, 2)), m = 5)

I get this nearly uninformative error:

Error in xy.coords(x, y, xlabel, ylabel, log) : 
'x' and 'y' lengths differ

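This makes little sense to me. x and y have the same length (11), so the function is clearly complaining about some other x and y variables that it creates internally.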

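I would gladly accept solutions that use other packages (for example, caret). It would also be great if I could specify the number of repetitions for the k-fold cross-validation.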

Answer 1

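CVlm does not like the poly(x,2) in your formula. You can easily avoid the problem by first adding the result of poly(x,2) to your data frame and then calling CVlm on these new variables: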

dataset2 <- cbind(dataset,poly(dataset$x,2))
names(dataset2)[3:4] <- c("p1","p2")
CV.list=CVlm(df=dataset2,form.lm = formula(y ~ p1+p2))

If you are interested in the printed values, which unfortunately are not stored anywhere, you can use something like this:

# capture the printed output; assigning with <- inside the call also keeps CV.list
printOut <- capture.output(CV.list <- CVlm(df=dataset2,form.lm = formula(y ~ p1+p2)))

# function to parse the output 
# to be adapted if necessary for your needs
GetValues <- function(itemName,printOut){
    line <- printOut[grep(itemName,printOut)]
    items <- unlist(strsplit(line,"[=]|  +"))
    itemsMat <- matrix(items,ncol=2,byrow=TRUE)
    vectVals <- as.numeric(itemsMat[grep(itemName,itemsMat[,1]),2])
    return(vectVals)
}

# get the Mean square values as a vector
MS <- GetValues("Mean square",printOut)
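
If parsing printed output feels fragile, the per-fold errors can also be computed directly in base R. What follows is a minimal sketch and not part of DAAG, boot, or caret: `kfold_lm` and its arguments (`k`, `repeats`, `seed`) are hypothetical names chosen for illustration. It assumes one common convention for the standard error, namely the standard deviation of the fold MSEs divided by sqrt(k), computed separately for each repeat.

```r
# Hypothetical helper (not from any package): repeated k-fold CV for lm().
# Returns the MSE of every fold plus a CV estimate and SE for each repeat.
kfold_lm <- function(formula, data, k = 5, repeats = 1, seed = 1) {
  set.seed(seed)                         # arbitrary seed, for reproducibility
  n <- nrow(data)
  response <- all.vars(formula)[1]       # name of the response column
  fold_mse <- matrix(NA_real_, nrow = repeats, ncol = k)
  for (r in seq_len(repeats)) {
    # randomly assign each row to one of k folds of nearly equal size
    fold_id <- sample(rep(seq_len(k), length.out = n))
    for (j in seq_len(k)) {
      test <- fold_id == j
      fit  <- lm(formula, data = data[!test, , drop = FALSE])
      pred <- predict(fit, newdata = data[test, , drop = FALSE])
      fold_mse[r, j] <- mean((data[[response]][test] - pred)^2)
    }
  }
  list(
    fold_mse = fold_mse,                          # repeats x k matrix
    cv_error = rowMeans(fold_mse),                # CV estimate per repeat
    cv_se    = apply(fold_mse, 1, sd) / sqrt(k)   # assumed SE convention
  )
}

# Usage with the data from the question
dataset <- data.frame(
  x = c(0.0056, 0.0088, 0.0148, 0.0247, 0.0392, 0.0556, 0.0632, 0.0686, 0.0786, 0.0855, 0.0937),
  y = c(6.0813, 9.5011, 15.5194, 23.9409, 32.8492, 40.8399, 43.8760, 45.5270, 46.7668, 46.1587, 43.4524)
)
res <- kfold_lm(y ~ poly(x, 2), dataset, k = 5, repeats = 3)
res$fold_mse   # one row per repeat, one column per fold
res$cv_error
res$cv_se
```

Because the formula is passed straight to lm(), poly(x, 2) works here without precomputing the polynomial columns. With only 11 observations the folds hold 2 or 3 points each, so the per-fold MSEs will be noisy.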

Answer 2

The mean MSE is stored as an attribute of the object returned by CVlm. attributes(CV.list)$ms gives you what you are looking for.
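
For the one-standard-error rule mentioned in the question, a single overall value is not enough; the rule needs a per-fold error for each candidate model. The sketch below shows one way this could look in base R. It rests on several assumptions: the candidate models are polynomial degrees 1 to 3 (an illustrative choice, ordered from simplest to most complex), every candidate is evaluated on the same fold assignment, and the standard error is taken as the standard deviation of the fold MSEs divided by sqrt(k). The variable names are hypothetical.

```r
# Data from the question
dataset <- data.frame(
  x = c(0.0056, 0.0088, 0.0148, 0.0247, 0.0392, 0.0556, 0.0632, 0.0686, 0.0786, 0.0855, 0.0937),
  y = c(6.0813, 9.5011, 15.5194, 23.9409, 32.8492, 40.8399, 43.8760, 45.5270, 46.7668, 46.1587, 43.4524)
)

k       <- 5
degrees <- 1:3       # assumed candidates, simplest first
set.seed(42)         # arbitrary seed, for reproducibility only
fold_id <- sample(rep(seq_len(k), length.out = nrow(dataset)))

# Per-fold MSE for each candidate: rows = degrees, columns = folds
fold_mse <- t(sapply(degrees, function(d) {
  sapply(seq_len(k), function(j) {
    test <- fold_id == j
    fit  <- lm(y ~ poly(x, d), data = dataset[!test, ])
    mean((dataset$y[test] - predict(fit, newdata = dataset[test, ]))^2)
  })
}))

cv_mean <- rowMeans(fold_mse)                  # CV estimate per candidate
cv_se   <- apply(fold_mse, 1, sd) / sqrt(k)    # assumed SE convention

# One-standard-error rule: take the simplest candidate whose CV error
# is within one SE of the minimum CV error
best      <- which.min(cv_mean)
threshold <- cv_mean[best] + cv_se[best]
chosen    <- degrees[min(which(cv_mean <= threshold))]
chosen
```

The chosen degree is never more complex than the one with the lowest CV error, which is the point of the rule: it trades a small, statistically indistinguishable loss in estimated accuracy for a simpler model.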

r

standard-error

linear-regression

cross-validation

r-caret