Is there a simple command to do leave-one-out cross validation with the lm() function?
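Is there a simple command in R to do leave-one-out cross validation with the lm() function?
Specifically, is there a simple command that does what the code below does?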
x <- rnorm(1000, 3, 2)
y <- 2*x + rnorm(1000)
pred_error_sq <- 0
for (i in 1:1000) {
  x_i <- x[-i]
  y_i <- y[-i]
  mdl <- lm(y_i ~ x_i) # leave i'th observation out
  y_pred <- predict(mdl, data.frame(x_i = x[i])) # predict i'th observation
  pred_error_sq <- pred_error_sq + (y[i] - y_pred)^2 # accumulate squared prediction errors
}
y_squared <- sum((y - mean(y))^2) # total variation of the data (total sum of squares)
R_squared <- 1 - (pred_error_sq/y_squared) # measure of goodness of fit
You can try cv.lm from the DAAG package:
cv.lm(data = DAAG::houseprices, form.lm = formula(sale.price ~ area),
m = 3, dots = FALSE, seed = 29, plotit = c("Observed","Residual"),
main="Small symbols show cross-validation predicted values",
legend.pos="topleft", printit = TRUE)
Arguments
data: a data frame
form.lm: a formula or lm call or lm object
m: the number of folds
dots: uses pch=16 for the plotting character
seed: random number generator seed
plotit: one of the text strings "Observed" or "Residual", or a logical value. The logical TRUE is equivalent to "Observed", while FALSE is equivalent to "" (no plot)
main: main title for graph
legend.pos: position of legend, one of "bottomright", "bottom", "bottomleft", "left", "topleft", "top", "topright", "right", "center"
printit: if TRUE, output is printed to the screen
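To make the role of m concrete: with m folds, every observation is predicted by a model fitted to the other folds, and setting m to the number of rows turns this into leave-one-out. The sketch below is plain base R and is not DAAG's implementation; kfold_mse is a hypothetical helper name, and the built-in cars data set is used only for illustration.

```r
# Hypothetical helper (not part of DAAG): m-fold cross-validated MSE for lm().
kfold_mse <- function(formula, data, m, seed = 29) {
  set.seed(seed)
  n <- nrow(data)
  # Randomly assign each row to one of m folds of (nearly) equal size.
  fold <- sample(rep(seq_len(m), length.out = n))
  response <- model.response(model.frame(formula, data))
  cv_pred <- numeric(n)
  for (k in seq_len(m)) {
    held_out <- fold == k
    # Fit on the other folds, predict the held-out fold.
    fit <- lm(formula, data = data[!held_out, , drop = FALSE])
    cv_pred[held_out] <- predict(fit, newdata = data[held_out, , drop = FALSE])
  }
  mean((response - cv_pred)^2)
}

mse_3fold <- kfold_mse(dist ~ speed, cars, m = 3)
mse_loo   <- kfold_mse(dist ~ speed, cars, m = nrow(cars))  # m = n gives LOOCV
```

The 3-fold estimate depends on the random fold assignment (hence the seed), whereas the leave-one-out estimate does not.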
Another solution is to use caret:
library(caret)
x <- rnorm(1000, 3, 2) # define x first so that y is built from the same x
data <- data.frame(x = x, y = 2*x + rnorm(1000))
train(y ~ x, method = "lm", data = data, trControl = trainControl(method = "LOOCV"))
Linear Regression

1000 samples
   1 predictor

No pre-processing
Resampling: Leave-One-Out Cross-Validation
Summary of sample sizes: 999, 999, 999, 999, 999, 999, ...
Resampling results:

  RMSE      Rsquared  MAE
  1.050268  0.940619  0.836808

Tuning parameter 'intercept' was held constant at a value of TRUE
You can just use a custom function that relies on a statistical trick to avoid actually fitting all N models:
loocv <- function(fit) {
  h <- lm.influence(fit)$hat # leverages (diagonal of the hat matrix)
  mean((residuals(fit) / (1 - h))^2) # mean squared leave-one-out error
}
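The trick rests on a standard identity for least-squares fits: the leave-one-out prediction error for observation i equals the ordinary residual divided by 1 - h_i, where h_i is the i-th diagonal element of the hat matrix. In the notation below (the symbols are chosen here, not taken from the answer), \hat{y}_i is the fitted value from the full model and \hat{y}_{(-i)} is the prediction from the model fitted without observation i:

```latex
y_i - \hat{y}_{(-i)} = \frac{y_i - \hat{y}_i}{1 - h_i},
\qquad
\mathrm{CV}_{\mathrm{LOO}} = \frac{1}{n} \sum_{i=1}^{n}
  \left( \frac{y_i - \hat{y}_i}{1 - h_i} \right)^{2}
```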
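There is an explanation here: https://gerardnico.com/wiki/lang/r/cross_validation
It only works for linear models.
I think you may want to take the square root of the mean in the formula, so that the result is an RMSE rather than an MSE.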
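As a usage sketch, the same shortcut can be applied to a single fit and square-rooted to get an RMSE on the scale of the response. The data mirror the question's simulation; the seed and the names loo_resid, loocv_mse, loocv_rmse and r2_pred are assumptions made for illustration.

```r
set.seed(1)  # arbitrary seed, only for reproducibility
x <- rnorm(1000, 3, 2)
y <- 2 * x + rnorm(1000)
fit <- lm(y ~ x)

# Leave-one-out residuals via the hat-value shortcut (linear models only).
loo_resid <- residuals(fit) / (1 - hatvalues(fit))

loocv_mse  <- mean(loo_resid^2)
loocv_rmse <- sqrt(loocv_mse)  # same units as y

# Predictive R^2: 1 - (sum of squared LOO errors) / (total sum of squares).
r2_pred <- 1 - sum(loo_resid^2) / sum((y - mean(y))^2)
```

Because the simulated noise has standard deviation 1, loocv_rmse should come out close to 1, which is in line with the caret output above.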
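cv.glm from the boot package (https://www.rdocumentation.org/packages/boot/versions/1.3-20/topics/cv.glm) performs LOOCV by default and only needs the data and the fitted model. It expects a glm object; glm() with the default gaussian family fits the same model as lm().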
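A minimal sketch of that call, assuming the boot package is installed (it normally ships with R as a recommended package). The built-in cars data set stands in for real data, and the variable names are illustrative. With the default cost function, the first element of delta is the raw cross-validation estimate of the mean squared prediction error.

```r
library(boot)

# Gaussian glm() gives the same least-squares fit as lm(dist ~ speed, data = cars).
fit <- glm(dist ~ speed, data = cars)

# K defaults to the number of rows, i.e. leave-one-out.
cv <- cv.glm(data = cars, glmfit = fit)

loocv_mse  <- cv$delta[1]      # raw LOOCV estimate of the mean squared error
loocv_rmse <- sqrt(loocv_mse)  # back on the scale of the response
```

Unlike the hat-value shortcut, cv.glm refits the model once per observation, so it also applies to non-gaussian families, at a higher computational cost.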
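Just write your own code, using an index variable to mark the one observation that is out of sample. I tested this method against the highest-voted answer, the one using caret. Although caret is simple and easy to use, my brute-force method takes less time. (I used LDA instead of lm, but that makes no big difference.)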
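for (index in seq_len(nrow(df))) {
  # fit your model on df[-index, ] (all rows except the held-out one),
  # then predict the held-out row df[index, ] and store the error
}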
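Filled in for lm(), the loop might look like the sketch below. It uses the built-in cars data set and a hypothetical helper name, loocv_brute, and then compares the result with the hat-value shortcut, which should agree up to floating-point error for linear models.

```r
# Hypothetical helper: brute-force LOOCV, refitting lm() once per observation.
loocv_brute <- function(formula, df) {
  response <- model.response(model.frame(formula, df))
  sq_err <- vapply(seq_len(nrow(df)), function(index) {
    fit <- lm(formula, data = df[-index, , drop = FALSE])  # leave row `index` out
    pred <- predict(fit, newdata = df[index, , drop = FALSE])
    unname((response[index] - pred)^2)                     # squared prediction error
  }, numeric(1))
  mean(sq_err)
}

brute_mse <- loocv_brute(dist ~ speed, cars)

# Shortcut computed from the single full fit, for comparison.
full_fit <- lm(dist ~ speed, data = cars)
shortcut_mse <- mean((residuals(full_fit) / (1 - hatvalues(full_fit)))^2)

isTRUE(all.equal(brute_mse, shortcut_mse))
```

The explicit loop generalizes to models without such a shortcut (LDA, for instance), which is where refitting per observation is unavoidable.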