How do I obtain the coefficients, z scores, and p-values for each fold of a k-fold cross validation in R?
I am running 5-fold cross-validation with glm to fit a logistic regression. Here is a reproducible example using the built-in mtcars dataset:
library(caret)
data("mtcars")
str(mtcars)
mtcars$vs<-as.factor(mtcars$vs)
df0<-na.omit(mtcars)
set.seed(123)
train.control <- trainControl(method = "cv", number = 5)
# Train the model
model <- train(vs ~., data = mtcars, method = "glm",
trControl = train.control)
print(model)
summary(model)
model$resample
confusionMatrix(model)
pred.mod <- predict(model)
confusionMatrix(data=pred.mod, reference=mtcars$vs)
Output:
> print(model)
Generalized Linear Model
32 samples
10 predictors
2 classes: '0', '1'
No pre-processing
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 25, 26, 25, 27, 25
Resampling results:
Accuracy Kappa
0.9095238 0.8164638
> summary(model)
Call:
NULL
Deviance Residuals:
Min 1Q Median 3Q Max
-1.181e-05 -2.110e-08 -2.110e-08 2.110e-08 1.181e-05
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 8.117e+01 1.589e+07 0 1
mpg 2.451e+00 5.979e+04 0 1
cyl -3.908e+01 2.947e+05 0 1
disp -1.927e-02 8.518e+03 0 1
hp 3.129e-01 2.283e+04 0 1
drat -2.735e+01 9.696e+05 0 1
wt -1.248e+01 6.437e+05 0 1
qsec 1.565e+01 3.845e+05 0 1
am -4.562e+01 3.632e+05 0 1
gear -2.835e+01 5.448e+05 0 1
carb 1.788e+01 2.971e+05 0 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 4.3860e+01 on 31 degrees of freedom
Residual deviance: 7.2154e-10 on 21 degrees of freedom
AIC: 22
Number of Fisher Scoring iterations: 25
> model$resample
Accuracy Kappa Resample
1 0.8571429 0.6956522 Fold1
2 0.8333333 0.6666667 Fold2
3 0.8571429 0.7200000 Fold3
4 1.0000000 1.0000000 Fold4
5 1.0000000 1.0000000 Fold5
> confusionMatrix(model)
Cross-Validated (5 fold) Confusion Matrix
(entries are percentual average cell counts across resamples)
Reference
Prediction 0 1
0 50.0 3.1
1 6.2 40.6
Accuracy (average) : 0.9062
> pred.mod <- predict(model)
> confusionMatrix(data=pred.mod, reference=mtcars$vs)
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 18 0
1 0 14
Accuracy : 1
95% CI : (0.8911, 1)
No Information Rate : 0.5625
P-Value [Acc > NIR] : 1.009e-08
Kappa : 1
Mcnemar's Test P-Value : NA
Sensitivity : 1.0000
Specificity : 1.0000
Pos Pred Value : 1.0000
Neg Pred Value : 1.0000
Prevalence : 0.5625
Detection Rate : 0.5625
Detection Prevalence : 0.5625
Balanced Accuracy : 1.0000
'Positive' Class : 0
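This is all fine, but I would like to get the summary(model) information for each fold (that is, the coefficients, p-values, z scores, etc. that summary() returns) and, if possible, the sensitivity and specificity for each fold as well. Can anyone help?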
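Interesting question. The values you are looking for cannot be obtained directly from the model object, but they can be recomputed once you know which observations of the training data belong to which fold. That information can be extracted from model if you specify savePredictions = "all" in the trainControl function. With the predictions for each of the k folds, you can do the following: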
# First, save the predictions from all folds
set.seed(123)
train.control <- trainControl(method = "cv", number = 5,
                              savePredictions = "all")
# Train the model
model <- train(vs ~ ., data = mtcars, method = "glm",
               trControl = train.control)
# Now we can extract the statistics you are looking for
fold <- unique(model$pred$Resample)
mystat <- function(model, x) {
  pred <- model$pred
  # Held-out predictions for fold x
  df <- pred[pred$Resample == x, ]
  # Confusion matrix for fold x (includes sensitivity and specificity)
  cm <- confusionMatrix(df$pred, df$obs)
  # Refit on the rows used for training in fold x,
  # i.e. everything except the held-out rows
  control <- trainControl(method = "none")
  newdat <- mtcars[-df$rowIndex, ]
  fit <- train(vs ~ ., data = newdat, method = "glm", trControl = control)
  # Coefficient table: Estimate, Std. Error, z value, Pr(>|z|)
  summ <- summary(fit)
  coefs <- summ$coefficients
  return(list(cm, coefs))
}
stat <- lapply(fold, mystat, model = model)
names(stat) <- fold
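Note that specifying method = "none" in trainControl forces train to fit the model to the entire training set passed to it, without any resampling or parameter tuning.
In this form it is not a pretty function, but it does what you want, and you can always adapt it to make it more general.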
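For comparison, the same per-fold quantities can be sketched without caret, using only base R. This is a minimal sketch, not the method from the answer above: the fold assignment (fold_id), the reduced formula vs ~ mpg + wt, and the 0.5 cutoff are assumptions chosen for illustration, and the folds will not match the ones caret generates. On a data set this small, glm may still warn about fitted probabilities of 0 or 1.

```r
# Sketch: manual 5-fold CV with base R glm (no caret).
data("mtcars")
dat <- mtcars
dat$vs <- as.factor(dat$vs)

set.seed(123)
k <- 5
# Hypothetical fold assignment: shuffle the labels 1..k across the rows.
fold_id <- sample(rep(seq_len(k), length.out = nrow(dat)))

per_fold <- lapply(seq_len(k), function(i) {
  train_rows <- dat[fold_id != i, ]
  test_rows  <- dat[fold_id == i, ]

  # Two predictors only (an assumption, to limit separation problems).
  fit <- glm(vs ~ mpg + wt, data = train_rows, family = binomial)

  # Coefficient table: Estimate, Std. Error, z value, Pr(>|z|).
  coefs <- summary(fit)$coefficients

  # Held-out predictions at a 0.5 cutoff; type = "response" gives P(vs == "1").
  prob1 <- predict(fit, newdata = test_rows, type = "response")
  pred  <- factor(ifelse(prob1 > 0.5, "1", "0"), levels = c("0", "1"))
  tab   <- table(Prediction = pred, Reference = test_rows$vs)

  # "0" is treated as the positive class, as in the caret output above.
  sens <- tab["0", "0"] / sum(tab[, "0"])  # true "0" predicted as "0"
  spec <- tab["1", "1"] / sum(tab[, "1"])  # true "1" predicted as "1"

  list(coefficients = coefs, sensitivity = sens, specificity = spec)
})
names(per_fold) <- paste0("Fold", seq_len(k))
```

per_fold$Fold1$coefficients then holds the coefficient table for the first fold, and sapply(per_fold, function(f) f$sensitivity) collects the per-fold sensitivities.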