Obtaining predictions on CV test fold partitions with the R caret package?
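I am using caret to find and compare the predictions of several models. I first partition my data into 5 cross-validation folds, then use 10-fold CV within each of the 5 training sets to select the best model parameters.
Example code for a single glmnet model on a small (n=400) test dataset: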
# Load caret, read the data, and convert admit to a factor.
library(caret)
mydata <- read.csv("http://www.ats.ucla.edu/stat/data/binary.csv")
mydata$admit <- as.factor(mydata$admit)
# Create levels yes/no to make sure the class probabilities get valid column names.
# Note: this renames the original level "0" to "yes" and "1" to "no".
levels(mydata$admit) <- c("yes", "no")
# Partition data into 5 folds.
set.seed(123)
folds <- createFolds(mydata$admit, k = 5)
# Train elastic net logistic regression via 10-fold CV on each of 5 training folds using the index argument.
# Note: createFolds() returns held-out rows by default (returnTrain = FALSE), so each
# resample here trains on about 80 rows, as the sample sizes in the output show.
set.seed(123)
train_control <- trainControl(method = "cv",
                              number = 10,
                              index = folds,
                              classProbs = TRUE,
                              savePredictions = TRUE)
glmnetGrid <- expand.grid(alpha = c(0, .5, 1), lambda = c(.1, 1, 10))
model <- train(admit ~ .,
               data = mydata,
               trControl = train_control,
               method = "glmnet",
               family = "binomial",
               tuneGrid = glmnetGrid,
               metric = "Accuracy",
               preProcess = c("center", "scale"))
> model
glmnet
400 samples
3 predictor
2 classes: 'yes', 'no'
Pre-processing: centered (3), scaled (3)
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 79, 80, 80, 81, 80
Resampling results across tuning parameters:
alpha  lambda  Accuracy      Kappa          Accuracy SD     Kappa SD
0.0       0.1  0.6918972780  0.08970669720  0.016425551472  0.08416581606
0.0       1.0  0.6825007141  0.00000000000  0.001368477994  0.00000000000
0.0      10.0  0.6825007141  0.00000000000  0.001368477994  0.00000000000
0.5       0.1  0.6818893800  0.04127002380  0.008252409699  0.04052581228
0.5       1.0  0.6825007141  0.00000000000  0.001368477994  0.00000000000
0.5      10.0  0.6825007141  0.00000000000  0.001368477994  0.00000000000
1.0       0.1  0.6800085023  0.02149826881  0.005876570847  0.04807159045
1.0       1.0  0.6825007141  0.00000000000  0.001368477994  0.00000000000
1.0      10.0  0.6825007141  0.00000000000  0.001368477994  0.00000000000
Accuracy was used to select the optimal model using the largest value.
The final values used for the model were alpha = 0 and lambda = 0.1.
> summary(model$pred)
pred       obs       rowIndex        yes                no                  alpha        lambda        Resample
yes:14192  yes:9828  Min.   :  1.00  Min.   :0.2650250  Min.   :0.03333769  Min.   :0.0  Min.   : 0.1  Length:14400
no :  208  no :4572  1st Qu.:100.75  1st Qu.:0.6750000  1st Qu.:0.31250000  1st Qu.:0.0  1st Qu.: 0.1  Class :character
                     Median :200.50  Median :0.6835443  Median :0.31645570  Median :0.5  Median : 1.0  Mode  :character
                     Mean   :200.50  Mean   :0.6840322  Mean   :0.31596777  Mean   :0.5  Mean   : 3.7
                     3rd Qu.:300.25  3rd Qu.:0.6875000  3rd Qu.:0.32500000  3rd Qu.:1.0  3rd Qu.:10.0
                     Max.   :400.00  Max.   :0.9666623  Max.   :0.73497501  Max.   :1.0  Max.   :10.0
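Question: does caret's syntax allow me to obtain the 5 sets of test-fold predictions from the best-fitting model for each of the 5 training-fold partitions?
As it stands, model$pred returns 14,400 predictions, along with the best-fitting model for the whole dataset. I would like model$pred to return n = 5 x 80 = 400 predictions from 5 independent models, one fit to each training fold.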
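The 14,400 figure is consistent with the accounting below, assuming caret stores one row per held-out observation, per resample, per tuning-grid combination when savePredictions = TRUE. The variable names are illustrative; the sizes are taken from the printed output above.

```r
n_rows     <- 400
train_size <- c(79, 80, 80, 81, 80)  # "Summary of sample sizes" in the output
grid_size  <- 3 * 3                  # 3 alpha values x 3 lambda values
held_out   <- n_rows - train_size    # rows predicted in each resample
total_rows <- sum(held_out) * grid_size
total_rows                           # 14400
```

Read this way, each resample was fit on roughly 80 rows and predicted the remaining 320 or so, for all nine grid combinations.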
You only need to set savePredictions = "final". This should limit the output to what you need.
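A minimal sketch of that setting follows. It is not the asker's exact script: the data frame `toy` is a simulated, hypothetical stand-in for the admissions data so the example runs without a download, and the folds are created with returnTrain = TRUE (an addition here) so that each element passed to index holds training rows and each observation is held out exactly once. The glmnet package is assumed to be installed.

```r
library(caret)  # method = "glmnet" also needs the glmnet package installed

set.seed(123)
# Hypothetical stand-in for the admissions data: 400 rows, 3 predictors.
n <- 400
toy <- data.frame(
  gre  = rnorm(n, mean = 580, sd = 115),
  gpa  = rnorm(n, mean = 3.4, sd = 0.4),
  rank = sample(1:4, n, replace = TRUE)
)
p <- plogis(-4 + 0.003 * toy$gre + 0.8 * toy$gpa - 0.5 * toy$rank)
toy$admit <- factor(ifelse(runif(n) < p, "yes", "no"), levels = c("yes", "no"))

# returnTrain = TRUE: each list element holds the TRAINING rows of one fold,
# which is what trainControl(index = ...) expects.
train_folds <- createFolds(toy$admit, k = 5, returnTrain = TRUE)

ctrl <- trainControl(
  method          = "cv",
  index           = train_folds,
  classProbs      = TRUE,
  savePredictions = "final"  # keep hold-out predictions for the selected tuning values only
)

grid <- expand.grid(alpha = c(0, 0.5, 1), lambda = c(0.1, 1, 10))

fit <- train(admit ~ ., data = toy,
             method = "glmnet", family = "binomial",
             tuneGrid = grid, metric = "Accuracy",
             preProcess = c("center", "scale"),
             trControl = ctrl)

nrow(fit$pred)            # 400 here: each row is held out exactly once
table(fit$pred$Resample)  # number of held-out predictions per fold
```

The rows in fit$pred come from the five fold-specific fits, but all of them use the single alpha and lambda pair that train() selected across the resamples. Tuning separately within each training fold would need an explicit outer loop around train().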