Different model accuracy when rerunning preProcess(), predict() and train() in R (caret)
The data below is just an example; what confuses me is the sequence of operations, and the same question applies to any data:
library(caret)
library(AppliedPredictiveModeling)  # provides the AlzheimerDisease data
set.seed(3433)
data(AlzheimerDisease)
complete <- data.frame(diagnosis, predictors)
in_train <- createDataPartition(complete$diagnosis, p = 0.75)[[1]]
training <- complete[in_train,]
testing <- complete[-in_train,]
predIL <- grep("^IL", names(training))
smalltrain <- training[, c(1, predIL)]
fit_noPCA <- train(diagnosis ~ ., method = "glm", data = smalltrain)
pre_proc_obj <- preProcess(smalltrain[,-1], method = "pca", thresh = 0.8)
smalltrainsPCs <- predict(pre_proc_obj, smalltrain[,-1])
fit_PCA <- train(x = smalltrainsPCs, y = smalltrain$diagnosis, method = "glm")
fit_noPCA$results$Accuracy
fit_PCA$results$Accuracy
When I run this code, fit_noPCA has an accuracy of 0.689539 and fit_PCA has an accuracy of 0.682951. But when I rerun the last part of the code:
fit_noPCA <- train(diagnosis ~ ., method = "glm", data = smalltrain)
pre_proc_obj <- preProcess(smalltrain[,-1], method = "pca", thresh = 0.8)
smalltrainsPCs <- predict(pre_proc_obj, smalltrain[,-1])
fit_PCA <- train(x = smalltrainsPCs, y = smalltrain$diagnosis, method = "glm")
fit_noPCA$results$Accuracy
fit_PCA$results$Accuracy
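I get different accuracy values every time I rerun these 6 lines. Why is that? Is it because I am not resetting the seed? Even so, where is the inherent randomness in this process?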
By default, train() estimates performance with bootstrap resampling, as you can see in the printed model:
library(caret)
library(AppliedPredictiveModeling)
> fit_noPCA
Generalized Linear Model
251 samples
12 predictor
2 classes: 'Impaired', 'Control'
No pre-processing
Resampling: Bootstrapped (25 reps)
Summary of sample sizes: 251, 251, 251, 251, 251, 251, ...
Resampling results:
  Accuracy   Kappa
  0.6870006  0.04107016
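So the bootstrapped samples are different on every call to train, and the resampled accuracy changes with them. To get the same result each time, set the seed before running train: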
set.seed(111)
fit_PCA <- train(x = smalltrainsPCs, y = smalltrain$diagnosis, method = "glm",trControl=trainControl(method="boot",number=100))
fit_PCA$results$Accuracy
[1] 0.6983512
set.seed(112)
fit_PCA <- train(x = smalltrainsPCs, y = smalltrain$diagnosis, method = "glm",trControl=trainControl(method="boot",number=100))
fit_PCA$results$Accuracy
[1] 0.6991537
set.seed(111)
fit_PCA <- train(x = smalltrainsPCs, y = smalltrain$diagnosis, method = "glm",trControl=trainControl(method="boot",number=100))
fit_PCA$results$Accuracy
[1] 0.6983512
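The role of the seed can be illustrated in base R alone. The sketch below is not caret's internal code; it only assumes that a bootstrap resample amounts to drawing row indices with replacement. The value n = 251 is taken from the printed output above, and the names boot_a, boot_b and boot_c are made up for illustration.

```r
# A bootstrap resample is a draw of row indices with replacement,
# so it depends on the state of the random number generator.
n <- 251  # number of training rows reported in the output above

set.seed(111)
boot_a <- sample.int(n, size = n, replace = TRUE)

# No reseed here: the generator has advanced, so this draw differs.
boot_b <- sample.int(n, size = n, replace = TRUE)

# Resetting the seed reproduces the first draw exactly.
set.seed(111)
boot_c <- sample.int(n, size = n, replace = TRUE)

same_seed_matches <- identical(boot_a, boot_c)
no_reseed_matches <- identical(boot_a, boot_b)
print(c(same_seed = same_seed_matches, no_reseed = no_reseed_matches))
```

Since preProcess() with method = "pca" and glm fitting are deterministic, the resampling step is the only part of the rerun lines that consumes random numbers.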
Alternatively, if you use e.g. cv, you can define the folds yourself with index= in trainControl.
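A minimal sketch of the index= approach follows, assuming caret is installed. To keep it self-contained it uses a two-class subset of the built-in iris data rather than the Alzheimer's data; the names dat, x, y, cv_folds, ctrl, fit_a and fit_b are illustrative only. In the question's setting, x and y would be smalltrainsPCs and smalltrain$diagnosis.

```r
library(caret)

# Two-class toy data so that method = "glm" fits a logistic regression.
dat <- droplevels(subset(iris, Species != "setosa"))
x <- dat[, c("Sepal.Length", "Sepal.Width")]
y <- dat$Species

# Create the folds once. returnTrain = TRUE returns the training-row
# indices of each fold, which is what trainControl(index = ...) expects.
set.seed(111)
cv_folds <- createFolds(y, k = 10, returnTrain = TRUE)
ctrl <- trainControl(method = "cv", index = cv_folds)

# Both fits reuse the same predefined folds, with no reseeding in between.
fit_a <- train(x = x, y = y, method = "glm", trControl = ctrl)
fit_b <- train(x = x, y = y, method = "glm", trControl = ctrl)

# The resampled accuracies should agree because the folds are fixed.
same_accuracy <- isTRUE(all.equal(fit_a$results$Accuracy,
                                  fit_b$results$Accuracy))
print(same_accuracy)
```

Fixing the folds this way also makes comparisons between models fairer, since every model is evaluated on the same splits.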