R-Caret: how to build a more efficient model with multiple models and predict new results
My training dataset (train) is a data frame with n features and one column containing the outcome y. I built three individual models, for example:
m1 <- train(y ~ ., data = train, method = "lda")
m2 <- train(y ~ ., data = train, method = "rf")
m3 <- train(y ~ ., data = train, method = "gbm")
Using a test dataset (test), which of course also contains the outcome y, I can evaluate the quality of these individual models:
pred1 <- predict(m1, newdata = test)
pred2 <- predict(m2, newdata = test)
pred3 <- predict(m3, newdata = test)
If I apply each individual model to a data frame DATA_TO_PREDICT (outcome unknown) with 5 examples, the output is, as expected, 5 predictions per model:
predict(m1, DATA_TO_PREDICT)
predict(m2, DATA_TO_PREDICT)
predict(m3, DATA_TO_PREDICT)
Now I want to build a combined model with a random forest, using the R caret package:
DF <- data.frame(pred1, pred2, pred3, y = test$y)
MODEL <- train(y ~ ., data = DF, method = "rf")
I can observe that the accuracy of the combined model increases:
predMODEL <- predict(MODEL, DF)
But if I apply the combined model to DATA_TO_PREDICT (outcome unknown), the output is not just 5 predictions but a huge list of more than a hundred repeated results. I used:
predict(MODEL, newdata = DATA_TO_PREDICT)
Example:
Here is a concrete example of the wrong output. I want to predict 4 new observations, but the result contains dozens of outputs:
library(caret)
library(gbm)
set.seed(10)
library(AppliedPredictiveModeling)
data(AlzheimerDisease)
adData = data.frame(diagnosis,predictors)
inTrain = createDataPartition(adData$diagnosis, p = 3/4)[[1]]
training = adData[ inTrain,]
testing = adData[-inTrain,]
inTEST <- (5:nrow(testing))
test <- testing[inTEST,]
DATA_TO_PREDICT <- testing[-inTEST,]
m1 <- train(diagnosis ~ ., data=training, method="rf")
m2 <- train(diagnosis ~ ., data=training, method="gbm")
m3 <- train(diagnosis ~ ., data=training, method="lda")
p1 <- predict(m1, newdata = test)
p2 <- predict(m2, newdata = test)
p3 <- predict(m3, newdata = test)
DF <- data.frame(p1, p2, p3, diagnosis = test$diagnosis)
MODEL <- train(diagnosis ~ ., data = DF, method = "rf")
predMODEL <- predict(MODEL, DF)
Then, if I apply the combined model:
pred1 <- predict(m1, DATA_TO_PREDICT)
pred2 <- predict(m2, DATA_TO_PREDICT)
pred3 <- predict(m3, DATA_TO_PREDICT)
DF2 <- data.frame(pred1, pred2, pred3)
predict(MODEL, newdata = DF2)
Note that DATA_TO_PREDICT has only 4 examples, yet the output is:
[1] Control Control Control Control Control Control Control Control
[9] Control Control Control Control Control Control Control Control
[17] Control Control Control Control Control Control Control Control
[25] Control Control Control Control Control Control Control Control
[33] Control Control Control Control Control Control Control Control
[41] Control Control Control Control Control Control Control Control
[49] Control Control Control Control Control Control Control Control
[57] Control Control Control Control Control Control Control Control
[65] Control Control Control Control Control Control Control Control
[73] Control Control Control Control Control Control
Levels: Impaired Control
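This is because MODEL was trained on the predictions of the three individual models (pred1, pred2 and pred3 for the test data), whereas in the last step DATA_TO_PREDICT, which consists of raw observations rather than predictions, was passed to MODEL. The individual models' predictions for DATA_TO_PREDICT must first be stored and then used as newdata for MODEL. The columns of that data frame must also carry the same names as the predictors MODEL was trained on (here p1, p2 and p3).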
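The unexpected output length (78 instead of 4) can be reproduced in base R. The sketch below is an illustration, not the caret code from the question: it uses lm and made-up data, on the assumption that predict for a formula-based caret model resolves variables the same way base R's model.frame does. When a predictor named in the formula is missing from newdata, R looks it up in the environment where the formula was created, finds the full-length training vector there, and returns one prediction per element of that vector.

```r
set.seed(1)

# Level-1 "predictions" living in the workspace, like p1 in the question.
p1 <- rnorm(20)
y  <- 2 * p1 + rnorm(20)

# Fit with a formula; the formula remembers the environment it was created in.
fit <- lm(y ~ p1, data = data.frame(p1, y))

# New data with 4 rows, but under a column name the model does not know.
wrong <- data.frame(pred1 = rnorm(4))
# predict() warns that newdata had 4 rows but the variables found have 20.
n_wrong <- length(suppressWarnings(predict(fit, newdata = wrong)))

# New data with 4 rows and the column name used during training.
right <- data.frame(p1 = rnorm(4))
n_right <- length(predict(fit, newdata = right))

cat("mismatched names:", n_wrong, "predictions\n")
cat("matching names:  ", n_right, "predictions\n")
```

With mismatched names the call silently falls back to the 20 training values; with matching names it returns the 4 predictions that were asked for.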
# (Beginning of the example omitted)
DF <- data.frame(p1, p2, p3, diagnosis = test$diagnosis)
# This trains a model with predictions as inputs:
MODEL <- train(diagnosis ~ ., data = DF, method = "rf")
# This is missing ----------------------
# To get the inputs for the ensemble model
# the predictions for DATA_TO_PREDICT are needed
p1b <- predict(m1, newdata = DATA_TO_PREDICT)
p2b <- predict(m2, newdata = DATA_TO_PREDICT)
p3b <- predict(m3, newdata = DATA_TO_PREDICT)
DFb <- data.frame(p1b, p2b, p3b)
# The column names must match the predictors MODEL was trained on
colnames(DFb) <- c("p1", "p2", "p3")
#----------------------------------------
predMODEL <- predict(MODEL, DFb)
# [1] Control Control Control Control
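One way to avoid the naming mistake is to build the inputs for the combined model with a single helper, both for training and for prediction. The following is a minimal sketch under stated assumptions: stack_inputs is a hypothetical helper (not part of caret), and lm on the built-in mtcars data stands in for the lda, rf and gbm models so that the example runs without extra packages. Since the helper only calls predict(m, newdata = ...), it should work the same way with caret train objects.

```r
# Hypothetical helper: collect level-1 predictions under fixed column names
# (p1, p2, ...), so training and prediction inputs always line up.
stack_inputs <- function(models, newdata) {
  preds <- lapply(models, function(m) unname(predict(m, newdata = newdata)))
  names(preds) <- paste0("p", seq_along(models))
  as.data.frame(preds)
}

# Illustrative split of mtcars (an assumption, not the data from the question).
train_set <- mtcars[1:16, ]
test_set  <- mtcars[17:28, ]
new_set   <- mtcars[29:32, ]  # 4 rows whose outcome we pretend not to know

# Three simple level-1 regression models standing in for lda, rf and gbm.
models <- list(
  lm(mpg ~ wt,   data = train_set),
  lm(mpg ~ hp,   data = train_set),
  lm(mpg ~ disp, data = train_set)
)

# Train the level-2 model on the level-1 predictions for the test set.
DF <- stack_inputs(models, test_set)
DF$mpg <- test_set$mpg
MODEL <- lm(mpg ~ ., data = DF)

# Predict the new data: same helper, therefore same column names.
DFb <- stack_inputs(models, new_set)
predMODEL <- predict(MODEL, newdata = DFb)
length(predMODEL)
```

Because the same function names the columns in both places, the colnames step cannot be forgotten, and the combined model returns exactly one prediction per row of the new data.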