Two-level stacked learner (ensemble model) combining elastic net and logistic regression using mlr3
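I am trying to solve a common problem in medicine: combining a prediction model with other sources of information, e.g. an expert opinion [sometimes heavily emphasized in medicine], called the superdoc predictor in this post.
This can be solved by stacking the model with a logistic regression (into which the expert opinion enters), as described on page 26 of this paper: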
Afshar P, Mohammadi A, Plataniotis KN, Oikonomou A, Benali H. From
Handcrafted to Deep-Learning-Based Cancer Radiomics: Challenges and
Opportunities. IEEE Signal Process Mag 2019; 36: 132–60. Available here
I have tried this without accounting for overfitting (I did not use out-of-fold predictions from the lower-level learner):
Example data
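# libraries
library(tidyverse)
library(caret)
library(glmnet)
library(mlbench)
library(broom) # tidy(), used on the coefficient matrix below
library(pROC)  # roc(), used to compute the test-set AUC below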
# get example data
data(PimaIndiansDiabetes, package="mlbench")
data <- PimaIndiansDiabetes
# add the super doctors opinion to the data
set.seed(2323)
data %>%
  rowwise() %>%
  mutate(superdoc = case_when(diabetes == "pos" ~ as.numeric(sample(0:2, 1)), TRUE ~ 0)) -> data
# separate the data in a training set and test set
train.data <- data[1:550,]
test.data <- data[551:768,]
Stacked model without out-of-fold predictions:
# elastic net regression (without the superdoc's opinion)
set.seed(2323)
model <- train(
  diabetes ~ ., data = train.data %>% select(-superdoc), method = "glmnet",
  trControl = trainControl("repeatedcv",
                           number = 10,
                           repeats = 10,
                           classProbs = TRUE,
                           savePredictions = TRUE,
                           summaryFunction = twoClassSummary),
  tuneLength = 10,
  metric = "ROC" # the ROC metric is provided by twoClassSummary
)
# extract the coefficients for the best alpha and lambda
coef(model$finalModel, model$finalModel$lambdaOpt) -> coeffs
tidy(coeffs) %>% tibble() -> coeffs
coef.interc = coeffs %>% filter(row=="(Intercept)") %>% pull(value)
coef.pregnant = coeffs %>% filter(row=="pregnant") %>% pull(value)
coef.glucose = coeffs %>% filter(row=="glucose") %>% pull(value)
coef.pressure = coeffs %>% filter(row=="pressure") %>% pull(value)
coef.mass = coeffs %>% filter(row=="mass") %>% pull(value)
coef.pedigree = coeffs %>% filter(row=="pedigree") %>% pull(value)
coef.age = coeffs %>% filter(row=="age") %>% pull(value)
# combine the model with the superdoc's opinion in a logistic regression model
finalmodel = glm(diabetes ~ superdoc + I(coef.interc + coef.pregnant*pregnant + coef.glucose*glucose + coef.pressure*pressure + coef.mass*mass + coef.pedigree*pedigree + coef.age*age),family=binomial, data=train.data)
# make predictions on the test data
predict(finalmodel,test.data, type="response") -> predictions
# check the AUC of the model in the test data
roc(test.data$diabetes,predictions, ci=TRUE)
#> Setting levels: control = neg, case = pos
#> Setting direction: controls < cases
#>
#> Call:
#> roc.default(response = test.data$diabetes, predictor = predictions, ci = TRUE)
#>
#> Data: predictions in 145 controls (test.data$diabetes neg) < 73 cases (test.data$diabetes pos).
#> Area under the curve: 0.9345
#> 95% CI: 0.8969-0.9721 (DeLong)
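The weakness of the approach above is that the second-level logistic regression sees level-0 predictions for the same rows the elastic net was fitted on. The idea of out-of-fold predictions can be sketched in base R as follows. This is only an illustration under stated assumptions: the data are simulated, a plain `glm` stands in for the elastic net, and the names `x1`, `x2`, `expert`, `oof`, and `level1` are hypothetical.

```r
# Minimal base-R sketch of stacking with out-of-fold predictions.
# Assumptions: simulated data; glm() stands in for the level-0 elastic net;
# `expert` is a noisy stand-in for the superdoc opinion.
set.seed(42)
n <- 300
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- rbinom(n, 1, plogis(0.8 * x1 - 0.5 * x2))
expert <- y + rnorm(n)
d <- data.frame(y, x1, x2, expert)

k <- 5
fold <- sample(rep(seq_len(k), length.out = n)) # random fold assignment
oof <- rep(NA_real_, n)

for (i in seq_len(k)) {
  held_out <- fold == i
  # fit the level-0 model without the held-out fold ...
  fit_i <- glm(y ~ x1 + x2, family = binomial, data = d[!held_out, ])
  # ... and predict only the rows it has not seen
  oof[held_out] <- predict(fit_i, newdata = d[held_out, ], type = "link")
}

# the level-1 model combines the out-of-fold score with the expert opinion
d$level0 <- oof
level1 <- glm(y ~ level0 + expert, family = binomial, data = d)
coef(level1)
```

For prediction on new data, the level-0 model would be refitted on all training rows, which is the scheme that `po("learner_cv")` in mlr3pipelines is designed to automate.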
Now I would like to account for out-of-fold predictions using the mlr3 package family, following this very helpful post: Tuning a stacked learner
#library
library(mlr3)
library(mlr3learners)
library(mlr3pipelines)
library(mlr3filters)
library(mlr3tuning)
library(paradox)
library(glmnet)
# create elastic net regression
glmnet_lrn = lrn("classif.cv_glmnet", predict_type = "prob")
# create the learner out-of-bag predictions
# (I could not find a setting to filter the predictors, i.e. to not send the superdoc predictor here)
glmnet_cv1 = po("learner_cv", glmnet_lrn, id = "glmnet")
# summarize steps
# (I could not find a setting to send only the superdoc predictor to "union1")
level0 = gunion(list(
  glmnet_cv1,
  po("nop", id = "only_superdoc_predictor"))) %>>%
  po("featureunion", id = "union1")
# final logistic regression
log_reg_lrn = lrn("classif.log_reg", predict_type = "prob")
# combine ensemble model
ensemble = level0 %>>% log_reg_lrn
ensemble$plot(html = FALSE)
Created on 2021-03-15 by the reprex package (v1.0.0)
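My questions (I am rather new to the mlr3 package family):
- Is the mlr3 package family well suited for the ensemble model I am trying to build?
- If yes, how could I finalize the ensemble model and make predictions on test.data?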
I think mlr3 / mlr3pipelines is well suited for your task. It appears that what you are missing is mainly PipeOpSelect / po("select"), which lets you extract features based on their name or other properties and makes use of Selector objects. Your code could look something like
library("mlr3")
library("mlr3pipelines")
library("mlr3learners")
# create elastic net regression
glmnet_lrn = lrn("classif.cv_glmnet", predict_type = "prob")
# create the learner out-of-bag predictions
glmnet_cv1 = po("learner_cv", glmnet_lrn, id = "glmnet")
# PipeOp that drops 'superdoc', i.e. selects all except 'superdoc'
# (ID given to avoid ID clash with other selector)
drop_superdoc = po("select", id = "drop.superdoc",
  selector = selector_invert(selector_name("superdoc")))
# PipeOp that selects 'superdoc' (and drops all other columns)
select_superdoc = po("select", id = "select.superdoc",
  selector = selector_name("superdoc"))
# superdoc along one path, the fitted model along the other
stacking_layer = gunion(list(
  select_superdoc,
  drop_superdoc %>>% glmnet_cv1
)) %>>% po("featureunion", id = "union1")
# final logistic regression
log_reg_lrn = lrn("classif.log_reg", predict_type = "prob")
# combine ensemble model
ensemble = stacking_layer %>>% log_reg_lrn
This is what it looks like:
ensemble$plot(html = FALSE)
To train and evaluate the model, we need to create Task objects:
train.task <- TaskClassif$new("train.data", train.data, target = "diabetes")
test.task <- TaskClassif$new("test.data", test.data, target = "diabetes")
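The model can now be trained, then used for prediction, and the quality of the predictions can be evaluated. This works best if we turn the ensemble into a Learner: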
elearner = as_learner(ensemble)
# Train the Learner:
elearner$train(train.task)
# (The training may give a warning because the glm receives collinear features:
# the positive and the negative class probabilities)
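The warning mentioned in the comment above can be reproduced outside mlr3. The two class probabilities always sum to 1, so together with the intercept they are linearly dependent and `glm` cannot estimate a separate coefficient for both. A minimal base-R sketch on simulated data (the names `p_pos` and `p_neg` are hypothetical stand-ins for the two probability columns):

```r
# Two probability columns that sum to 1 are collinear with the intercept.
set.seed(1)
p_pos <- runif(200)          # stand-in for the predicted P(pos)
p_neg <- 1 - p_pos           # P(neg) carries no additional information
y <- rbinom(200, 1, p_pos)

# glm() detects the rank deficiency and reports NA for the aliased term
fit_both <- glm(y ~ p_pos + p_neg, family = binomial)
coef(fit_both)

# keeping only one of the two columns gives a full-rank fit
fit_one <- glm(y ~ p_pos, family = binomial)
coef(fit_one)
```

The fitted probabilities are the same either way, so the warning is harmless here. If it is a nuisance, one option would be an extra `po("select")` step that keeps only one of the two probability columns before the final learner.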
Get predictions on the test set:
prediction = elearner$predict(test.task)
print(prediction)
#> <PredictionClassif> for 218 observations:
#> row_ids truth response prob.neg prob.pos
#> 1 neg neg 0.9417067 0.05829330
#> 2 neg neg 0.9546343 0.04536566
#> 3 neg neg 0.9152019 0.08479810
#> ---
#> 216 neg neg 0.9147406 0.08525943
#> 217 pos neg 0.9078216 0.09217836
#> 218 neg neg 0.9578515 0.04214854
The prediction was made on a Task, so it can be used directly to measure performance against the ground truth, e.g. using the "classif.auc" Measure:
msr("classif.auc")$score(prediction)
#> [1] 0.9308455
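For readers unfamiliar with what this number means: the AUC is the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. A minimal base-R sketch of the rank-based (Mann-Whitney) formulation follows. The helper `auc_rank` is hypothetical and written only for illustration, it is not how mlr3 computes the measure internally.

```r
# Rank-based AUC: share of (positive, negative) pairs that are ordered
# correctly by the score; ties count as 1/2 via average ranks.
auc_rank <- function(truth, score, positive) {
  is_pos <- truth == positive
  n_pos <- sum(is_pos)
  n_neg <- sum(!is_pos)
  r <- rank(score) # average ranks handle tied scores
  (sum(r[is_pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

truth <- c("neg", "neg", "pos", "pos")
score <- c(0.1, 0.4, 0.35, 0.8)
auc_rank(truth, score, positive = "pos")
# 0.75: three of the four positive/negative pairs are ordered correctly
```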
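Two notes here:
- You have manually split your data into training and test sets. mlr3 lets you perform resampling automatically from a single Task object, which can go beyond a simple train-test split. Using the data from the question, 10-fold cross-validation looks like this: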
all.task <- TaskClassif$new("all.data", data, target = "diabetes")
rr = resample(all.task, elearner, rsmp("cv")) # will take some time
rr$aggregate(msr("classif.auc"))
#> classif.auc
#> 0.9366438
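- I have shown how to build the graph with the po("select") PipeOp because it is completely general: you could choose to include or exclude features both in the glmnet_lrn Learner and directly in the log_reg_lrn by using different selector values. If all you really want to do is "divert" a single feature away from a single operation, you can also set affect_columns to a Selector that chooses the columns you want. The following creates a (linear) graph that does exactly the same thing but is less flexible: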
glmnet_cv1_nosuperdoc = po("learner_cv", glmnet_lrn, id = "glmnet",
  affect_columns = selector_invert(selector_name("superdoc")))
ensemble2 = glmnet_cv1_nosuperdoc %>>% log_reg_lrn
e2learner = as_learner(ensemble2)
# etc.
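To check what a Selector keeps before wiring it into a larger graph, a po("select") PipeOp can be trained on a Task by itself. A minimal sketch, assuming the built-in `tsk("pima")` task that ships with mlr3 as a stand-in for the question's data, and using `glucose` in place of `superdoc` (which does not exist in that task):

```r
library(mlr3)
library(mlr3pipelines)

task <- tsk("pima") # built-in task; a stand-in for the question's data

# keep everything except 'glucose'
drop_one <- po("select", id = "drop.glucose",
  selector = selector_invert(selector_name("glucose")))
dropped <- drop_one$train(list(task))[[1]]
dropped$feature_names

# keep only 'glucose'
keep_one <- po("select", id = "keep.glucose",
  selector = selector_name("glucose"))
kept <- keep_one$train(list(task))[[1]]
kept$feature_names
```

Calling `$train()` on a single PipeOp takes a list of inputs and returns a list of outputs, so the resulting Task can be inspected directly.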