Possible bug with bagging wrapper in mlr
The bagging wrapper seems to give strange results. If I apply it to a simple logistic regression, the logloss increases by a factor of 10:
library(mlbench)
library(mlr)
data(PimaIndiansDiabetes)
trainTask1 <- makeClassifTask(data = PimaIndiansDiabetes,target = "diabetes",positive = "pos")
bagged.lrn = makeBaggingWrapper(makeLearner("classif.logreg"), bw.iters = 10, bw.replace = TRUE, bw.size = 0.8, bw.feats = 1)
bagged.lrn = setPredictType(bagged.lrn,"prob")
non.bagged.lrn = setPredictType(makeLearner("classif.logreg"),"prob")
rdesc = makeResampleDesc("CV", iters = 5L)
resample(learner = non.bagged.lrn, task = trainTask1, resampling = rdesc, show.info = FALSE,measures = logloss)
resample(learner = bagged.lrn, task = trainTask1, resampling = rdesc, show.info = FALSE,measures = logloss)
gives
Resample Result
Task: PimaIndiansDiabetes
Learner: classif.logreg
logloss.aggr: 0.49
logloss.mean: 0.49
logloss.sd: 0.02
Runtime: 0.0699999
for the first learner and
Resample Result
Task: PimaIndiansDiabetes
Learner: classif.logreg.bagged
logloss.aggr: 5.41
logloss.mean: 5.41
logloss.sd: 0.80
Runtime: 0.645
for the bagged one. So the bagged learner performs much worse.
Is there a bug, or am I doing something wrong?
Here is my sessionInfo():
R version 3.3.1 (2016-06-21)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] mlr_2.9 stringi_1.1.1 ParamHelpers_1.8 ggplot2_2.1.0 BBmisc_1.10 mlbench_2.1-1
loaded via a namespace (and not attached):
[1] Rcpp_0.12.6 magrittr_1.5 splines_3.3.1 munsell_0.4.3 lattice_0.20-33 xtable_1.8-2 colorspace_1.2-6
[8] R6_2.1.2 plyr_1.8.4 dplyr_0.5.0 tools_3.3.1 parallel_3.3.1 grid_3.3.1 checkmate_1.8.1
[15] data.table_1.9.6 gtable_0.2.0 DBI_0.4-1 htmltools_0.3.5 ggvis_0.4.3 survival_2.39-4 assertthat_0.1
[22] digest_0.6.9 tibble_1.1 Matrix_1.2-6 shiny_0.13.2 mime_0.5 parallelMap_1.3 scales_0.4.0
[29] backports_1.0.3 httpuv_1.3.3 chron_2.3-47
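There is nothing necessarily wrong with this result, although the bagging model could be specified better.
Bagging does not always give you better performance statistics, but it can help you avoid overfitting and improve accuracy.
So the reason your non-bagged model shows better performance statistics may simply be that it is overfitting, or otherwise producing more biased results with misleading performance statistics.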
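One more thing that may help when reading these numbers: log loss punishes a predicted probability of exactly 0 for the true class very heavily. As far as I can tell from the mlr documentation for makeBaggingWrapper, the wrapper estimates class probabilities from the proportion of base-model votes, so with bw.iters = 10 the predictions can only be multiples of 0.1, including 0 and 1. Treat that as an assumption to verify for your mlr version. The sketch below is plain base R with a hand-written log_loss helper and made-up probabilities; neither is taken from mlr's internals.

```r
# Binary log loss with clipping, written by hand for illustration
# (an assumption-laden sketch, not mlr's own implementation).
log_loss <- function(truth, p_pos, eps = 1e-15) {
  p <- pmin(pmax(p_pos, eps), 1 - eps)  # clip so log() stays finite
  -mean(ifelse(truth == 1, log(p), log(1 - p)))
}

# Hypothetical true labels (1 = "pos", 0 = "neg").
truth <- c(1, 1, 0, 0, 1)

# Hypothetical smooth probabilities, as a single logistic model might give.
smooth <- c(0.80, 0.35, 0.20, 0.10, 0.60)

# Hypothetical vote proportions from 10 base models: multiples of 0.1 only.
# The second case receives zero votes for its true class.
votes <- c(0.8, 0.0, 0.2, 0.1, 0.6)

loss_smooth <- log_loss(truth, smooth)
loss_votes  <- log_loss(truth, votes)

print(c(smooth = loss_smooth, votes = loss_votes))
```

A single zero-vote case is enough to make the second value many times larger than the first, even though the other four predictions are identical or nearly so. If the assumption about vote proportions holds, more iterations would make exact zeros less likely.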
However, here is a much improved bagging model specification that reduces the mean logloss by about 70%:
pacman::p_load(mlbench,mlr)
data(PimaIndiansDiabetes)
set.seed(1)
trainTask1 <- makeClassifTask(data = PimaIndiansDiabetes,target = "diabetes",positive = "pos")
bagged.lrn = makeBaggingWrapper(makeLearner("classif.logreg"),
                                bw.iters = 100,
                                bw.replace = TRUE,
                                bw.size = .6,
                                bw.feats = .5)
bagged.lrn = setPredictType(bagged.lrn,"prob")
non.bagged.lrn = setPredictType(makeLearner("classif.logreg"),"prob")
rdesc = makeResampleDesc("CV", iters = 10L)
resample(learner = non.bagged.lrn,
         task = trainTask1,
         resampling = rdesc,
         show.info = TRUE,
         measures = logloss)
resample(learner = bagged.lrn,
         task = trainTask1,
         resampling = rdesc,
         show.info = TRUE,
         measures = logloss)
The key result is
Resample Result
Task: PimaIndiansDiabetes
Learner: classif.logreg.bagged
logloss.aggr: 1.65
logloss.mean: 1.65
logloss.sd: 0.90
Runtime: 14.0544
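To make the "about 70%" figure concrete, here is the arithmetic in R, using only the logloss values reported in this thread. Note that the comparison is not strictly like-for-like: the original runs used 5-fold CV without a seed, while the retuned run used 10-fold CV with set.seed(1). The variable names are just for illustration.

```r
# Mean logloss values copied from the resample outputs in this thread.
logloss_original <- 5.41  # bagged: bw.iters = 10,  bw.size = 0.8, bw.feats = 1   (5-fold CV)
logloss_retuned  <- 1.65  # bagged: bw.iters = 100, bw.size = 0.6, bw.feats = 0.5 (10-fold CV)
logloss_plain    <- 0.49  # plain classif.logreg from the question (5-fold CV)

# Relative reduction between the two bagged specifications.
relative_reduction <- (logloss_original - logloss_retuned) / logloss_original
print(round(100 * relative_reduction, 1))  # 69.5

# How the retuned bagged model compares with the plain model.
ratio_to_plain <- logloss_retuned / logloss_plain
print(round(ratio_to_plain, 2))  # 3.37
```

So the retuned wrapper is a clear improvement over the first bagged run, but its logloss is still more than three times that of the plain logistic regression reported in the question.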