How to jointly use makeFeatSelWrapper and resample in mlr

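I am using the mlr package in R to fit classification models for a binary problem. For each model, I perform cross-validation with embedded feature selection using the "selectFeatures" function. As output, I retrieve the mean AUC over the test sets and the predictions. To do so, after getting some advice (Get predictions on test sets in MLR), I combined the "makeFeatSelWrapper" function with the "resample" function. The goal seems to be achieved, but the results are strange. With logistic regression as the classifier, I get an AUC of 0.5, which would mean that no variables were selected. This is unexpected, because I obtained an AUC of 0.9824432 with this classifier using the method mentioned in the linked question. With a neural network as the classifier, I get the error message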

Error in sum(x) : invalid 'type' (list) of argument

What is going wrong?

Example code follows:

# 1. Find a synthetic dataset for supervised learning (two classes)
###################################################################

install.packages("mlbench")
library(mlbench)
data(BreastCancer)

# generate 1000 rows, 21 quantitative candidate predictors and 1 target variable 
p<-mlbench.waveform(1000) 

# convert list into dataframe
dataset<-as.data.frame(p)

# drop third class to get 2 classes
dataset2  = subset(dataset, classes != 3)

# 2. Perform cross validation with embedded feature selection using logistic regression
#######################################################################################  

library(BBmisc)
library(nnet)
library(mlr)

# Choice of data 
mCT <- makeClassifTask(data =dataset2, target = "classes")

# Choice of algorithm i.e. logistic regression
mL <- makeLearner("classif.logreg", predict.type = "prob")

# Choice of cross-validations for folds 

outer = makeResampleDesc("CV", iters = 10,stratify = TRUE)

# Choice of feature selection method

ctrl = makeFeatSelControlSequential(method = "sffs", maxit = NA,alpha = 0.001)

# Choice of hold-out sampling between training and test within the fold

inner = makeResampleDesc("Holdout",stratify = TRUE)

lrn = makeFeatSelWrapper(mL, resampling = inner, control = ctrl)
r = resample(lrn, mCT, outer, extract = getFeatSelResult,measures = list(mlr::auc,mlr::acc,mlr::brier),models=TRUE)

# 3. Perform cross validation with embedded feature selection using neural network
##################################################################################

library(BBmisc)
library(nnet)
library(mlr)

# Choice of data 
mCT <- makeClassifTask(data =dataset2, target = "classes")

# Choice of algorithm i.e. neural network
mL <- makeLearner("classif.nnet", predict.type = "prob")

# Choice of cross-validations for folds 

outer = makeResampleDesc("CV", iters = 10,stratify = TRUE)

# Choice of feature selection method

ctrl = makeFeatSelControlSequential(method = "sffs", maxit = NA,alpha = 0.001)

# Choice of sampling between training and test within the fold

inner = makeResampleDesc("Holdout",stratify = TRUE)

lrn = makeFeatSelWrapper(mL, resampling = inner, control = ctrl)
r = resample(lrn, mCT, outer, extract = getFeatSelResult,measures = list(mlr::auc,mlr::acc,mlr::brier),models=TRUE)

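If you run the logistic regression part of your code several times, you should also run into the Error in sum(x) : invalid 'type' (list) of argument error. What I find odd, though, is that fixing a particular seed before resampling (e.g. set.seed(1)) does not make the error appear, or fail to appear, reliably.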

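The error occurs in internal mlr code that prints the output of the feature selection to the console. A very simple workaround is to suppress that output by passing show.info = FALSE to makeFeatSelWrapper (see the code below). This removes the error, but whatever causes it might have other consequences, although it may well be that the bug only affects the printing code.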

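When I run your code, I only get AUC values above 0.90. Here is your logistic regression code, slightly reorganized and with the workaround applied. I have added droplevels() on dataset2 to remove the now-unused level 3 from the factor, although this is unrelated to the workaround.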

library(mlbench)
library(mlr)
data(BreastCancer)

p<-mlbench.waveform(1000)
dataset<-as.data.frame(p)
dataset2  = subset(dataset, classes != 3)
dataset2  <- droplevels(dataset2  )    

mCT <- makeClassifTask(data =dataset2, target = "classes")
ctrl = makeFeatSelControlSequential(method = "sffs", maxit = NA,alpha = 0.001)
mL <- makeLearner("classif.logreg", predict.type = "prob")
inner = makeResampleDesc("Holdout",stratify = TRUE)
lrn = makeFeatSelWrapper(mL, resampling = inner, control = ctrl, show.info = FALSE)
# uncomment this for the error to appear again. Might need to run the code a couple of times to see the error
# lrn = makeFeatSelWrapper(mL, resampling = inner, control = ctrl)
outer = makeResampleDesc("CV", iters = 10,stratify = TRUE)
r = resample(lrn, mCT, outer, extract = getFeatSelResult,measures = list(mlr::auc,mlr::acc,mlr::brier),models=TRUE)
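
The effect of the droplevels() call can be seen on a toy data frame. This is a minimal base R sketch; the data frame `toy` and its values are made up for illustration and are not part of the waveform data.

```r
# Hypothetical toy data with a three-level factor, mimicking the `classes` column
toy <- data.frame(x = c(0.1, 0.4, 0.7, 0.9),
                  classes = factor(c(1, 2, 3, 1)))

# subset() removes the rows of class 3 but keeps "3" as a declared level
kept <- subset(toy, classes != 3)
levels(kept$classes)
table(kept$classes)

# droplevels() removes levels that no longer occur in the data
dropped <- droplevels(kept)
levels(dropped$classes)
```

After subset() alone the factor still declares three levels, with a count of zero for level 3; after droplevels() only the two levels that are actually present remain.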

Edit: I reported an issue and created a pull request with a fix.
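
Once resample() has finished, the test-set AUC, the selected features per fold and the predictions can be read from the returned object. The following is a sketch rather than a definitive recipe: it rebuilds the task from the code above, but uses 3 outer folds instead of 10 and drops the Brier score, both assumptions made here only to keep the run short. It relies on the aggr, measures.test, extract and pred elements of the resample result.

```r
library(mlbench)
library(mlr)

set.seed(1)

# Same two-class waveform data as in the code above
dataset  <- as.data.frame(mlbench.waveform(1000))
dataset2 <- droplevels(subset(dataset, classes != 3))

mCT   <- makeClassifTask(data = dataset2, target = "classes")
mL    <- makeLearner("classif.logreg", predict.type = "prob")
ctrl  <- makeFeatSelControlSequential(method = "sffs", maxit = NA, alpha = 0.001)
inner <- makeResampleDesc("Holdout", stratify = TRUE)
lrn   <- makeFeatSelWrapper(mL, resampling = inner, control = ctrl, show.info = FALSE)

# Assumption: 3 outer folds instead of 10, purely to shorten the run
outer <- makeResampleDesc("CV", iters = 3, stratify = TRUE)
r <- resample(lrn, mCT, outer, extract = getFeatSelResult,
              measures = list(mlr::auc, mlr::acc), show.info = FALSE)

# Mean test performance over the outer folds
r$aggr

# Test performance for each outer fold
r$measures.test

# Features selected in each outer fold (one FeatSelResult per fold)
selected <- lapply(r$extract, function(res) res$x)
selected

# Test-set predictions, including the class probabilities
head(r$pred$data)
```

If the per-fold AUC values in r$measures.test are all close to 0.5, looking at `selected` shows directly whether any features were chosen in those folds.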