mlr: Unexpected result when comparing output from getFilteredFeatures and generateFilterValuesData

Question

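I am using two different methods to get the features selected by a filter. I expected both methods to return the same values, but they do not, and I don't understand why. The reason for using the second method is that it gives me access to the scores used to select the features, not just the names of the selected features.

The filter is a univariate model score filter that uses a Cox model to measure performance and selects the top 5 features. I create a resample instance so that both methods use the same samples in each fold.

The first method is the usual one: use makeFilterWrapper to wrap the filter around a lasso model, and call getFilteredFeatures through the extract option of resample. My understanding is that getFilteredFeatures returns the features selected by the filter before they are passed to the lasso model.

In the second method, I use subsetTask to create the same subtask that getFilteredFeatures would use in each CV fold, and then call generateFilterValuesData to get the values generated by the filter. The top 5 values in this list for each fold should match the features returned by getFilteredFeatures for that fold, but they do not. Why is this?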

library(survival)
#> Warning: package 'survival' was built under R version 3.5.3
library(mlr)
#> Loading required package: ParamHelpers

data(veteran)
task_id = "VET"
vet.task <- makeSurvTask(id = task_id, data = veteran, target = c("time", "status"))
vet.task <- createDummyFeatures(vet.task)

inner = makeResampleDesc("CV", iters=2, stratify=TRUE)  # Tuning
outer = makeResampleDesc("CV", iters=2, stratify=TRUE)  # Benchmarking

set.seed(24601)
resinst = makeResampleInstance(desc=outer, task=vet.task)

cox.lrn <- makeLearner(cl="surv.coxph", id = "coxph", predict.type="response")
lasso.lrn  <- makeLearner(cl="surv.cvglmnet", id = "lasso", predict.type="response", alpha = 1, nfolds=5)

filt.uni.lrn = 
  makeFilterWrapper(
    lasso.lrn, 
    fw.method="univariate.model.score", 
    perf.learner=cox.lrn,
    fw.abs = 5
  )
#Method 1
res = resample(learner=filt.uni.lrn, task=vet.task, resampling=resinst, measures=list(cindex), extract=getFilteredFeatures)
#> Resampling: cross-validation
#> Measures:             cindex
#> [Resample] iter 1:    0.7458904
#> [Resample] iter 2:    0.6575813
#> 
#> Aggregated Result: cindex.test.mean=0.7017359
#> 
res$extract
#> [[1]]
#> [1] "karno"              "diagtime"           "celltype.squamous" 
#> [4] "celltype.smallcell" "celltype.adeno"    
#> 
#> [[2]]
#> [1] "karno"              "diagtime"           "age"               
#> [4] "celltype.smallcell" "celltype.large"

#Method 2
for (i in 1:2) {
  subt = subsetTask(task=vet.task, subset = resinst$train.inds[[i]])
  print(generateFilterValuesData(subt, method="univariate.model.score", perf.learner=cox.lrn))
}
#> FilterValues:
#> Task: VET
#>                 name    type                 method     value
#> 2              karno numeric univariate.model.score 0.6387665
#> 7 celltype.smallcell numeric univariate.model.score 0.6219512
#> 8     celltype.adeno numeric univariate.model.score 0.5700000
#> 5              prior numeric univariate.model.score 0.5456522
#> 6  celltype.squamous numeric univariate.model.score 0.5316206
#> 4                age numeric univariate.model.score 0.5104603
#> 1                trt numeric univariate.model.score 0.5063830
#> 3           diagtime numeric univariate.model.score 0.4760956
#> 9     celltype.large numeric univariate.model.score 0.3766520
#> FilterValues:
#> Task: VET
#>                 name    type                 method     value
#> 2              karno numeric univariate.model.score 0.6931330
#> 9     celltype.large numeric univariate.model.score 0.6264822
#> 7 celltype.smallcell numeric univariate.model.score 0.5269058
#> 6  celltype.squamous numeric univariate.model.score 0.5081967
#> 8     celltype.adeno numeric univariate.model.score 0.5064655
#> 4                age numeric univariate.model.score 0.4980237
#> 1                trt numeric univariate.model.score 0.4646018
#> 3           diagtime numeric univariate.model.score 0.4547619
#> 5              prior numeric univariate.model.score 0.4527897

Created on 2019-10-02 by the reprex package (v0.3.0)
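
To make the mismatch concrete, the fold 1 results can be compared directly. The sketch below uses only base R and hard-codes the values printed above; the variable names (fv1, wrapper1, top5) are made up for this illustration.

```r
# Fold 1 filter values, copied from the generateFilterValuesData() output above.
fv1 <- data.frame(
  name = c("karno", "celltype.smallcell", "celltype.adeno", "prior",
           "celltype.squamous", "age", "trt", "diagtime", "celltype.large"),
  value = c(0.6387665, 0.6219512, 0.5700000, 0.5456522, 0.5316206,
            0.5104603, 0.5063830, 0.4760956, 0.3766520),
  stringsAsFactors = FALSE
)

# Fold 1 features from getFilteredFeatures(), copied from res$extract[[1]].
wrapper1 <- c("karno", "diagtime", "celltype.squamous",
              "celltype.smallcell", "celltype.adeno")

# Top 5 feature names by descending filter value (Method 2).
top5 <- fv1$name[order(fv1$value, decreasing = TRUE)][1:5]

# Features that appear in one method's top 5 but not in the other's.
only_in_method2 <- setdiff(top5, wrapper1)
only_in_method1 <- setdiff(wrapper1, top5)
print(only_in_method2)  # "prior"
print(only_in_method1)  # "diagtime"
```

In fold 1 the two methods agree on four features and differ on one: Method 2 ranks prior in the top 5, while Method 1 returns diagtime instead.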

Answer 1

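You are mixing up two things here.

Case 1 (nested resampling)

The features selected in the outer fold of the nested resampling are determined by the best-performing fold of the inner resampling.

Fold 1 (inner) -> compute the top 5 features with the filter -> compute model performance
Fold 2 (inner) -> compute the top 5 features with the filter -> compute model performance
Check which inner fold performs best (say fold 1) -> take the top 5 features from that fold for model fitting in the outer fold

So the filter values are never actually computed on the outer fold, only on the inner folds. You are essentially asking: "give me the top 5 features according to the filter from the inner loop and only train the model on these in the outer fold". Because the filter values are not recomputed in the outer fold, you only get the feature names back, not the values.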
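
For reference, here is a minimal sketch of what a nested setup with a filter can look like in mlr, where an inner loop tunes how many features the filter keeps (fw.abs) and an outer loop evaluates the result. It is not the setup from the question: iris.task, classif.rpart, the anova.test filter and the candidate values for fw.abs are stand-ins chosen only to keep the example small, and it assumes the rpart package is installed.

```r
library(mlr)

set.seed(1)
inner <- makeResampleDesc("CV", iters = 2)  # tuning loop
outer <- makeResampleDesc("CV", iters = 2)  # evaluation loop

# Filter wrapped around a learner (stand-ins, not the Cox/lasso pair above).
filt.lrn <- makeFilterWrapper(makeLearner("classif.rpart"),
                              fw.method = "anova.test")

# Let the inner loop choose how many top-ranked features to keep.
ps <- makeParamSet(makeDiscreteParam("fw.abs", values = c(1, 2, 3)))
tuned.lrn <- makeTuneWrapper(filt.lrn, resampling = inner, par.set = ps,
                             control = makeTuneControlGrid(),
                             show.info = FALSE)

# Keep the fitted models so the selected features can be inspected afterwards.
res <- resample(tuned.lrn, iris.task, outer, models = TRUE, show.info = FALSE)

# Names of the features retained by the filter in each outer iteration.
feats <- lapply(res$models,
                function(m) getFilteredFeatures(m$learner.model$next.model))
print(feats)
```

As with res$extract in the question, this returns feature names per outer iteration rather than filter scores.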

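Case 2 (direct calculation of filter values)

Here, you generate the filter values directly on the two outer folds. Because the observations differ from those in the inner folds of the nested resampling (case 1), the learner behind the filter (the Cox model passed as perf.learner) will arrive at different filter scores, since the model fit happens on different observations, and possibly at a different ranking.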
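
The point that scores depend on which observations are used can be seen in a toy example. The sketch below is base R only and is not the Cox-based filter: the synthetic data and the score_features() helper (absolute correlation with the outcome) are assumptions made up for illustration.

```r
set.seed(1)
n <- 60

# Synthetic data (an assumption for illustration, not the veteran data).
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
d$y <- 0.5 * d$x1 + 0.4 * d$x2 + rnorm(n)

# Hypothetical stand-in for a univariate filter:
# score each feature by its absolute correlation with y, sorted descending.
score_features <- function(df) {
  feats <- setdiff(names(df), "y")
  s <- sapply(feats, function(f) abs(cor(df[[f]], df$y)))
  sort(s, decreasing = TRUE)
}

idx_a <- sample(n, 30)                 # one subset of observations
idx_b <- setdiff(seq_len(n), idx_a)    # the complementary subset

# Same features, same scoring rule, different observations.
scores_a <- score_features(d[idx_a, ])
scores_b <- score_features(d[idx_b, ])
print(scores_a)
print(scores_b)
```

The two printed vectors contain the same feature names with different score values, and the ordering of the features can differ between the two subsets as well.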

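IIUC, your idea is that in a nested resampling setting the filter values are generated again for each outer fold. This is not the case, and there would be no benefit in doing so, because the features used to fit the model have already been selected during the optimization in the inner folds.

For the outer fold, the model is trained only on the selected features suggested by the inner loop. The same logic applies to tuning: "Give me the best hyperparameters across all folds from the inner loop (I'll tell you how to do so) and then fit a model on the outer fold using these settings".

Maybe it helps to carry this logic over to tuning: you would not call tuneParams() on each outer fold independently and assume you get back the same hyperparameters that the inner optimization of the nested resampling would produce, would you?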
