Does sbf() use metric argument to optimize model?

Passing ROC as the value of the metric argument when using the caretSBF functions

Our objective is to use the ROC summary metric for model selection while running feature selection by filtering with the sbf() function.

The BreastCancer dataset from the mlbench package is used to run train() and sbf() with metric = "Accuracy" and metric = "ROC".

We want to make sure that sbf() applies the metric argument to optimize the model, just as the train() and rfe() functions do. To this end, we planned to use the train() function together with the sbf() function: the caretSBF$fit function calls train(), and caretSBF is passed to sbfControl.

From the output, it seems the metric argument is used only for the inner resampling and not for the sbf part, i.e. the outer resampling of the output; the metric argument is not applied the way train() and rfe() use it.

Since we have used caretSBF, which uses train(), it appears that the scope of the metric argument is limited to train() and that it is therefore not passed on to sbf (see the quick inspection sketched below).
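
To check this, we can print the caretSBF functions directly (a quick sketch; the exact function bodies may differ across caret versions, and this assumes library(caret) is already loaded):

    ## caretSBF$fit wraps train(), so metric (passed through ...) only reaches train()
    print(caretSBF$fit)
    names(caretSBF)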

We would like to clarify whether sbf() uses the metric argument to optimize the model, i.e. in the outer resampling.

Here is our work on a reproducible example, showing that train() uses the metric argument with Accuracy and ROC, but that for sbf we are not sure.

I. Data Section

  ## Loading required packages   
  library(mlbench)
  library(caret)

  ## Loading `BreastCancer` Dataset from *mlbench* package   
  data("BreastCancer")

  ## Data cleaning for missing values
  # Remove rows/observation with NA Values in any of the columns
  BrC1 <- BreastCancer[complete.cases(BreastCancer),] 

  # Removing Class and Id Column and keeping just Numeric Predictors
  Num_Pred <- BrC1[,2:10]  

II. Custom Summary Function

Defining the fiveStats summary function

  fiveStats <- function(...) c(twoClassSummary(...),
                         defaultSummary(...))
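
As an optional sanity check of fiveStats, here is a small sketch on hypothetical toy data (our own addition, not part of the BreastCancer workflow; the probability columns must be named after the class levels):

    ## Toy predictions with class probabilities, just to see what fiveStats returns
    set.seed(123)
    toy <- data.frame(
      obs  = factor(sample(c("benign", "malignant"), 50, replace = TRUE),
                    levels = c("benign", "malignant")),
      pred = factor(sample(c("benign", "malignant"), 50, replace = TRUE),
                    levels = c("benign", "malignant")),
      benign = runif(50)
    )
    toy$malignant <- 1 - toy$benign
    fiveStats(toy, lev = levels(toy$obs))
    ## a named vector with ROC, Sens, Spec, Accuracy and Kappa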

III. Train Section

Defining trControl

  trCtrl <- trainControl(method="repeatedcv", number=10,
  repeats=1, classProbs = TRUE, summaryFunction = fiveStats)

火车 + 公制 = "Accuracy"

   set.seed(1)
   TR_acc <- train(Num_Pred,BrC1$Class, method="rf",metric="Accuracy",
   trControl = trCtrl,tuneGrid=expand.grid(.mtry=c(2,3,4,5)))

   TR_acc
   # Random Forest 
   # 
   # 683 samples
   #   9 predictor
   #   2 classes: 'benign', 'malignant' 
   # 
   # No pre-processing
   # Resampling: Cross-Validated (10 fold, repeated 1 times) 
   # Summary of sample sizes: 615, 615, 614, 614, 614, 615, ... 
   # Resampling results across tuning parameters:
   # 
   #   mtry  ROC        Sens       Spec       Accuracy   Kappa    
   #   2     0.9936532  0.9729798  0.9833333  0.9765772  0.9490311
   #   3     0.9936544  0.9729293  0.9791667  0.9750853  0.9457534
   #   4     0.9929957  0.9684343  0.9750000  0.9706948  0.9361373
   #   5     0.9922907  0.9684343  0.9666667  0.9677536  0.9295782
   # 
   # Accuracy was used to select the optimal model using  the largest value.
   # The final value used for the model was mtry = 2. 

火车 + 公制 = "ROC"

   set.seed(1)
   TR_roc <- train(Num_Pred,BrC1$Class, method="rf",metric="ROC",
   trControl = trCtrl,tuneGrid=expand.grid(.mtry=c(2,3,4,5)))
   TR_roc
   # Random Forest 
   # 
   # 683 samples
   #   9 predictor
   #   2 classes: 'benign', 'malignant' 
   # 
   # No pre-processing
   # Resampling: Cross-Validated (10 fold, repeated 1 times) 
   # Summary of sample sizes: 615, 615, 614, 614, 614, 615, ... 
   # Resampling results across tuning parameters:
   # 
   #   mtry  ROC        Sens       Spec       Accuracy   Kappa    
   #   2     0.9936532  0.9729798  0.9833333  0.9765772  0.9490311
   #   3     0.9936544  0.9729293  0.9791667  0.9750853  0.9457534
   #   4     0.9929957  0.9684343  0.9750000  0.9706948  0.9361373
   #   5     0.9922907  0.9684343  0.9666667  0.9677536  0.9295782
   # 
   # ROC was used to select the optimal model using  the largest value.
   # The final value used for the model was mtry = 3. 
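
A quick comparison (our own check on the objects above) makes the effect of metric on train() explicit:

    ## metric changes which mtry train() selects (2 vs 3, matching the two printouts above)
    c(Accuracy = TR_acc$bestTune$mtry, ROC = TR_roc$bestTune$mtry)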

IV. Editing caretSBF

Editing the caretSBF summary function

   caretSBF$summary <- fiveStats

V. SBF Section

Defining sbfControl

   sbfCtrl <- sbfControl(functions=caretSBF, 
   method="repeatedcv", number=10, repeats=1,
   verbose=T, saveDetails = T)
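
As a side check (our own, on the installed caret version): sbfControl() itself does not expose a metric argument; the outer-resampling performance is summarised by the functions$summary element (here caretSBF$summary, i.e. fiveStats):

    ## sbfControl() has no metric argument to optimise against
    names(formals(sbfControl))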

SBF + 公制 = "Accuracy"

   set.seed(1)
   sbf_acc <- sbf(Num_Pred, BrC1$Class,
   sbfControl = sbfCtrl,
   trControl = trCtrl, method="rf", metric="Accuracy")

   ## sbf_acc  
   sbf_acc

   # Selection By Filter
   # 
   # Outer resampling method: Cross-Validated (10 fold, repeated 1 times) 
   # 
   # Resampling performance:
   # 
   #     ROC  Sens   Spec Accuracy Kappa    ROCSD SensSD  SpecSD AccuracySD  KappaSD
   #  0.9931 0.973 0.9833   0.9766 0.949 0.006272 0.0231 0.02913    0.01226 0.02646
   # 
   # Using the training set, 9 variables were selected:
   #    Cl.thickness, Cell.size, Cell.shape, Marg.adhesion, Epith.c.size...
   # 
   # During resampling, the top 5 selected variables (out of a possible 9):
   #    Bare.nuclei (100%), Bl.cromatin (100%), Cell.shape (100%), Cell.size (100%), Cl.thickness (100%)
   # 
   # On average, 9 variables were selected (min = 9, max = 9)

   ## Class of sbf_acc
   class(sbf_acc)
   # [1] "sbf"

   ## Names of elements of sbf_acc
   names(sbf_acc)
   #  [1] "pred"         "variables"    "results"      "fit"          "optVariables"
   #  [6] "call"         "control"      "resample"     "metrics"      "times"       
   # [11] "resampledCM"  "obsLevels"    "dots"        

   ## sbf_acc fit element*  
   sbf_acc$fit
   # Random Forest 
   # 
   # 683 samples
   #   9 predictor
   #   2 classes: 'benign', 'malignant' 
   # 
   # No pre-processing
   # Resampling: Cross-Validated (10 fold, repeated 1 times) 
   # Summary of sample sizes: 615, 614, 614, 615, 615, 615, ... 
   # Resampling results across tuning parameters:
   # 
   #   mtry  ROC        Sens       Spec       Accuracy   Kappa    
   #   2     0.9933176  0.9706566  0.9833333  0.9751492  0.9460717
   #   5     0.9920034  0.9662121  0.9791667  0.9707801  0.9363708
   #   9     0.9914825  0.9684343  0.9708333  0.9693308  0.9327662
   # 
   # Accuracy was used to select the optimal model using  the largest value.
   # The final value used for the model was mtry = 2. 

   ##  Elements of sbf_acc fit  
   names(sbf_acc$fit)
   #  [1] "method"       "modelInfo"    "modelType"    "results"      "pred"        
   #  [6] "bestTune"     "call"         "dots"         "metric"       "control"     
   # [11] "finalModel"   "preProcess"   "trainingData" "resample"     "resampledCM" 
   # [16] "perfNames"    "maximize"     "yLimits"      "times"        "levels"      

   ## sbf_acc fit final Model
   sbf_acc$fit$finalModel

   # Call:
   #  randomForest(x = x, y = y, mtry = param$mtry) 
   #                Type of random forest: classification
   #                      Number of trees: 500
   # No. of variables tried at each split: 2
   # 
   #         OOB estimate of  error rate: 2.34%
   # Confusion matrix:
   #           benign malignant class.error
   # benign       431        13  0.02927928
   # malignant      3       236  0.01255230

   ## sbf_acc metric
   sbf_acc$fit$metric
   # [1] "Accuracy"

   ## sbf_acc fit best Tune*  
   sbf_acc$fit$bestTune
   #   mtry
   # 1    2

SBF + 公制 = "ROC"

   set.seed(1)
   sbf_roc <- sbf(Num_Pred, BrC1$Class,
   sbfControl = sbfCtrl,
   trControl = trCtrl, method="rf", metric="ROC")


   ## sbf_roc  
   sbf_roc

   # Selection By Filter
   # 
   # Outer resampling method: Cross-Validated (10 fold, repeated 1 times) 
   # 
   # Resampling performance:
   # 
   #     ROC  Sens   Spec Accuracy Kappa    ROCSD SensSD  SpecSD AccuracySD KappaSD
   #  0.9931 0.973 0.9833   0.9766 0.949 0.006272 0.0231 0.02913    0.01226 0.02646
   # 
   # Using the training set, 9 variables were selected:
   #    Cl.thickness, Cell.size, Cell.shape, Marg.adhesion, Epith.c.size...
   # 
   # During resampling, the top 5 selected variables (out of a possible 9):
   #    Bare.nuclei (100%), Bl.cromatin (100%), Cell.shape (100%), Cell.size (100%), Cl.thickness (100%)
   # 
   # On average, 9 variables were selected (min = 9, max = 9)

   ## Class of sbf_roc
   class(sbf_roc)
   # [1] "sbf"

   ## Names of elements of sbf_roc
   names(sbf_roc)
   #  [1] "pred"         "variables"    "results"      "fit"          "optVariables"
   #  [6] "call"         "control"      "resample"     "metrics"      "times"       
   # [11] "resampledCM"  "obsLevels"    "dots"        

   ## sbf_roc fit element*  
   sbf_roc$fit
   # Random Forest 
   # 
   # 683 samples
   #   9 predictor
   #   2 classes: 'benign', 'malignant' 
   # 
   # No pre-processing
   # Resampling: Cross-Validated (10 fold, repeated 1 times) 
   # Summary of sample sizes: 615, 614, 614, 615, 615, 615, ... 
   # Resampling results across tuning parameters:
   # 
   #   mtry  ROC        Sens       Spec       Accuracy   Kappa    
   #   2     0.9933176  0.9706566  0.9833333  0.9751492  0.9460717
   #   5     0.9920034  0.9662121  0.9791667  0.9707801  0.9363708
   #   9     0.9914825  0.9684343  0.9708333  0.9693308  0.9327662
   # 
   # ROC was used to select the optimal model using  the largest value.
   # The final value used for the model was mtry = 2. 

   ##  Elements of sbf_roc fit  
   names(sbf_roc$fit)
   #  [1] "method"       "modelInfo"    "modelType"    "results"      "pred"        
   #  [6] "bestTune"     "call"         "dots"         "metric"       "control"     
   # [11] "finalModel"   "preProcess"   "trainingData" "resample"      "resampledCM" 
   # [16] "perfNames"    "maximize"     "yLimits"      "times"        "levels"      

   ## sbf_roc fit final Model
   sbf_roc$fit$finalModel

   # Call:
   #  randomForest(x = x, y = y, mtry = param$mtry) 
   #                Type of random forest: classification
   #                      Number of trees: 500
   # No. of variables tried at each split: 2
   # 
   #         OOB estimate of  error rate: 2.34%
   # Confusion matrix:
   #           benign malignant class.error
   # benign       431        13  0.02927928
   # malignant      3       236  0.01255230

   ## sbf_roc metric
   sbf_roc$fit$metric
   # [1] "ROC"

   ## sbf_roc fit best Tune  
   sbf_roc$fit$bestTune
   #   mtry
   # 1    2
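
Comparing the two sbf() runs directly (our own check on the objects above), the outer-resampling summary is identical for both runs, while the metric stored in the inner train() fit differs:

    ## identical outer resampling results, different inner train() metrics
    all.equal(sbf_acc$results, sbf_roc$results)
    c(sbf_acc$fit$metric, sbf_roc$fit$metric)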

Does sbf() use the metric argument to optimize the model? If yes, what metric does sbf() use by default? If sbf() can use the metric argument, how is it set to ROC?

Thank you.

sbf does not use a metric to optimize anything (unlike rfe); all that sbf does is carry out a feature-selection step before the model is called. You define the filter, of course, but there is no way to tune the filter with sbf, so no metric is needed to guide that step.
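
To see the fixed filter that caretSBF bundles, you can print its score and filter elements (the exact bodies may differ across caret versions):

    ## caretSBF ships a scoring function and a filter with a fixed rule;
    ## neither has a tuning parameter, so no metric is involved in this step
    names(caretSBF)
    caretSBF$score
    caretSBF$filter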

Using sbf(x, y, metric = "ROC") will pass metric = "ROC" to whatever modeling function you are using (and it is designed to work with train when caretSBF is used). This happens because sbf has no metric argument:

    names(formals(caret:::sbf.default))
    # [1] "x"          "y"          "sbfControl" "..."