{caret}xgTree: There were missing values in resampled performance measures

Question

I am trying to run a 5-fold XGBoost model on this dataset. When I run the following code:

  train_control<- trainControl(method="cv", 
                           search = "random", 
                           number=5,
                           verboseIter=TRUE)

  # Train Models 
  xgb.mod<- train(Vote_perc~.,
              data=forkfold, 
              trControl=train_control, 
              method="xgbTree", 
              family=binomial())

I get the following warning:

Warning message:
In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,  :
  There were missing values in resampled performance measures.

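In addition, the predict function runs, but all predictions are the same number. I suspect this is an intercept-only model, but I am not sure. Also, when I remove the

search="random"

argument, the code runs correctly. I would like to run a random search so that I can isolate which hyperparameters might work best, but every time I try, I get the warning. What am I missing? Thanks!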

Answer 1

Here is one approach you could take with your data:

Load the data:

forkfold  <- read.csv("forkfold.csv", row.names = 1)

The problem here is that the outcome variable is 0 in 97% of the cases, and in the remaining 3% it is very close to zero.

length(forkfold$Vote_perc)
#output
7069

sum(forkfold$Vote_perc != 0)
#output 
212

You described this as a classification problem, so I will treat it as one by converting the outcome to a binary variable:

forkfold$Vote_perc <- ifelse(forkfold$Vote_perc != 0,
                             "one",
                             "zero")

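Since the set is highly imbalanced, using Accuracy as the selection metric is out of the question. Here I will try to maximize Sensitivity + Specificity by defining a custom summary function, as described here. The function adds a Dist metric: the distance between the current (Spec, Sens) pair and the perfect model at (1, 1).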

fourStats <- function (data, lev = levels(data$obs), model = NULL) {
  out <- c(twoClassSummary(data, lev = levels(data$obs), model = NULL))
  coords <- matrix(c(1, 1, out["Spec"], out["Sens"]), 
                   ncol = 2, 
                   byrow = TRUE)
  colnames(coords) <- c("Spec", "Sens")
  rownames(coords) <- c("Best", "Current")
  c(out, Dist = dist(coords)[1])
}
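
To make the Dist metric concrete, here is a minimal sketch in base R that computes the same Euclidean distance without caret. The helper name dist_to_perfect is hypothetical, and the sensitivity and specificity values are illustrative rather than taken from this model.

```r
# Hypothetical helper: Euclidean distance from (Spec, Sens) to the ideal point (1, 1).
dist_to_perfect <- function(sens, spec) {
  sqrt((1 - spec)^2 + (1 - sens)^2)
}

# Illustrative values only.
d_good    <- dist_to_perfect(sens = 0.90, spec = 0.80)
d_trivial <- dist_to_perfect(sens = 0, spec = 1)  # model that never predicts the positive class

# The same quantity computed the way fourStats does it, via stats::dist().
coords  <- matrix(c(1, 1, 0.80, 0.90), ncol = 2, byrow = TRUE)
d_check <- as.numeric(dist(coords))

print(c(good = d_good, trivial = d_trivial, check = d_check))
```

A model that only ever predicts the majority class sits at distance 1 from the ideal point, so any model that balances the two rates scores lower, which is why this metric is minimized.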

I will specify this function in trainControl:

train_control <- trainControl(method = "cv", 
                              search = "random", 
                              number = 5,
                              verboseIter=TRUE,
                              classProbs = TRUE,
                              savePredictions = "final",
                              summaryFunction = fourStats)

set.seed(1)
xgb.mod <- train(Vote_perc~.,
                 data = forkfold, 
                 trControl = train_control, 
                 method = "xgbTree", 
                 tuneLength = 50,
                 metric = "Dist",
                 maximize = FALSE,
                 scale_pos_weight = sum(forkfold$Vote_perc == "zero")/sum(forkfold$Vote_perc == "one"))

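I use the Dist metric defined earlier in the fourStats summary function. This metric should be minimized, hence maximize = FALSE. I use a random search over the tuning space, and 50 random sets of hyperparameter values will be tested (tuneLength = 50).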

I also set the scale_pos_weight parameter of the xgboost function. From the help of ?xgboost:

scale_pos_weight, [default=1] Control the balance of positive and negative weights, useful for unbalanced classes. A typical value to consider: sum(negative cases) / sum(positive cases) See Parameters Tuning for more discussion. Also see Higgs Kaggle competition demo for examples: R, py1, py2, py3

I defined it as recommended: sum(negative cases) / sum(positive cases).
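
As a rough check of what that ratio comes to for this data, the sketch below rebuilds the class labels from the counts reported earlier (7069 rows, 212 of them non-zero). It assumes that "one" is the class treated as positive; the vector y is a stand-in for forkfold$Vote_perc, not the real column.

```r
# Stand-in for forkfold$Vote_perc, rebuilt from the counts reported above.
y <- c(rep("one", 212), rep("zero", 7069 - 212))

n_pos <- sum(y == "one")   # positive (minority) class, assumed to be "one"
n_neg <- sum(y == "zero")  # negative (majority) class

# Recommended starting value: sum(negative cases) / sum(positive cases).
spw <- n_neg / n_pos

print(round(spw, 2))               # weight ratio, roughly 32
print(round(mean(y == "one"), 5))  # share of positive cases
```

The share of positives printed here should agree with the Prevalence line in the confusion matrix output.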

After the model is trained, it picks the hyperparameters that minimize Dist.

Evaluate the confusion matrix on the hold-out predictions:

caret::confusionMatrix(xgb.mod$pred$pred, xgb.mod$pred$obs)

Confusion Matrix and Statistics

          Reference
Prediction  one zero
      one   195  430
      zero   17 6427

               Accuracy : 0.9368          
                 95% CI : (0.9308, 0.9423)
    No Information Rate : 0.97            
    P-Value [Acc > NIR] : 1               

                  Kappa : 0.4409          
 Mcnemar's Test P-Value : <2e-16          

            Sensitivity : 0.91981         
            Specificity : 0.93729         
         Pos Pred Value : 0.31200         
         Neg Pred Value : 0.99736         
             Prevalence : 0.02999         
         Detection Rate : 0.02759         
   Detection Prevalence : 0.08841         
      Balanced Accuracy : 0.92855         

       'Positive' Class : one

I would say that is not bad.
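
For readers who want to see where these figures come from, the following sketch recomputes the main statistics by hand from the four counts in the table above. Only base R is used; the variable names are my own.

```r
# Counts copied from the confusion matrix above (rows = prediction, columns = reference).
cm <- matrix(c(195, 17, 430, 6427), nrow = 2,
             dimnames = list(Prediction = c("one", "zero"),
                             Reference  = c("one", "zero")))

tp <- cm["one", "one"]    # non-zero cases correctly flagged
fn <- cm["zero", "one"]   # non-zero cases missed
fp <- cm["one", "zero"]   # zero cases flagged by mistake
tn <- cm["zero", "zero"]  # zero cases correctly left alone

sens    <- tp / (tp + fn)      # Sensitivity
spec    <- tn / (tn + fp)      # Specificity
ppv     <- tp / (tp + fp)      # Pos Pred Value
bal_acc <- (sens + spec) / 2   # Balanced Accuracy

print(round(c(Sens = sens, Spec = spec, PPV = ppv, BalAcc = bal_acc), 5))
```

The low Pos Pred Value reflects the imbalance: even with about 6% of the zeros misclassified, false positives outnumber true positives.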

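You can do better if you tune the cutoff threshold of the predictions; how to do this during the tuning process is described here. You can also use the out-of-fold predictions to tune the cutoff threshold. Here I will show how to do that with the pROC library: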

library(pROC)

plot(roc(xgb.mod$pred$obs, xgb.mod$pred$one),
     print.thres = TRUE)

The threshold shown on the plot maximizes Sens + Spec.
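
A cutoff of this kind can also be found without a plot. The sketch below is a base R illustration on synthetic scores, not the model's real out-of-fold predictions; best_cutoff is a hypothetical helper that scans every observed score and keeps the one with the largest Sens + Spec. pROC may report a slightly different number for the same data, since it can place thresholds between observed scores.

```r
set.seed(1)
# Synthetic data for illustration only: positives tend to receive higher scores.
obs  <- factor(c(rep("one", 50), rep("zero", 450)), levels = c("one", "zero"))
prob <- c(rbeta(50, 4, 2), rbeta(450, 1, 5))

# Hypothetical helper: cutoff that maximizes Sens + Spec,
# where a case is called positive when its score is above the cutoff.
best_cutoff <- function(obs, prob, positive = "one") {
  cutoffs <- sort(unique(prob))
  sens <- sapply(cutoffs, function(t) mean(prob[obs == positive] > t))
  spec <- sapply(cutoffs, function(t) mean(prob[obs != positive] <= t))
  best <- which.max(sens + spec)
  c(threshold = cutoffs[best], sens = sens[best], spec = spec[best])
}

res <- best_cutoff(obs, prob)
print(round(res, 3))
```

With the real model, the same helper could be applied to xgb.mod$pred$obs and xgb.mod$pred$one.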

Evaluate the out-of-fold performance using this threshold:

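# recent caret versions require factors with matching levels
caret::confusionMatrix(factor(ifelse(xgb.mod$pred$one > 0.369, "one", "zero"),
                              levels = levels(xgb.mod$pred$obs)),
                       xgb.mod$pred$obs)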
#output
Confusion Matrix and Statistics

          Reference
Prediction  one zero
      one   200  596
      zero   12 6261

               Accuracy : 0.914           
                 95% CI : (0.9072, 0.9204)
    No Information Rate : 0.97            
    P-Value [Acc > NIR] : 1               

                  Kappa : 0.3668          
 Mcnemar's Test P-Value : <2e-16          

            Sensitivity : 0.94340         
            Specificity : 0.91308         
         Pos Pred Value : 0.25126         
         Neg Pred Value : 0.99809         
             Prevalence : 0.02999         
         Detection Rate : 0.02829         
   Detection Prevalence : 0.11260         
      Balanced Accuracy : 0.92824         

       'Positive' Class : one

So out of 212 non-zero entities, you detected 200.

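To perform better, you could try pre-processing the data. Or use a better hyperparameter search routine, such as the mlrMBO package intended for use with mlr. Or perhaps change the learner (though I doubt you can beat xgboost here).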

Note that if obtaining high sensitivity is not the most important thing, using "Kappa" as the selection metric might give a more satisfactory model.
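
To illustrate how Kappa weighs this trade-off, the sketch below computes Cohen's kappa by hand for the two confusion matrices shown so far (cutoff 0.5 and cutoff 0.369). The helper cohen_kappa is hypothetical and written in base R; it is not a caret function.

```r
# Hypothetical helper: Cohen's kappa from a square table of counts.
cohen_kappa <- function(cm) {
  n  <- sum(cm)
  po <- sum(diag(cm)) / n                     # observed agreement
  pe <- sum(rowSums(cm) * colSums(cm)) / n^2  # agreement expected by chance
  (po - pe) / (1 - pe)
}

# Counts copied from the two confusion matrices above (filled column by column).
cm_cut_050  <- matrix(c(195, 17, 430, 6427), nrow = 2)  # default cutoff 0.5
cm_cut_0369 <- matrix(c(200, 12, 596, 6261), nrow = 2)  # cutoff 0.369

k_050  <- cohen_kappa(cm_cut_050)
k_0369 <- cohen_kappa(cm_cut_0369)

print(round(c(cutoff_0.5 = k_050, cutoff_0.369 = k_0369), 4))
```

Kappa is higher at the 0.5 cutoff even though sensitivity is lower there, because the extra false positives at 0.369 cost more agreement than the few additional detections gain.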

As a final note, let's check the performance of a model with the default scale_pos_weight = 1, using the hyperparameters already selected:

set.seed(1)
xgb.mod2 <- train(Vote_perc~.,
                  data = forkfold, 
                  trControl = train_control, 
                  method = "xgbTree", 
                  tuneGrid = data.frame(nrounds = 498,
                                        max_depth = 3,
                                        eta = 0.008833468,
                                        gamma = 4.131242,
                                        colsample_bytree = 0.4233169,
                                        min_child_weight = 3,
                                        subsample = 0.6212512),
                  metric = "Dist",
                  maximize = FALSE,
                  scale_pos_weight = 1)

caret::confusionMatrix(xgb.mod2$pred$pred, xgb.mod2$pred$obs)
#output
Confusion Matrix and Statistics

          Reference
Prediction  one zero
      one    94   21
      zero  118 6836

               Accuracy : 0.9803          
                 95% CI : (0.9768, 0.9834)
    No Information Rate : 0.97            
    P-Value [Acc > NIR] : 3.870e-08       

                  Kappa : 0.5658          
 Mcnemar's Test P-Value : 3.868e-16       

            Sensitivity : 0.44340         
            Specificity : 0.99694         
         Pos Pred Value : 0.81739         
         Neg Pred Value : 0.98303         
             Prevalence : 0.02999         
         Detection Rate : 0.01330         
   Detection Prevalence : 0.01627         
      Balanced Accuracy : 0.72017         

       'Positive' Class : one

So it is much worse at the default threshold of 0.5.

The optimal threshold:

plot(roc(xgb.mod2$pred$obs, xgb.mod2$pred$one),
     print.thres = TRUE)

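The plot gives 0.037, compared with the 0.369 obtained when scale_pos_weight was set as recommended. However, at their respective optimal thresholds, both approaches yield identical predictions.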

Tags: r, hyperparameters, r-caret, xgboost