Understanding xgboost cross validation and AUC output results

I have the following XGBoost cross-validation (CV) model.

xgboostModelCV <- xgb.cv(data =  dtrain, 
                             nrounds = 20, 
                             nfold = 3, 
                             metrics = "auc", 
                             verbose = TRUE, 
                             "eval_metric" = "auc",
                             "objective" = "binary:logistic", 
                             "max.depth" = 6, 
                             "eta" = 0.01,                               
                             "subsample" = 0.5, 
                             "colsample_bytree" = 1,
                             print_every_n = 1, 
                             "min_child_weight" = 1,
                             booster = "gbtree",
                             early_stopping_rounds = 10,
                             watchlist = watchlist,
                             seed = 1234)

My question is about the model's output and nfold, which I set to 3.

The evaluation log looks like this:

   iter train_auc_mean train_auc_std test_auc_mean test_auc_std
1     1      0.8852290  0.0023585703     0.8598630  0.005515424
2     2      0.9015413  0.0018569007     0.8792137  0.003765109
3     3      0.9081027  0.0014307577     0.8859040  0.005053600
4     4      0.9108463  0.0011838160     0.8883130  0.004324113
5     5      0.9130350  0.0008863908     0.8904100  0.004173123
6     6      0.9143187  0.0009514359     0.8910723  0.004372844
7     7      0.9151723  0.0010543653     0.8917300  0.003905284
8     8      0.9162787  0.0010344935     0.8929013  0.003582747
9     9      0.9173673  0.0010539116     0.8935753  0.003431949
10   10      0.9178743  0.0011498505     0.8942567  0.002955511
11   11      0.9182133  0.0010825702     0.8944377  0.003051411
12   12      0.9185767  0.0011846632     0.8946267  0.003026969
13   13      0.9186653  0.0013352629     0.8948340  0.002526793
14   14      0.9190500  0.0012537195     0.8954053  0.002636388
15   15      0.9192453  0.0010967155     0.8954127  0.002841402
16   16      0.9194953  0.0009818501     0.8956447  0.002783787
17   17      0.9198503  0.0009541517     0.8956400  0.002590862
18   18      0.9200363  0.0009890185     0.8957223  0.002580398
19   19      0.9201687  0.0010323405     0.8958790  0.002508695
20   20      0.9204030  0.0009725742     0.8960677  0.002581329

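I set nrounds = 20 and nfold = 3, so shouldn't I get 60 results rather than 20?

Or is the output above, as the column names suggest, the mean AUC across folds at each round?

So at round 1, train_auc_mean for the training set is 0.8852290, which would be the average over the 3 cross-validation folds?

And therefore, if I plot these AUC scores, I am plotting the mean AUC of the 3-fold cross-validation?

I just want to make sure I have understood this correctly.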

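You are right: each row of the output is the mean AUC across the folds for that round, which is why you get 20 rows and not 60. However, if you want to extract the AUC of each individual fold for the best/last iteration, you can proceed as follows: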

An example using the Sonar data set from mlbench:

library(xgboost)
library(tidyverse)
library(mlbench)
library(pROC)  # provides roc() and auc(), used further down

data(Sonar)

xgb.train.data <- xgb.DMatrix(as.matrix(Sonar[,1:60]), label = as.numeric(Sonar$Class)-1)
param <- list(objective = "binary:logistic")

Set prediction = TRUE in xgb.cv so that the out-of-fold predictions are returned:

model.cv <- xgb.cv(params = param,
                   data = xgb.train.data,
                   nrounds = 50,
                   early_stopping_rounds = 10,
                   nfold = 3,
                   prediction = TRUE,
                   eval_metric = "auc")

Now go over the folds and join the predictions with the true labels and the corresponding row indices:

z <- lapply(model.cv$folds, function(x){
  pred <- model.cv$pred[x]
  true <- (as.numeric(Sonar$Class)-1)[x]
  index <- x
  out <- data.frame(pred, true, index)
  out
})
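
For intuition about what the loop above relies on, here is a minimal base-R sketch of a 3-fold partition. It assumes, as the documentation describes, that model.cv$folds holds the held-out row indices of each fold, so every row appears in exactly one fold and model.cv$pred carries one out-of-fold prediction per row. The names n, fold_id and folds_demo are hypothetical, and the plain random split is a simplification (xgb.cv stratifies by label by default).

```r
set.seed(42)  # only to make this illustrative split reproducible

n <- 208  # number of rows in the Sonar data set

# Assign each row to one of 3 folds at random (not stratified: a simplification)
fold_id <- sample(rep(1:3, length.out = n))

# List of held-out row indices per fold, similar in shape to model.cv$folds
folds_demo <- split(seq_len(n), fold_id)

# Fold sizes: one fold of 70 rows and two of 69
lengths(folds_demo)

# Every row index appears in exactly one fold
all_idx <- sort(unlist(folds_demo, use.names = FALSE))
identical(all_idx, seq_len(n))
```

Because the folds do not overlap, indexing the prediction vector by a fold's indices picks out exactly the predictions made while that fold was held out.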

Name the folds and compute the AUC for each one:

names(z) <- paste("folds", 1:3, sep = "_")

z %>%
  bind_rows(.id = "id") %>%
  group_by(id) %>%
  summarise(auroc = roc(true, pred) %>%
           auc())
#output
# A tibble: 3 x 2
  id      auroc
  <chr>   <dbl>
1 folds_1 0.944
2 folds_2 0.900
3 folds_3 0.899

The mean of these values is the same as the mean AUC at the best iteration:

z %>%
  bind_rows(.id = "id") %>%
  group_by(id) %>%
  summarise(auroc = roc(true, pred) %>%
           auc()) %>%
  pull(auroc) %>%
  mean
#output
[1] 0.9143798

model.cv$evaluation_log[model.cv$best_iteration,]
#output
   iter train_auc_mean train_auc_std test_auc_mean test_auc_std
1:   48              1             0       0.91438   0.02092817
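
The same relation can be checked by hand from the rounded per-fold AUCs in the tibble printed above. The sketch below is only arithmetic on those three printed numbers; the variable names are hypothetical. It also compares two ways of computing the spread, on the assumption (suggested by the numbers, not confirmed here from the xgboost source) that test_auc_std is the population standard deviation over folds.

```r
# Rounded per-fold test AUCs copied from the tibble printed above
fold_auc <- c(folds_1 = 0.944, folds_2 = 0.900, folds_3 = 0.899)

# One row of the evaluation log summarises all folds for a single round
auc_mean <- mean(fold_auc)

# Population standard deviation: divide by the number of folds, not n - 1
auc_std_pop <- sqrt(mean((fold_auc - auc_mean)^2))

# Sample standard deviation, as returned by sd(), for comparison
auc_std_sample <- sd(fold_auc)

round(c(mean = auc_mean, std_pop = auc_std_pop, std_sample = auc_std_sample), 4)
```

The mean comes out at about 0.914 and the population form at about 0.021, close to the reported test_auc_mean of 0.91438 and test_auc_std of 0.02092817; the small gaps come from the rounding in the tibble. The sample form from sd() is noticeably larger, at about 0.026.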

You can of course do much more, such as plotting the ROC curve for each fold, and so on.
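
As a rough sketch of the per-fold plot, the block below overlays one ROC curve per fold with pROC. To stay self-contained it uses simulated scores in a hypothetical list z_demo that only mimics the shape of z (a named list of data frames with pred and true columns); with the real objects you would pass z instead. The helper make_fold is likewise made up for this illustration.

```r
library(pROC)

set.seed(1)  # only for the simulated stand-in data below

# Hypothetical stand-in for `z`: three folds, each with labels and scores
make_fold <- function(n) {
  true <- rbinom(n, size = 1, prob = 0.5)
  pred <- plogis(rnorm(n, mean = ifelse(true == 1, 1, -1)))
  data.frame(pred = pred, true = true)
}
z_demo <- setNames(lapply(rep(70, 3), make_fold),
                   paste("folds", 1:3, sep = "_"))

# One ROC object per fold; levels and direction are fixed so that
# no fold is silently flipped by the automatic direction detection
rocs <- lapply(z_demo, function(d) {
  roc(d$true, d$pred, levels = c(0, 1), direction = "<", quiet = TRUE)
})
fold_aucs <- sapply(rocs, function(r) as.numeric(auc(r)))

# Draw the first curve, then overlay the remaining folds on the same axes
plot(rocs[[1]], col = 1, main = "Out-of-fold ROC curves")
for (i in 2:length(rocs)) plot(rocs[[i]], col = i, add = TRUE)
legend("bottomright",
       legend = sprintf("%s (AUC %.3f)", names(rocs), fold_aucs),
       col = seq_along(rocs), lty = 1)
```

Fixing levels and direction is a deliberate choice here: it keeps the three curves comparable even if one fold happens to score worse than chance.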