Understanding xgboost cross validation and AUC output results
I have the following XGBoost CV model:
xgboostModelCV <- xgb.cv(data = dtrain,
nrounds = 20,
nfold = 3,
metrics = "auc",
verbose = TRUE,
"eval_metric" = "auc",
"objective" = "binary:logistic",
"max.depth" = 6,
"eta" = 0.01,
"subsample" = 0.5,
"colsample_bytree" = 1,
print_every_n = 1,
"min_child_weight" = 1,
booster = "gbtree",
early_stopping_rounds = 10,
watchlist = watchlist,
seed = 1234)
My question concerns the model's output and the nfold parameter. I set nfold to 3, and the evaluation log output looks like this:
iter train_auc_mean train_auc_std test_auc_mean test_auc_std
1 1 0.8852290 0.0023585703 0.8598630 0.005515424
2 2 0.9015413 0.0018569007 0.8792137 0.003765109
3 3 0.9081027 0.0014307577 0.8859040 0.005053600
4 4 0.9108463 0.0011838160 0.8883130 0.004324113
5 5 0.9130350 0.0008863908 0.8904100 0.004173123
6 6 0.9143187 0.0009514359 0.8910723 0.004372844
7 7 0.9151723 0.0010543653 0.8917300 0.003905284
8 8 0.9162787 0.0010344935 0.8929013 0.003582747
9 9 0.9173673 0.0010539116 0.8935753 0.003431949
10 10 0.9178743 0.0011498505 0.8942567 0.002955511
11 11 0.9182133 0.0010825702 0.8944377 0.003051411
12 12 0.9185767 0.0011846632 0.8946267 0.003026969
13 13 0.9186653 0.0013352629 0.8948340 0.002526793
14 14 0.9190500 0.0012537195 0.8954053 0.002636388
15 15 0.9192453 0.0010967155 0.8954127 0.002841402
16 16 0.9194953 0.0009818501 0.8956447 0.002783787
17 17 0.9198503 0.0009541517 0.8956400 0.002590862
18 18 0.9200363 0.0009890185 0.8957223 0.002580398
19 19 0.9201687 0.0010323405 0.8958790 0.002508695
20 20 0.9204030 0.0009725742 0.8960677 0.002581329
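In other words, my reading is that each row summarises the per-fold AUCs at that iteration. With made-up numbers (the fold values here are hypothetical, purely to illustrate the mean/std columns):

```r
# Hypothetical per-fold test AUCs at one iteration (made-up values):
fold_auc <- c(0.8652, 0.8598, 0.8546)
mean(fold_auc)  # what I understand test_auc_mean to be
sd(fold_auc)    # what I understand test_auc_std to be
```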
However, since I set nrounds = 20 and nfold = 3 for cross validation, shouldn't there be 60 results instead of 20? Or, as the column names suggest, is each row the mean AUC score for that round? So at nround = 1, the train_auc_mean value of 0.8852290 for the training set would be the average over the 3 cross-validation folds? And therefore, if I plot these AUC scores, I am plotting the mean AUC score across the 3 folds?
Just want to make sure everything is clear.
You are correct: the output is the mean AUC across the folds. However, if you would like to extract the individual fold AUCs for the best/last iteration, you can proceed as follows.
An example using the Sonar dataset from mlbench:
library(xgboost)
library(tidyverse)
library(mlbench)
library(pROC)  # for roc() and auc() used below
data(Sonar)
xgb.train.data <- xgb.DMatrix(as.matrix(Sonar[,1:60]), label = as.numeric(Sonar$Class)-1)
param <- list(objective = "binary:logistic")
Set prediction = TRUE in xgb.cv:
model.cv <- xgb.cv(param = param,
data = xgb.train.data,
nrounds = 50,
early_stopping_rounds = 10,
nfold = 3,
prediction = TRUE,
eval_metric = "auc")
Now inspect the folds and tie the predictions to the true labels and their corresponding indices:
z <- lapply(model.cv$folds, function(x){
pred <- model.cv$pred[x]
true <- (as.numeric(Sonar$Class)-1)[x]
index <- x
out <- data.frame(pred, true, index)
out
})
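The fold-tying step above can be sketched with toy stand-ins (no xgboost required; all data here is made up, with `preds` and `folds` standing in for `model.cv$pred` and `model.cv$folds`):

```r
set.seed(1)
labels <- rbinom(9, 1, 0.5)               # stand-in for the true 0/1 labels
preds  <- runif(9)                        # stand-in for the out-of-fold predictions
folds  <- split(1:9, rep(1:3, each = 3))  # stand-in for model.cv$folds
# Tie each fold's predictions to its labels and row indices:
z <- lapply(folds, function(x) {
  data.frame(pred = preds[x], true = labels[x], index = x)
})
str(z[[2]])  # fold 2 holds rows 4:6 with their predictions and labels
```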
Name the folds:
names(z) <- paste("folds", 1:3, sep = "_")
z %>%
bind_rows(.id = "id") %>%
group_by(id) %>%
summarise(auroc = roc(true, pred) %>%
auc())
#output
# A tibble: 3 x 2
id auroc
<chr> <dbl>
1 folds_1 0.944
2 folds_2 0.900
3 folds_3 0.899
The mean of these values is the same as the mean AUC at the best iteration:
z %>%
bind_rows(.id = "id") %>%
group_by(id) %>%
summarise(auroc = roc(true, pred) %>%
auc()) %>%
pull(auroc) %>%
mean
#output
[1] 0.9143798
model.cv$evaluation_log[model.cv$best_iteration,]
#output
iter train_auc_mean train_auc_std test_auc_mean test_auc_std
1: 48 1 0 0.91438 0.02092817
You can of course do more, such as plotting the AUC curve for each fold, and so on.
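As a minimal sketch of plotting the per-round mean test AUC with its fold-to-fold spread (the `ev_log` data frame here mimics the first rows of the question's evaluation log; with a real model you would use `xgboostModelCV$evaluation_log` instead):

```r
# Toy stand-in for the evaluation log from the question:
ev_log <- data.frame(iter = 1:3,
                     test_auc_mean = c(0.8598630, 0.8792137, 0.8859040),
                     test_auc_std  = c(0.005515424, 0.003765109, 0.005053600))
plot(ev_log$iter, ev_log$test_auc_mean, type = "b",
     xlab = "boosting round", ylab = "mean test AUC over the 3 folds")
# dashed lines showing +/- one standard deviation across folds:
lines(ev_log$iter, ev_log$test_auc_mean + ev_log$test_auc_std, lty = 2)
lines(ev_log$iter, ev_log$test_auc_mean - ev_log$test_auc_std, lty = 2)
```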