Interpreting features created by xgb.create.features() function in xgboost in R

How do I interpret the features created by xgb.create.features() from the xgboost package in R?

Here is a reproducible example:

library(xgboost)

data(mtcars)
Y = mtcars[, 9]               # "am": 0 = automatic, 1 = manual (binary label)
X = as.matrix(mtcars[, -9])   # all remaining columns as predictors
dtrain = xgb.DMatrix(data = X, label = Y)

model = xgb.train(data = dtrain,
                  params = list(objective = "binary:logistic",
                                eval_metric = "auc",
                                eta = 0.1,
                                max_depth = 6,
                                subsample = 0.8,
                                lambda = 0.1),
                  nrounds = 10,
                  verbose = 0, maximize = TRUE)

dtrain1 = xgb.create.features(model, X)
colnames(dtrain1)

'mpg' 'cyl' 'disp' 'hp' 'drat' 'wt' 'qsec' 'vs' 'gear' 'carb' 'V13' 'V14' 'V15' 'V16' 'V23' 'V24' 'V33' 'V34' 'V43' 'V44' 'V53' 'V54' 'V63' 'V64' 'V73' 'V74' 'V83' 'V84' 'V93' 'V94' 'V103' 'V104'

new_data = as.matrix(dtrain1)
new_data = data.frame(new_data)
head(new_data)

You fitted 10 trees. The total number of leaves across these 10 trees equals the number of columns V13 through V104. These leaves are your new variables.

Suppose the first tree has 4 leaves and the observation Mazda RX4 lands in leaf 2; it is then encoded as 0, 1, 0, 0 in the corresponding variables V13, V14, V15, V16. The same goes for the second tree, and so on.

From the variable names you can tell which columns belong to which tree (you can verify this yourself with the sketch after this list):
'V13' 'V14' 'V15' 'V16' - first tree
'V23' 'V24' - second tree
'V103' 'V104' - tenth tree
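
If you want to verify the encoding yourself, predict() with predleaf = TRUE returns, for every observation, the index of the leaf it ends up in for each of the 10 trees; observations that share a leaf index in tree k share the same one-hot pattern in that tree's columns. A rough sketch (treating V13-V16 as the first tree's columns, per the naming above):

# Leaf index of every observation in each of the 10 trees (32 x 10 matrix)
leaf_idx = predict(model, X, predleaf = TRUE)
dim(leaf_idx)

# First tree: each distinct leaf index maps to one of V13-V16
table(leaf_idx[, 1])

# Row 1 is Mazda RX4: exactly one of V13-V16 is 1, the rest are 0
leaf_idx[1, 1]
new_data[1, c("V13", "V14", "V15", "V16")]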

As described in the function's help page:

We found that boosted decision trees are a powerful and very convenient way to implement non-linear and tuple transformations of the kind we just described. We treat each individual tree as a categorical feature that takes as value the index of the leaf an instance ends up falling in. We use 1-of-K coding of this type of features.

For example, consider the boosted tree model in Figure 1 with 2 subtrees, where the first subtree has 3 leafs and the second 2 leafs. If an instance ends up in leaf 2 in the first subtree and leaf 1 in second subtree, the overall input to the linear classifier will be the binary vector [0, 1, 0, 1, 0], where the first 3 entries correspond to the leaves of the first subtree and last 2 to those of the second subtree.

Note that this enlarged variable set calls for another round of hyperparameter tuning and is prone to over-fitting. See ?xgb.create.features.
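
As a minimal sketch of what that extra tuning round would operate on, you can rebuild a DMatrix from the augmented matrix and fit a second model on it (the hyperparameter values below are just placeholders copied from above, not recommendations):

# Train a second model on the original columns plus the leaf-indicator columns.
# These hyperparameters would need their own tuning (e.g. with xgb.cv).
dtrain2 = xgb.DMatrix(data = dtrain1, label = Y)
model2 = xgb.train(data = dtrain2,
                   params = list(objective = "binary:logistic",
                                 eval_metric = "auc",
                                 eta = 0.1,
                                 max_depth = 6,
                                 subsample = 0.8,
                                 lambda = 0.1),
                   nrounds = 10,
                   verbose = 0)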

Finally, compare the feature importance of the model before and after adding these leaf features:
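
One way to do that, reusing model2 from the sketch above, is xgb.importance(), so you can see how the leaf-indicator columns rank against the original predictors:

# Importance of the original model vs. the model trained on the augmented set
imp_before = xgb.importance(model = model)
imp_after  = xgb.importance(model = model2)

head(imp_before)
head(imp_after)    # leaf-indicator columns (V13, V14, ...) appear here

# Optional plots for a visual comparison
xgb.plot.importance(imp_before)
xgb.plot.importance(imp_after)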