Interpreting the data.table returned by the xgb.importance() function of xgboost in R

I am having trouble interpreting the data.table returned by xgboost's xgb.importance() function, and I would appreciate help in understanding the meaning and intuition behind its columns.

To make things reproducible and concrete, I provide the following code:

library(data.table)
library(dplyr)
library(xgboost)

library(ISLR)

data(Auto)

Auto = Auto %>% mutate(

    origin = ifelse(origin == 2, 1, 0)

)

Auto = Auto %>% select(-name)

library(caTools)

split = sample.split(Auto$origin, SplitRatio = 0.80)

train = subset(Auto, split == TRUE)

test = subset(Auto, split == FALSE)

X_train = as.matrix(train %>% select(-origin))
X_test = as.matrix(test %>% select(-origin))
Y_train = train$origin
Y_test = test$origin

positive = sum(Y_train == 1)
negative = sum(Y_train == 0)
Total = length(Y_train)
weight = ifelse(Y_train == 1, Total/positive, Total/negative)


# note: observation weights must be attached to the DMatrix;
# xgb.train() itself does not accept a `weight` argument
dtrain = xgb.DMatrix(data = X_train, label = Y_train, weight = weight)

dtest = xgb.DMatrix(data = X_test, label = Y_test)

model = xgb.train(data = dtrain,
                  verbose = 2,
                  params = list(objective = "binary:logistic"),
                  nrounds = 20)

y_pred = predict(model, X_test)

table(y_pred > 0.5, Y_test)

important_variables = xgb.importance(model = model, feature_names = colnames(X_train), data = X_train, label = Y_train)

important_variables

dim(important_variables)

The first rows of the important_variables data.table are as follows:

Feature        Split    Gain         Cover        Frequency    RealCover   RealCover %
displacement   121.5    0.132621660  0.057075548  0.015075377  17          0.31481481
displacement   190.5    0.096984485  0.106824987  0.050251256  17          0.31481481
displacement   128      0.069083692  0.093517155  0.045226131  28          0.51851852
weight         2931.5   0.054731622  0.034017383  0.015075377   9          0.16666667
mpg            30.75    0.036373687  0.015353348  0.010050251  44          0.81481481
acceleration   19.8     0.030658707  0.043746304  0.015075377  50          0.92592593
displacement   169.5    0.028471073  0.035860862  0.020100503  20          0.37037037
displacement   113.5    0.028467685  0.017729564  0.020100503  27          0.50000000
horsepower     59       0.028450597  0.022879182  0.025125628  22          0.40740741
weight         2670.5   0.028335853  0.020309028  0.010050251   6          0.11111111
acceleration   15.6     0.022315984  0.026517622  0.015075377  51          0.94444444
weight         1947.5   0.020687204  0.003763738  0.005025126   7          0.12962963
acceleration   14.75    0.018458042  0.013565059  0.010050251  53          0.98148148
acceleration   19.65    0.018395565  0.006194124  0.010050251  53          0.98148148

According to the documentation, the columns are:

Features name of the features as provided in feature_names or already present in the model dump;

Gain contribution of each feature to the model. For boosted tree model, each gain of each feature of each tree is taken into account, then average per feature to give a vision of the entire model. Highest percentage means important feature to predict the label used for the training (only available for tree models);

Cover metric of the number of observation related to this feature (only available for tree models);

Weight percentage representing the relative number of times a feature have been taken into trees.

While Feature and Gain have an obvious meaning, the columns Cover, Frequency, RealCover and RealCover % are hard for me to interpret.

From the first row of the important_variables table, we learn that displacement has a Split of 121.5, a Gain of 0.1326, a Cover of 0.0571, a Frequency of 0.0151, a RealCover of 17 and a RealCover % of 0.3148.

Trying to decipher what these numbers mean, I ran the code below:

train %>% filter(displacement > 121.5) %>%
    summarize(Count = n(), Frequency = Count/nrow(train))

  Count  Frequency
    190  0.6070288

train %>% filter(displacement > 121.5) %>% group_by(origin) %>%
    summarize(Count = n(), Frequency = Count/nrow(train))

  origin  Count  Frequency
       0    183  0.58466454
       1      7  0.02236422

train %>% filter(displacement < 121.5) %>%
    summarize(Count = n(), Frequency = Count/nrow(train))

  Count  Frequency
    123  0.3929712

train %>% filter(displacement < 121.5) %>% group_by(origin) %>%
    summarize(Count = n(), Frequency = Count/nrow(train))

  origin  Count  Frequency
       0     76  0.2428115
       1     47  0.1501597

Nonetheless, I am still confused.

Any advice would be greatly appreciated.

Frequency is the percentage of splits that involve a particular feature, relative to the total number of splits. You can sanity-check this by observing that the Frequency values of all variables sum to 1:

sum(important_variables$Frequency)  
[1] 1  

It shows how often a feature was selected for a split. While not as sophisticated as Gain, it can also be used as a variable-importance measure.

This also explains why you cannot reproduce the Frequency numbers by running summary operations on the training data: Frequency is computed on the trained xgboost model, not on the data.
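A minimal, self-contained sketch of this idea, recomputing Frequency directly from the fitted trees via xgb.model.dt.tree() (hedged: it trains a toy model on the built-in mtcars data, with `am` and the chosen predictors as illustrative stand-ins for the Auto data above):

```r
library(xgboost)
library(data.table)

# toy binary-classification model on mtcars, mirroring the setup above
X <- as.matrix(mtcars[, c("mpg", "wt", "hp")])
y <- mtcars$am
dtoy <- xgb.DMatrix(data = X, label = y)
toy  <- xgb.train(data = dtoy, nrounds = 5,
                  params = list(objective = "binary:logistic"))

# walk the fitted trees: Frequency is the share of internal (non-leaf)
# nodes that split on each feature
tree_dt <- xgb.model.dt.tree(model = toy)
splits  <- tree_dt[Feature != "Leaf"]
freq    <- splits[, .(N = .N), by = Feature]
freq[, Frequency := N / sum(N)]
freq[order(-Frequency)]
```

Up to ordering, these Frequency values should agree with `xgb.importance(model = toy)$Frequency` for the same toy model.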

Cover and its derivatives are not as straightforward; see the linked answer for a detailed explanation.
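For a rough sense of where the raw numbers live, here is a hedged sketch: in the tree dump each node carries a Cover value (commonly described as the sum of the second-order gradients of the observations routed through that node, which for binary:logistic is p(1 - p) per observation), and the importance table's Cover is each feature's normalized share of these node Covers. The toy mtcars model below is an illustrative stand-in for the Auto model above:

```r
library(xgboost)
library(data.table)

# toy binary-classification model, standing in for the Auto model
X <- as.matrix(mtcars[, c("mpg", "wt", "hp")])
y <- mtcars$am
dtoy <- xgb.DMatrix(data = X, label = y)
toy  <- xgb.train(data = dtoy, nrounds = 5,
                  params = list(objective = "binary:logistic"))

# sum the node-level Cover values per feature over all trees,
# then normalize to a share, as xgb.importance() reports it
tree_dt <- xgb.model.dt.tree(model = toy)
cov_dt  <- tree_dt[Feature != "Leaf", .(Cover = sum(Cover)), by = Feature]
cov_dt[, Cover := Cover / sum(Cover)]
cov_dt[order(-Cover)]
```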