XGBoost decision tree selection

Question

I have a question about which decision tree I should choose from an XGBoost model.

I will use the following code as an example.

# import packages
import xgboost as xgb
import matplotlib.pyplot as plt

# create DMatrix (X is the feature matrix, y is the target)
df_dmatrix = xgb.DMatrix(data=X, label=y)

# set up parameter dictionary
# ("reg:squarederror" is the current name of the deprecated "reg:linear")
params = {"objective": "reg:squarederror", "max_depth": 2}

# train the model
xg_reg = xgb.train(params=params, dtrain=df_dmatrix, num_boost_round=10)

# plot the tree (n is the index of the tree to draw)
xgb.plot_tree(xg_reg, num_trees=n)  # my question relates to this line

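I create 10 trees in the xg_reg model, and I can plot any one of them by setting n in the last line of my code to the index of that tree.

My question is: how can I know which tree explains the data set best? Is it always the last one? Or should I determine which features I want to include in the tree, and then choose the tree which contains those features?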

Answer 1

My question is how I can know which tree explains the data set best?

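XGBoost is an implementation of gradient boosted decision trees (GBDT). Roughly speaking, a GBDT model is a sequence of trees in which each tree improves on the prediction of the trees before it by fitting their residuals. So the data is explained best once the last tree, the one at index n - 1 (index 9 in your example, since tree indices start at 0), has been added. Keep in mind that this refers to the cumulative prediction of trees 0 through n - 1: a single tree plotted in isolation shows only one correction step, not the whole model.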

You can read more about GBDT here.

Or should I determine which features I want to include in the tree, and then choose the tree which contains the features?

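All trees are trained on the same base features; what changes at each boosting iteration is the residuals they are fitted to. So you cannot determine the best tree this way. This video gives an intuitive explanation of residuals.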
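
To make the role of the residuals concrete, here is a rough pure-Python sketch of residual boosting with depth-1 trees (stumps) on made-up one-dimensional data. It is a simplified illustration of the idea, not XGBoost's actual algorithm, which also uses second-order gradients and regularization. The helper names `fit_stump` and `mse` and the learning rate of 0.5 are assumptions chosen for this example.

```python
# Toy 1-D regression data (made up for illustration), sorted by x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [1.2, 1.9, 3.2, 3.8, 5.1, 6.3, 6.8, 8.1]


def fit_stump(xs, targets):
    """Fit a depth-1 regression tree by choosing the split with the lowest squared error."""
    best = None
    for threshold in xs[:-1]:  # splitting after the last point would leave one side empty
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((t - left_mean) ** 2 for t in left)
               + sum((t - right_mean) ** 2 for t in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def mse(preds, ys):
    return sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(ys)


learning_rate = 0.5
base = sum(ys) / len(ys)            # initial prediction: the mean of y
preds = [base] * len(ys)
trees, history = [], [mse(preds, ys)]

for _ in range(10):
    # Every tree sees the same feature x; only the target (the residuals) changes.
    residuals = [y - p for y, p in zip(ys, preds)]
    tree = fit_stump(xs, residuals)
    trees.append(tree)
    preds = [p + learning_rate * tree(x) for p, x in zip(preds, xs)]
    history.append(mse(preds, ys))

for i, err in enumerate(history):
    print(f"after {i} trees: MSE = {err:.4f}")
```

Each stump only describes what the ensemble before it still got wrong, which is why looking at the features used by one tree does not tell you how well that tree explains the data.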

python

decision-tree

xgboost