Feature importance 'gain' in XGBoost

I want to understand how the feature importance in xgboost is calculated by 'gain'. From https://towardsdatascience.com/be-careful-when-interpreting-your-features-importance-in-xgboost-6e16132588e7:

‘Gain’ is the improvement in accuracy brought by a feature to the branches it is on. The idea is that before adding a new split on a feature X to the branch there was some wrongly classified elements, after adding the split on this feature, there are two new branches, and each of these branch is more accurate (one branch saying if your observation is on this branch then it should be classified as 1, and the other branch saying the exact opposite).

In scikit-learn, the feature importance is calculated by the gini impurity/information gain reduction of each node after splitting on a variable, i.e. the weighted impurity of the node minus the weighted impurity of the left child node minus the weighted impurity of the right child node (see also: https://stats.stackexchange.com/questions/162162/relative-variable-importance-for-boosting).
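For illustration, the weighted impurity decrease described above can be written out as follows. This is a rough sketch, not scikit-learn's actual implementation, and all numbers in the example are made up:

def gini(p):
    # Gini impurity for a list of class probabilities
    return 1.0 - sum(pi ** 2 for pi in p)

def impurity_decrease(n_node, imp_node, n_left, imp_left, n_right, imp_right, n_total):
    # weighted impurity of the node minus the weighted impurities of its children,
    # scaled by the fraction of samples that reach the node
    return (n_node / n_total) * (
        imp_node
        - (n_left / n_node) * imp_left
        - (n_right / n_node) * imp_right
    )

# a node holding 100 of 200 samples, split into children of 60 and 40 samples
print(impurity_decrease(100, gini([0.5, 0.5]),
                        60, gini([0.9, 0.1]),
                        40, gini([0.2, 0.8]),
                        200))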

I am wondering if xgboost also uses this approach using information gain or accuracy, as stated in the quote above. I've tried to dig into xgboost's code and found this method (irrelevant parts already cut out):

def get_score(self, fmap='', importance_type='gain'):
    trees = self.get_dump(fmap, with_stats=True)

    importance_type += '='
    fmap = {}
    gmap = {}
    for tree in trees:
        for line in tree.split('\n'):
            # look for the opening square bracket
            arr = line.split('[')
            # if no opening bracket (leaf node), ignore this line
            if len(arr) == 1:
                continue

            # look for the closing bracket, extract only info within that bracket
            fid = arr[1].split(']')

            # extract gain or cover from string after closing bracket
            g = float(fid[1].split(importance_type)[1].split(',')[0])

            # extract feature name from string before closing bracket
            fid = fid[0].split('<')[0]

            if fid not in fmap:
                # if the feature hasn't been seen yet
                fmap[fid] = 1
                gmap[fid] = g
            else:
                fmap[fid] += 1
                gmap[fid] += g

    return gmap
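For context, here is a minimal example of the dump lines this method parses and the aggregated result. The toy dataset and parameters are made up; note that in newer xgboost versions 'gain' in get_score is reported as the average gain per split, while 'total_gain' corresponds to the sum computed above:

import numpy as np
import xgboost as xgb

# tiny made-up dataset, only to produce a dump with statistics
X = np.random.rand(100, 3)
y = (X[:, 0] > 0.5).astype(int)

bst = xgb.train({'objective': 'binary:logistic'},
                xgb.DMatrix(X, label=y),
                num_boost_round=3)

# with_stats=True adds gain= and cover= to every split line, e.g.
# 0:[f0<0.5] yes=1,no=2,missing=1,gain=40.2,cover=25
print(bst.get_dump(with_stats=True)[0])

# per-feature gain importance aggregated over all trees
print(bst.get_score(importance_type='gain'))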

So 'gain' is extracted from the dump file of each booster, but how is it actually measured?

Good question. The gain is calculated with the following formula:
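From the XGBoost tutorial linked below, the loss reduction (gain) of a candidate split is:

Gain = \frac{1}{2}\left[\frac{G_L^2}{H_L+\lambda} + \frac{G_R^2}{H_R+\lambda} - \frac{(G_L+G_R)^2}{H_L+H_R+\lambda}\right] - \gamma

where G_L, G_R are the sums of the first-order gradients and H_L, H_R the sums of the second-order gradients (Hessians) of the loss over the instances falling into the left and right child, \lambda is the L2 regularization weight, and \gamma is the complexity cost of adding a leaf.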

For a deep explanation read: https://xgboost.readthedocs.io/en/latest/tutorials/model.html
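A minimal numeric sketch of that formula (the gradient and Hessian sums are made-up numbers; lambda_ and gamma_ mirror the reg_lambda and gamma training parameters):

def split_gain(G_left, H_left, G_right, H_right, lambda_=1.0, gamma_=0.0):
    # structure score of a leaf holding gradient sum G and hessian sum H
    def score(G, H):
        return G * G / (H + lambda_)
    # gain = score of the two children minus score of the unsplit node, minus the leaf penalty
    return 0.5 * (score(G_left, H_left)
                  + score(G_right, H_right)
                  - score(G_left + G_right, H_left + H_right)) - gamma_

print(split_gain(G_left=-4.0, H_left=6.0, G_right=5.0, H_right=8.0))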