Feature importance 'gain' in XGBoost
I would like to understand how feature importance by 'gain' is calculated in xgboost. From https://towardsdatascience.com/be-careful-when-interpreting-your-features-importance-in-xgboost-6e16132588e7:
‘Gain’ is the improvement in accuracy brought by a feature to the branches it is on. The idea is that before adding a new split on a feature X to the branch there was some wrongly classified elements, after adding the split on this feature, there are two new branches, and each of these branch is more accurate (one branch saying if your observation is on this branch then it should be classified as 1, and the other branch saying the exact opposite).
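In scikit-learn, feature importance is calculated from the reduction in Gini impurity (or the information gain) at each node after splitting on a variable, i.e. the weighted impurity of the node minus the weighted impurity of the left child node minus the weighted impurity of the right child node (see also: https://stats.stackexchange.com/questions/162162/relative-variable-importance-for-boosting)
I wonder whether xgboost also uses this approach, based on information gain or on accuracy as stated in the quote above. I tried to dig into the xgboost code and found this method (irrelevant parts already cut):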
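As a minimal sketch of the impurity decrease described above (this is not scikit-learn's actual implementation; the function names `gini` and `impurity_decrease` and the toy class counts are made up for illustration), the computation for a single split could look like this. The optional `n_total` argument reflects the fact that scikit-learn additionally weights each node by the fraction of all training samples that reach it.

```python
def gini(counts):
    """Gini impurity of a node, given its per-class sample counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


def impurity_decrease(parent, left, right, n_total=None):
    """Weighted Gini impurity decrease of one split.

    parent, left, right: per-class sample counts of the node and its children.
    n_total: number of samples in the whole training set (defaults to the
             node size, i.e. the node is treated as the root).
    """
    n_node = sum(parent)
    n_total = n_node if n_total is None else n_total
    w_left = sum(left) / n_node
    w_right = sum(right) / n_node
    decrease = gini(parent) - w_left * gini(left) - w_right * gini(right)
    return (n_node / n_total) * decrease


# Hypothetical root node with 5 samples of each class,
# split into children with class counts [4, 1] and [1, 4].
decrease = impurity_decrease([5, 5], [4, 1], [1, 4])
print(decrease)
```

Summing these decreases over every node that splits on a given feature, across all trees, gives that feature's impurity-based importance before normalization.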
def get_score(self, fmap='', importance_type='gain'):
    trees = self.get_dump(fmap, with_stats=True)
    importance_type += '='
    fmap = {}
    gmap = {}
    for tree in trees:
        for line in tree.split('\n'):
            # look for the opening square bracket
            arr = line.split('[')
            # if no opening bracket (leaf node), ignore this line
            if len(arr) == 1:
                continue
            # look for the closing bracket, extract only info within that bracket
            fid = arr[1].split(']')
            # extract gain or cover from string after closing bracket
            g = float(fid[1].split(importance_type)[1].split(',')[0])
            # extract feature name from string before closing bracket
            fid = fid[0].split('<')[0]
            if fid not in fmap:
                # if the feature hasn't been seen yet
                fmap[fid] = 1
                gmap[fid] = g
            else:
                fmap[fid] += 1
                gmap[fid] += g
    return gmap
So 'gain' is extracted from the dump file of each booster, but how is it actually measured?
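Good question. The gain is calculated using the following formula:

$$
\text{Gain} = \frac{1}{2}\left[\frac{G_L^2}{H_L+\lambda} + \frac{G_R^2}{H_R+\lambda} - \frac{(G_L+G_R)^2}{H_L+H_R+\lambda}\right] - \gamma
$$

Here $G_L, H_L$ and $G_R, H_R$ are the sums of the first-order and second-order gradients of the loss over the instances in the left and right child, $\lambda$ is the L2 regularization term on the leaf weights, and $\gamma$ is the penalty for adding a leaf.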
For an in-depth explanation, read: https://xgboost.readthedocs.io/en/latest/tutorials/model.html
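To make the split-gain formula from that tutorial concrete, here is a small sketch in plain Python. It is not xgboost's code: the function name `split_gain` and the toy data are made up for illustration, and it assumes a squared-error loss of the form ½(y − ŷ)² with a current prediction of 0, so that each instance has first-order gradient g = ŷ − y = −y and second-order gradient h = 1. Whether the number stored in a booster dump matches this expression exactly is not something the sketch checks.

```python
def split_gain(g_left, h_left, g_right, h_right, lam=1.0, gamma=0.0):
    """Split gain following the formula in the XGBoost tutorial.

    g_*: sum of first-order gradients in the child
    h_*: sum of second-order gradients in the child
    lam: L2 regularization on leaf weights (lambda)
    gamma: penalty for adding a leaf
    """
    def score(g, h):
        # structure score of a single leaf: G^2 / (H + lambda)
        return g * g / (h + lam)

    parent = score(g_left + g_right, h_left + h_right)
    return 0.5 * (score(g_left, h_left) + score(g_right, h_right) - parent) - gamma


# Hypothetical toy node: targets of the instances sent left and right.
y_left = [1.0, 1.0, 1.0]
y_right = [-1.0, -1.0]

# Assumed squared-error loss with prediction 0: g = -y, h = 1 per instance.
g_l, h_l = -sum(y_left), float(len(y_left))
g_r, h_r = -sum(y_right), float(len(y_right))

gain = split_gain(g_l, h_l, g_r, h_r, lam=1.0, gamma=0.0)
print(gain)
```

Under this formula the gain measures how much the regularized objective improves when the node is split, so it depends on the gradients of the loss rather than on classification accuracy or Gini impurity.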