xgboost in Python is not returning feature importances despite what the documentation says
According to the xgboost documentation (https://xgboost.readthedocs.io/en/latest/python/python_api.html#module-xgboost.training), xgboost returns feature importances:
feature_importances_
Feature importances property
Note
Feature importance is defined only for tree boosters. Feature importance is only defined when the decision tree model is chosen as base learner (booster=gbtree). It is not defined for other base learner types, such as linear learners (booster=gblinear).
Returns: feature_importances_
Return type: array of shape [n_features]
However, this does not seem to be the case, as the following toy example shows:
import seaborn as sns
import xgboost as xgb
mpg = sns.load_dataset('mpg')
toy = mpg[['mpg', 'cylinders', 'displacement', 'horsepower', 'weight',
'acceleration']]
toy = toy.sample(frac=1)
N = toy.shape[0]
N1 = int(N/2)
toy_train = toy.iloc[:N1, :]
toy_test = toy.iloc[N1:, :]
toy_train_x = toy_train.iloc[:, 1:]
toy_train_y = toy_train.iloc[:, 0]  # target is the first column ('mpg')
toy_test_x = toy_test.iloc[:, 1:]
toy_test_y = toy_test.iloc[:, 0]
max_depth = 6
eta = 0.3
subsample = 0.8
colsample_bytree = 0.7
alpha = 0.1
params = {'booster': 'gbtree',
          'objective': 'reg:squarederror',  # 'reg:linear' is a deprecated alias in recent xgboost versions
          'max_depth': max_depth, 'eta': eta,
          'subsample': subsample, 'colsample_bytree': colsample_bytree, 'alpha': alpha}
dtrain_toy = xgb.DMatrix(data = toy_train_x , label = toy_train_y)
dtest_toy = xgb.DMatrix(data = toy_test_x, label = toy_test_y)
watchlist = [(dtest_toy, 'eval'), (dtrain_toy, 'train')]
xg_reg_toy = xgb.train(params = params, dtrain = dtrain_toy, num_boost_round = 1000, evals = watchlist, \
early_stopping_rounds = 20)
xg_reg_toy.feature_importances_
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-378-248f7887e307> in <module>()
----> 1 xg_reg_toy.feature_importances_
AttributeError: 'Booster' object has no attribute 'feature_importances_'
You are using the Learning API (xgb.train), but you are referring to the Scikit-Learn API. Only the Scikit-Learn API has the attribute feature_importances_.
For those who, like me, are not using the Scikit-Learn API, the reason is now clear.
From here I was able to get the feature importances:
clf.get_score()
In addition, for a more visual representation, I looked here:
from xgboost import plot_importance
plot_importance(clf, max_num_features=10)
This produces a bar chart of the importances, limited to the (optional) max_num_features most important features.