How to set 'gain' as the feature importance measure in feature_importances_ for the LightGBM classifier in sklearn :: LGBMClassifier()
I am building a binary classifier model with LGBMClassifier from LightGBM, along these lines:
# LightGBM model
from lightgbm import LGBMClassifier

clf = LGBMClassifier(
    nthread=4,
    n_estimators=10000,
    learning_rate=0.005,
    num_leaves=45,
    colsample_bytree=0.8,
    subsample=0.4,
    subsample_freq=1,
    max_depth=20,
    reg_alpha=0.5,
    reg_lambda=0.5,
    min_split_gain=0.04,
    min_child_weight=0.05,
    random_state=0,
    silent=-1,
    verbose=-1)
Next, I fit the model on my training data:
clf.fit(train_x, train_y, eval_set=[(train_x, train_y), (valid_x, valid_y)],
        eval_metric='auc', verbose=100, early_stopping_rounds=200)
fold_importance_df = pd.DataFrame()
fold_importance_df["feature"] = feats
fold_importance_df["importance"] = clf.feature_importances_
Output:
feature importance
feature13 1108
feature21 1104
feature11 774
So far so good. Now I want the feature importance measure for this model, so I read it from the feature_importances_ attribute (by default it reports feature importance based on split). While split tells me how many times each feature is used in splits, I think gain would give me a better sense of how important a feature actually is.
The Python API of the LightGBM Booster class (https://lightgbm.readthedocs.io/en/latest/Python-API.html?highlight=importance) mentions:
feature_importance(importance_type='split', iteration=-1)

Parameters: importance_type (string, optional (default="split")) – If "split", result contains numbers of times the feature is used in a model. If "gain", result contains total gains of splits which use the feature.

Returns: result – Array with feature importances.

Return type: numpy array
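For reference, this is how gain-based importance comes out of the native (non-sklearn) API. A minimal sketch, assuming the train_x and train_y from above; the parameter values are placeholders:

import lightgbm as lgb

# Native API: lgb.train returns a Booster object directly,
# which has the feature_importance() method quoted above.
train_data = lgb.Dataset(train_x, label=train_y)
booster = lgb.train({'objective': 'binary'}, train_data, num_boost_round=100)
print(booster.feature_importance(importance_type='gain'))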
However, the sklearn API for LightGBM's LGBMClassifier() does not mention anything equivalent; it only documents this attribute:
feature_importances_
array of shape = [n_features] – The feature importances (the higher, the more important the feature).
- My question is: how do I get gain-based feature importance from the sklearn version, i.e. from LGBMClassifier()?
feature_importance() is a method of the Booster object in the original LGBM. The sklearn API exposes the underlying Booster of the trained model through the booster_ attribute, as given in the API docs. So you can first access this Booster object and then call feature_importance() in the same way as in the original LGBM:
clf.booster_.feature_importance(importance_type='gain')
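As a quick usage sketch (assuming the fitted clf and the feats feature list from the question), you can put both measures side by side in the same DataFrame and sort by gain:

import pandas as pd

# Gain and split importances come back as numpy arrays in feature order
fold_importance_df = pd.DataFrame({
    "feature": feats,
    "importance_gain": clf.booster_.feature_importance(importance_type='gain'),
    "importance_split": clf.booster_.feature_importance(importance_type='split'),
})
print(fold_importance_df.sort_values("importance_gain", ascending=False).head())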
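Note also that more recent LightGBM releases add an importance_type parameter to the sklearn estimator itself (check the docs for your installed version; older releases may not have it), so feature_importances_ can report gain directly:

from lightgbm import LGBMClassifier

# With importance_type='gain', the sklearn attribute feature_importances_
# returns total gain instead of split counts (recent LightGBM versions).
clf_gain = LGBMClassifier(importance_type='gain', n_estimators=100, random_state=0)
clf_gain.fit(train_x, train_y)
print(clf_gain.feature_importances_)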