Plot feature importance with xgboost

Question

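When I plot the feature importance, I get this messy plot. I have more than 7000 variables. I understand that the built-in function only selects the most important ones, but the final graph is still unreadable. This is the complete code: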

import numpy as np
import pandas as pd
df = pd.read_csv('ricerice.csv')
array=df.values
X = array[:,0:7803]
Y = array[:,7804]
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
seed=0
test_size=0.30
X_train, X_test, y_train, y_test = train_test_split(X,Y,test_size=test_size, random_state=seed)
from xgboost import XGBClassifier
model = XGBClassifier()
model.fit(X, Y)
import matplotlib.pyplot as plt
from matplotlib import pyplot
from xgboost import plot_importance
fig1=plt.gcf()
plot_importance(model)
plt.draw()
fig1.savefig('xgboost.png', figsize=(50, 40), dpi=1000)

Despite the size of the figure, the plot is illegible.

Answer 1

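A couple of points:

To fit the model, you want to use the training dataset (X_train, y_train), not the entire dataset (X, y).
You can use the max_num_features parameter of the plot_importance() function to display only the top max_num_features features (e.g. the top 10).

With the above modifications to your code, and some randomly generated data, the code and output look like this: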

import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier, plot_importance

# generate some random data for demonstration purposes; use your original dataset here
X = np.random.rand(1000, 100)      # 1000 x 100 data
y = np.random.rand(1000).round()   # 0, 1 labels

seed = 0
test_size = 0.30
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=seed)

# fit on the training split only
model = XGBClassifier()
model.fit(X_train, y_train)

plot_importance(model, max_num_features=10)  # top 10 most important features
plt.show()

Answer 2

You can get the feature importance from an Xgboost model through its feature_importances_ attribute. In your case, it will be:

model.feature_importances_

This attribute is an array holding the gain importance of each feature. You can then plot it:

from matplotlib import pyplot as plt
plt.barh(feature_names, model.feature_importances_)

(feature_names is a list containing the feature names)

You can sort the array and select the number of features you want (for example, 10):

import numpy as np
sorted_idx = model.feature_importances_.argsort()[-10:]  # argsort is ascending, so keep the last 10 (the largest)
plt.barh(np.array(feature_names)[sorted_idx], model.feature_importances_[sorted_idx])
plt.xlabel("Xgboost Feature Importance")
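Because argsort returns indices in ascending order, it is easy to slice the wrong end and end up with the least important features. The helper below is a small sketch of the top-k selection; top_k_features is a hypothetical name, and the scores and names are made up for illustration:

```python
import numpy as np

def top_k_features(importances, feature_names, k=10):
    """Return (names, scores) of the k largest importances, in ascending order."""
    importances = np.asarray(importances)
    feature_names = np.asarray(feature_names)  # a plain list cannot be indexed with an index array
    top_idx = importances.argsort()[-k:]       # argsort is ascending, so the last k are the largest
    return feature_names[top_idx], importances[top_idx]

# Made-up importances purely for illustration.
scores = [0.05, 0.40, 0.10, 0.30, 0.15]
names = ["a", "b", "c", "d", "e"]
top_names, top_scores = top_k_features(scores, names, k=3)
print(top_names.tolist())  # ['e', 'd', 'b']
```

Passing top_names and top_scores to plt.barh in this ascending order draws the most important feature at the top, since barh places the first item at the bottom.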

There are two more ways to get feature importance:

You can use permutation_importance from scikit-learn (available since version 0.22)
You can use SHAP values

You can read more in my blog post.

Answer 3

You need to sort the feature importances in descending order first:

sorted_idx = trained_mdl.feature_importances_.argsort()[::-1]

Then plot them using the column names from your dataframe:

from matplotlib import pyplot as plt
n_top_features = 10
sorted_idx = trained_mdl.feature_importances_.argsort()[::-1]
plt.barh(X_test.columns[sorted_idx][:n_top_features], trained_mdl.feature_importances_[sorted_idx][:n_top_features])
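If the features already live in a dataframe, a pandas Series can carry the names and the sorting together. This is an alternative sketch rather than the approach above; the column names and importance values are made up for illustration:

```python
import numpy as np
import pandas as pd

# Made-up importances and column names, purely for illustration.
columns = pd.Index(["age", "height", "weight", "income", "score"])
importances = np.array([0.05, 0.40, 0.10, 0.30, 0.15])

n_top_features = 3
# nlargest returns the top entries already sorted in descending order.
top = pd.Series(importances, index=columns).nlargest(n_top_features)
print(top.index.tolist())  # ['height', 'income', 'score']
```

With matplotlib installed, top[::-1].plot.barh() should draw the largest bar at the top, since horizontal bars are laid out from the bottom up.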
