Python: Slicing a NumPy array according to feature_importances_

Question

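I have a set of features stored as a NumPy array.

RandomForestRegressor in scikit-learn exposes feature_importances_, which holds an importance value for every feature.

I need to slice the NumPy array so that only the 50 most important features are kept and all other columns are dropped.

How can I do this easily?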

Answer 1

If I understand correctly, you are looking for argsort. It returns the indices that would sort the array in increasing order. For example:

import numpy as np
from sklearn.ensemble import RandomForestRegressor as RFR

# Create a random number generator so this example is repeatable
rs = np.random.RandomState(seed=1234)

# create 100 fake input variables with 10 features each
X = rs.rand(100, 10)
# create 100 fake response variables
Y = rs.rand(100)

rfr = RFR(random_state=rs)
rfr.fit(X, Y)

fi = rfr.feature_importances_
# argsort the feature importances and reverse to get order of decreasing importance
indices = np.argsort(fi)[::-1]

indices now holds the column indices of the input variables, ordered by decreasing feature importance.

In: print(indices)
[7 6 3 4 5 0 1 9 2 8]
In: print(fi[indices])
[ 0.22636046  0.19925157  0.17233547  0.09245424  0.08287206  0.0800437
  0.07174068  0.05554476  0.01044851  0.00894855]

Slice the input variables accordingly to keep only the n most important features:

X[:, indices[:n]] # n most important features
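
For the case in the question (keeping 50 columns), the same idea could be wrapped in a small helper. This is only a sketch: top_n_features and its keep_column_order flag are hypothetical names, not part of NumPy or scikit-learn, and the importances here are random numbers standing in for rfr.feature_importances_.

```python
import numpy as np


def top_n_features(X, importances, n, keep_column_order=False):
    """Return (X_reduced, selected) for the n largest importances.

    Hypothetical helper for illustration; not a NumPy or scikit-learn API.
    """
    importances = np.asarray(importances)
    if not 0 < n <= importances.shape[0]:
        raise ValueError("n must be between 1 and the number of features")
    # Column indices ordered from most to least important
    order = np.argsort(importances)[::-1]
    selected = order[:n]
    if keep_column_order:
        # Put the kept columns back in their original left-to-right order
        selected = np.sort(selected)
    return X[:, selected], selected


# Assumption: made-up importances stand in for rfr.feature_importances_
rng = np.random.RandomState(0)
X = rng.rand(100, 60)
fake_importances = rng.rand(60)

X_top, selected = top_n_features(X, fake_importances, n=50)
print(X_top.shape)  # (100, 50)
```

With keep_column_order=True the surviving columns stay in their original relative order, which may be convenient if the column positions are matched against a list of feature names elsewhere.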
