Random Forest feature importance: how many are actually used?

Question

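I fit a random forest (RF) twice in a row.

First, I fit it with max_features='auto' on the whole dataset (109 features) in order to perform feature selection. Below is RandomForestClassifier.feature_importances_, which correctly gives me 109 scores, one per feature: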

[0.00118087,  0.01268531,  0.0017589 ,  0.01614814,  0.01105567,
0.0146838 ,  0.0187875 ,  0.0190427 ,  0.01429976,  0.01311706,
0.01702717,  0.00901344,  0.01044047,  0.00932331,  0.01211333,
0.01271825,  0.0095337 ,  0.00985686,  0.00952823,  0.01165877,
0.00193286,  0.0012602 ,  0.00208145,  0.00203459,  0.00229907,
0.00242616,  0.00051358,  0.00071606,  0.00975515,  0.00171034,
0.01134927,  0.00687018,  0.00987706,  0.01507474,  0.01223525,
0.01170495,  0.00928417,  0.01083082,  0.01302036,  0.01002457,
0.00894818,  0.00833564,  0.00930602,  0.01100774,  0.00818604,
0.00675784,  0.00740617,  0.00185461,  0.00119627,  0.00159034,
0.00154336,  0.00478926,  0.00200773,  0.00063574,  0.00065675,
0.01104192,  0.00246746,  0.01663812,  0.01041134,  0.01401842,
0.02038318,  0.0202834 ,  0.01290935,  0.01476593,  0.0108275 ,
0.0118773 ,  0.01050919,  0.0111477 ,  0.00684507,  0.01170021,
0.01291888,  0.00963295,  0.01161876,  0.00756015,  0.00178329,
0.00065709,  0.        ,  0.00246064,  0.00217982,  0.00305187,
0.00061284,  0.00063431,  0.01963523,  0.00265208,  0.01543552,
0.0176546 ,  0.01443356,  0.01834896,  0.01385694,  0.01320648,
0.00966011,  0.0148321 ,  0.01574166,  0.0167107 ,  0.00791634,
0.01121442,  0.02171706,  0.01855552,  0.0257449 ,  0.02925843,
0.01789742,  0.        ,  0.        ,  0.00379275,  0.0024365 ,
0.00333905,  0.00238971,  0.00068355,  0.00075399]

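Then I transform the dataset using that fitted model, which should reduce its dimensionality, and refit an RF on the result. Given max_features='auto' and 109 features, I expected to end up with about 10 features in total, but calling rf.feature_importances_ returns more (62):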

[ 0.01261971, 0.02003921, 0.00961297, 0.02505467, 0.02038449,
0.02353745, 0.01893777, 0.01932577, 0.01681398, 0.01464485,
0.01672119, 0.00748981, 0.01109461, 0.01116948, 0.0087081 ,
0.01056344, 0.00971319, 0.01532258, 0.0167348 , 0.01601214,
0.01522208, 0.01625487, 0.01653784, 0.01483562, 0.01602748,
0.01522369, 0.01581573, 0.01406688, 0.01269036, 0.00884105,
0.02538574, 0.00637611, 0.01928382, 0.02061512, 0.02566056,
0.02180902, 0.01537295, 0.01796305, 0.01171095, 0.01179759,
0.01371328, 0.00811729, 0.01060708, 0.015717 , 0.01067911,
0.01773623, 0.0169396 , 0.0226369 , 0.01547827, 0.01499467,
0.01356075, 0.01040735, 0.01360752, 0.01754145, 0.01446933,
0.01845195, 0.0190799 , 0.02608652, 0.02095663, 0.02939744,
0.01870901, 0.02512201]

Why? Shouldn't it return only ~10 important features?

Answer 1

You have misunderstood what max_features means. It is

The number of features to consider when looking for the best split

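not the number of features kept when the data is transformed. It only limits how many candidate features each individual split looks at, so across all the trees every feature can still be used and receive an importance score.

It is the threshold parameter of the transform method that decides which features count as important enough to keep.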
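To make the per-split meaning concrete, here is a small pure-Python simulation. It is only a sketch, not scikit-learn's actual tree-building code: it assumes that 'auto' resolves to the square-root rule for classifiers (as it did in older scikit-learn releases), and the number of splits (200) is a made-up figure for illustration.

```python
import math
import random

n_features = 109

# Square-root rule: the number of candidate features examined at EACH split.
# Assumption: this is what max_features='auto' meant for classifiers
# in older scikit-learn releases.
candidates_per_split = max(1, int(math.sqrt(n_features)))

# Simulate many splits, each drawing its own random candidate subset.
rng = random.Random(0)
n_splits = 200  # hypothetical number of splits across the whole forest
ever_considered = set()
for _ in range(n_splits):
    subset = rng.sample(range(n_features), candidates_per_split)
    ever_considered.update(subset)

print("candidates per split:", candidates_per_split)
print("distinct features considered overall:", len(ever_considered))
```

Each split sees only about 10 candidates, but the union over many splits covers far more than 10 features, which is why the first fit reports an importance for all 109.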

threshold : string, float or None, optional (default=None)

The threshold value to use for feature selection. Features whose importance is greater or equal are kept while the others are discarded. If “median” (resp. “mean”), then the threshold value is the median (resp. the mean) of the feature importances. A scaling factor (e.g., “1.25*mean”) may also be used. If None and if available, the object attribute threshold is used. Otherwise, “mean” is used by default.


feature-selection

random-forest

scikit-learn