SelectFromModel from sklearn gives significantly different features on random forest and gradient boosting classifier

Question

As stated in the title, I am using SelectFromModel from sklearn to select features for both my random forest and gradient boosting classification models.

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import SelectFromModel

# Feature selection is fitted on the training set only, so no test data leaks into it.
# SelectFromModel.fit fits the estimator itself, so the estimator is passed in unfitted.
sel = SelectFromModel(GradientBoostingClassifier(n_estimators=10, learning_rate=0.25, max_depth=1, max_features=15, random_state=0))
sel.fit(X_train_bin, y_train)

# Boolean array marking the features whose importance reaches the threshold (the mean importance by default)
sel.get_support()

# Names of the selected features
selected_feat = X_train_bin.columns[sel.get_support()]
selected_feat

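The boolean arrays returned for the random forest and gradient boosting models are completely different. Feature selection with the random forest tells me to drop an additional 4 columns (out of 25 features), while feature selection with the gradient boosting model tells me to drop almost everything. What is going on here?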

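Edit: I am trying to compare the performance of these two models on my dataset. Should I move the threshold so that I at least have roughly the same number of features to train on?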

Answer 1

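There is no reason for them to select the same variables. GradientBoostingClassifier builds each tree to correct the errors of the previous step, whereas RandomForestClassifier trains independent trees that know nothing about each other's errors.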
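The difference can be made visible by fitting both ensembles on the same data and inspecting their importances. The sketch below is only an illustration: the synthetic dataset from make_classification, the random forest settings, and all variable names are assumptions, and only the gradient boosting parameters are taken from the question. Note that with max_depth=1 and n_estimators=10 on a binary target, the boosted model makes at most 10 splits, so at most 10 of the 25 features can receive a non-zero importance. The default mean threshold then discards everything else.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in for the question's data (assumption): 25 features, binary target
X, y = make_classification(n_samples=500, n_features=25, n_informative=8, random_state=0)

# Gradient boosting parameters as in the question: 10 stumps
gb = GradientBoostingClassifier(n_estimators=10, learning_rate=0.25, max_depth=1,
                                max_features=15, random_state=0)
# Random forest settings are hypothetical; the question does not show them
rf = RandomForestClassifier(n_estimators=100, random_state=0)

# SelectFromModel fits a clone of each estimator and applies the default "mean" threshold
gb_sel = SelectFromModel(gb).fit(X, y)
rf_sel = SelectFromModel(rf).fit(X, y)

gb_imp = gb_sel.estimator_.feature_importances_
rf_imp = rf_sel.estimator_.feature_importances_

print("GB features with non-zero importance:", np.count_nonzero(gb_imp))
print("RF features with non-zero importance:", np.count_nonzero(rf_imp))
print("GB features selected:", gb_sel.get_support().sum())
print("RF features selected:", rf_sel.get_support().sum())
```

If the boosted model's count of non-zero importances is small, increasing n_estimators or max_depth gives it the chance to use more features before the threshold is applied.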

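Another reason they may select different features is the split criterion: the random forest uses Gini impurity by default (or entropy, if you configured it that way), while gradient boosting uses Friedman MSE. Finally, it may be because both algorithms can draw a random subset of features when making each split. They therefore do not compare the same variables in the same order, which naturally produces different importances.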
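If the aim is to compare both models on the same number of features, one option is to keep a fixed number of top-ranked features instead of relying on the mean threshold. SelectFromModel supports this through its max_features parameter combined with threshold=-np.inf. The following is a minimal sketch under assumptions: the data is synthetic, both estimators use default settings, and k = 10 and the helper top_k_mask are hypothetical choices made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in for the question's data (assumption)
X, y = make_classification(n_samples=500, n_features=25, n_informative=8, random_state=0)

k = 10  # hypothetical number of features to keep for both models


def top_k_mask(estimator, X, y, k):
    """Return a boolean mask selecting the k features with the highest importance."""
    # threshold=-np.inf disables the importance cut-off, so only max_features limits the selection
    selector = SelectFromModel(estimator, threshold=-np.inf, max_features=k)
    selector.fit(X, y)
    return selector.get_support()


gb_mask = top_k_mask(GradientBoostingClassifier(random_state=0), X, y, k)
rf_mask = top_k_mask(RandomForestClassifier(random_state=0), X, y, k)

print("GB keeps features:", np.flatnonzero(gb_mask))
print("RF keeps features:", np.flatnonzero(rf_mask))
print("Kept by both:", np.flatnonzero(gb_mask & rf_mask))
```

This equalises the count only. The two masks will usually still differ, for the reasons given above.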

python

machine-learning

feature-selection

random-forest

scikit-learn