Feature selection (embedded method) showing wrong features
I am getting the wrong features from feature selection (embedded method).
Feature selection code:
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# create the random forest model
model = RandomForestRegressor(n_estimators=120)
# fit the model to start training
model.fit(X_train[_columns], X_train['delay_in_days'])
# get the importance of the resulting features
importances = model.feature_importances_
# create a data frame for visualization
final_df = pd.DataFrame({"Features": X_train[_columns].columns, "Importances": importances})
# set_index returns a new frame, so assign the result back
final_df = final_df.set_index('Features')
# sort in descending order of importance
final_df = final_df.sort_values('Importances', ascending=False)
# visualise the top 10 feature importances
pd.Series(model.feature_importances_, index=X_train[_columns].columns).nlargest(10).plot(kind='barh')
_columns  # my selected subset of features
[Image: horizontal bar chart of the top 10 feature importances]
Here is the feature list, and you can see that total_open_amount is a very important feature.
But when I put the top 3 features into my model, I get a negative R2_Score, whereas if I remove total_open_amount from the model I get a decent R2_Score.
My question is: what is causing this? (The train and test sets are both drawn randomly from a dataset of size 100,000.)
clf = RandomForestRegressor()
clf.fit(x_train, y_train)
# Predicting the Test Set Results
predicted = clf.predict(x_test)
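For reference, the negative R2_Score mentioned above would be computed along these lines (a minimal sketch; y_test is assumed to hold the delay_in_days values of the test split):

from sklearn.metrics import r2_score

# R^2 is negative when the model does worse than always predicting the mean
print(r2_score(y_test, predicted))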
This is an educated guess, since you did not provide the data itself. Looking at your feature names, the most important ones are name_customer and total_open_amount. My guess is that these are features with many unique values.
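A quick way to check that guess, reusing X_train and _columns from the question, is to count the unique values per feature:

# features with counts close to the number of rows are high-cardinality
X_train[_columns].nunique().sort_values(ascending=False)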
If you check the help page of the random forest, it mentions:
Warning: impurity-based feature importances can be misleading for high
cardinality features (many unique values). See
sklearn.inspection.permutation_importance as an alternative.
This is also mentioned in a paper by Strobl et al.:
We show that random forest variable importance measures are a sensible
means for variable selection in many applications, but are not
reliable in situations where potential predictor variables vary in
their scale of measurement or their number of categories.
I would try permutation importance and see whether I get the same results, as sketched below.
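A minimal sketch of what that could look like, reusing model and _columns from the question. X_val and y_val are hypothetical names for a held-out split; permutation importance is most informative on data the model has not seen:

from sklearn.inspection import permutation_importance
import pandas as pd

# permute each feature several times on held-out data and measure
# how much the R^2 score drops (X_val/y_val are assumed, not from the question)
result = permutation_importance(model, X_val, y_val, n_repeats=10, random_state=42, scoring='r2')
perm_importances = pd.Series(result.importances_mean, index=_columns).sort_values(ascending=False)
print(perm_importances)

If total_open_amount drops far down this ranking, its impurity-based importance was likely inflated by its high cardinality.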