Feature selection (embedded method) showing the wrong features

I am getting the wrong features from feature selection with an embedded method.

The feature selection code:

# imports assumed by the snippet below
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# create the random forest model
model = RandomForestRegressor(n_estimators=120)

# fit the model on the selected feature columns; the target
# column 'delay_in_days' lives in the same training frame
model.fit(X_train[_columns], X_train['delay_in_days'])

# get the impurity-based importances of the fitted features
importances = model.feature_importances_

# create a data frame for visualization, indexed by feature name
final_df = pd.DataFrame({"Features": X_train[_columns].columns, "Importances": importances})
final_df = final_df.set_index('Features')

# sort in descending order of importance
final_df = final_df.sort_values('Importances', ascending=False)

# visualise the ten most important features
pd.Series(model.feature_importances_, index=X_train[_columns].columns).nlargest(10).plot(kind='barh')

_columns  # my selected feature columns

[Image: horizontal bar chart of the ten largest feature importances, with total_open_amount among the top features]

Here is the feature list; you can see that total_open_amount is a very important feature. But when I put the top 3 features into my model, I get a negative R2_Score. If I drop total_open_amount from my model, I get a decent R2_Score.

My question is: what is causing this? (All training and test data are randomly sampled from a dataset of 100,000 rows.)

from sklearn.ensemble import RandomForestRegressor

# retrain a fresh random forest on the reduced feature set
clf = RandomForestRegressor()
clf.fit(x_train, y_train)

# predicting the test set results
predicted = clf.predict(x_test)
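
A minimal sketch of the scoring step (hypothetical; it assumes y_test holds the true target values for x_test):

from sklearn.metrics import r2_score

# R2 compares the model's squared error to that of always predicting
# the mean of y_test; it turns negative when the model does worse
# than that constant baseline
print(r2_score(y_test, predicted))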

This is an educated guess, since you have not provided the data itself. Looking at your feature names, the most important features are the customer name and total_open_amount. My guess is that these are features with many unique values.
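
A quick way to test that guess is to count the unique values per feature (a sketch, reusing X_train and _columns from your question):

# high counts relative to the sample size indicate high-cardinality
# features, whose impurity-based importances tend to be inflated
print(X_train[_columns].nunique().sort_values(ascending=False))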

If you check the random forest's help page, it mentions:

Warning: impurity-based feature importances can be misleading for high cardinality features (many unique values). See sklearn.inspection.permutation_importance as an alternative.

This is also mentioned in a paper by Strobl et al.:

We show that random forest variable importance measures are a sensible means for variable selection in many applications, but are not reliable in situations where potential predictor variables vary in their scale of measurement or their number of categories.

I would try permutation importance and see whether I get the same results.
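
A minimal sketch of what that could look like, reusing model and _columns from your code (it assumes a held-out X_test/y_test split exists):

from sklearn.inspection import permutation_importance

# shuffle each feature a few times on held-out data and measure how much
# the score drops; unlike impurity-based importances, this does not
# favour high-cardinality features
result = permutation_importance(model, X_test[_columns], y_test,
                                n_repeats=10, random_state=0)

for name, mean in sorted(zip(_columns, result.importances_mean),
                         key=lambda t: -t[1]):
    print(f"{name}: {mean:.4f}")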