Feature importances and selection per target variable

Question

Because I have too many features, I want to reduce their number, and I found a way to determine feature importance using RandomForestClassifier:

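import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Fit a random forest on all features X and the target y
rnd_clf = RandomForestClassifier(n_estimators=500, n_jobs=1, random_state=42)
rnd_clf.fit(X, y)

# NUMBER is presumably the list of feature names, in the column order of X
a = {name: importance for name, importance in zip(NUMBER, rnd_clf.feature_importances_)}
df = pd.DataFrame(list(a.items()), columns=['name', 'importance'])

# Rank features from most to least important
df2 = df.sort_values('importance', ascending=False)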

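However, I have 6 target variables, so I want to determine which features are important for each target variable separately, rather than one overall ranking as in the code above.

I tried removing the other target variables from the training set, but that did not help: all importances came out as 0. How can I solve this?

Edit: sample data. Partij is y. The other variables are X (there are more columns than shown).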

gemeente    Partij  Perioden    Bevolking/Bevolkingssamenstelling op 1 januari/Totale bevolking (aantal)    Bevolking/Bevolkingssamenstelling op 1 januari/Geslacht/Mannen (aantal) Bevolking/Bevolkingssamenstelling op 1 januari/Geslacht/Vrouwen (aantal)    Bevolking/Bevolkingssamenstelling op 1 januari/Leeftijd/Leeftijdsgroepen/Jonger dan 5 jaar (aantal) Bevolking/Bevolkingssamenstelling op 1 januari/Leeftijd/Leeftijdsgroepen/5 tot 10 jaar (aantal) Bevolking/Bevolkingssamenstelling op 1 januari/Leeftijd/Leeftijdsgroepen/10 tot 15 jaar (aantal)    Bevolking/Bevolkingssamenstelling op 1 januari/Leeftijd/Leeftijdsgroepen/15 tot 20 jaar (aantal)
0   's-Hertogenbosch    VVD 2007    135648.0    66669.0 68979.0 7986.0  7809.0  7514.0  7612.0  ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1   's-Hertogenbosch    VVD 2008    136481.0    67047.0 69434.0 7885.0  7853.0  7517.0  7680.0  ... 5.8 8.6 41.3    5.2 4.0 20.0    4.0 5.0 25.0    3.0
2   's-Hertogenbosch    VVD 2009    137775.0    67715.0 70060.0 7915.0  7890.0  7497.0  7628.0  ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3   's-Hertogenbosch    VVD 2010    139607.0    68628.0 70979.0 8127.0  7852.0  7527.0  7752.0  ... 5.6 8.4 40.7    5.4 4.0 20.0    3.0 5.0 24.0    3.0
4   Aa en Hunze PVDA    2007    25563.0 12653.0 12910.0

Answer 1

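If you want to know which features are important for each target variable, create 6 datasets, one per target variable, and fit a separate model on each.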
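
A minimal sketch of that idea, under stated assumptions: the six targets are taken to be the distinct values of a single label column such as Partij, each turned into a binary one-vs-rest target. If the targets are instead six separate columns, the same loop would run over those columns. The helper name `importances_per_target`, the synthetic data, and the label `OTHER` are made up for illustration and are not from the question.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier


def importances_per_target(X, y, n_estimators=100, random_state=42):
    """Fit one one-vs-rest forest per label in y.

    Returns a DataFrame with one row per feature and one column per label.
    """
    columns = {}
    for label in sorted(y.unique()):
        clf = RandomForestClassifier(n_estimators=n_estimators,
                                     random_state=random_state)
        # Binary target: 1 for rows with this label, 0 for all other rows
        clf.fit(X, (y == label).astype(int))
        columns[label] = pd.Series(clf.feature_importances_, index=X.columns)
    return pd.DataFrame(columns)


# Synthetic stand-in data (hypothetical): 3 labels instead of 6, 8 numeric features
X_arr, y_arr = make_classification(n_samples=300, n_features=8, n_informative=4,
                                   n_classes=3, random_state=0)
X_demo = pd.DataFrame(X_arr, columns=[f"feature_{i}" for i in range(8)])
y_demo = pd.Series(y_arr).map({0: "VVD", 1: "PVDA", 2: "OTHER"})

table = importances_per_target(X_demo, y_demo)
# Ranking of the features for one target
print(table.sort_values("VVD", ascending=False))
```

Each column of `table` is a separate ranking, so the features can be sorted per target. The sample data in the question contains NaN values; depending on the scikit-learn version these may need to be imputed or dropped before fitting.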

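Beyond the feature_importances_ attribute of tree-based models, there are several other ways to explore feature importance:

Correlation coefficients
Model-based ranking
Stability selection
RFE (recursive feature elimination)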
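
Two of these options can be sketched briefly: correlation coefficients and RFE. This is an illustration on synthetic data with a single binary target, not a recommendation of particular settings; the variable names and the choice of keeping 3 features are assumptions. Stability selection is left out of this sketch.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

# Synthetic stand-in data (hypothetical): 8 numeric features, binary target
X_arr, y_arr = make_classification(n_samples=300, n_features=8, n_informative=3,
                                   n_redundant=0, random_state=0)
X_demo = pd.DataFrame(X_arr, columns=[f"feature_{i}" for i in range(8)])
y_demo = pd.Series(y_arr, name="target")

# Correlation coefficients: absolute Pearson correlation of each feature
# with the target, sorted from strongest to weakest
correlations = X_demo.corrwith(y_demo).abs().sort_values(ascending=False)

# RFE: repeatedly fit the model and drop the weakest feature
# until only n_features_to_select remain
rfe = RFE(RandomForestClassifier(n_estimators=50, random_state=42),
          n_features_to_select=3)
rfe.fit(X_demo, y_demo)
selected = list(X_demo.columns[rfe.support_])
# ranking_ is 1 for kept features; larger values were eliminated earlier
ranking = pd.Series(rfe.ranking_, index=X_demo.columns).sort_values()

print(correlations)
print(selected)
print(ranking)
```

Pearson correlation only captures linear relationships, so a feature with a low correlation can still rank highly under RFE with a forest. For per-target results, the same steps can be repeated once per target variable.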

Tags: python, feature-selection, pandas, scikit-learn