Names for Feature Selection

Question

I would like to know the names of the features in my RF model. I read here that the output from gs.best_estimator_.named_steps["stepname"].feature_importances_ would mirror the columns of my data. However, the length of gs.best_estimator_.... is 10 and I have 13 columns. Some columns were not important. From other answers around (answer1, ), I understand that I must declare something in my pipeline. But I am confused about what to declare, because both answers deal with PCA, not RF.

Here is what I have so far.

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn import datasets

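# use iris as example, loaded as a DataFrame with a categorical 'species' column
data = datasets.load_iris(as_frame=True)
iris = data.frame
iris.columns = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']
iris['species'] = data.target_names[iris['species']]
X = iris.drop(['sepal_length'],axis=1)
y = iris.sepal_length
cat_feats = ['species']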
X_train, X_test, y_train, y_test = \
        train_test_split(X, y, train_size=0.8, test_size=0.2, random_state=13)
# Pipeline
categorical_transformer = Pipeline(steps=[
                ('onehot', OneHotEncoder(handle_unknown='ignore',sparse=False))
                                    ])
# Bundle any preprocessing
preprocessor = ColumnTransformer(
    transformers=[
        ('cat', categorical_transformer, cat_feats)
    ])
rf = RandomForestRegressor(random_state = 13)
mymodel = Pipeline(steps = [('preprocessor', preprocessor),
                            ('model', rf)
                            ])
# For this example, I used default values. In reality I do use a dictionary of parameters
gs = GridSearchCV(mymodel
                           ,param_grid = {}  # empty grid: default parameters only
                           ,n_jobs = -1
                           ,cv = 5
                           )
gs.fit(X_train,y_train)

Answer 1

Why the length of the feature list does not match

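Your feature lengths do not match because all non-categorical columns are dropped when you use ColumnTransformer. By default, it keeps only the columns for which a transformer is specified. So if you do not want that to happen, you need to do this: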

preprocessor = ColumnTransformer(transformers=[('cat', OneHotEncoder(), cat_feats)],
                                 remainder='passthrough')

(I removed your categorical pipeline, since it is not needed here.)

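Also keep in mind that applying OHE adds features, so the total number of features will be greater than the number of columns you started with.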
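To make both effects concrete, here is a minimal sketch on a made-up three-column frame (the data and the names X_demo, dropped and kept are illustrative, not taken from the question). It compares the number of output columns with the default remainder and with remainder='passthrough'.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Toy frame (made up for illustration): one categorical and two numeric columns
X_demo = pd.DataFrame({
    "species": ["setosa", "versicolor", "virginica", "setosa"],
    "sepal_width": [3.5, 3.2, 3.3, 3.0],
    "petal_length": [1.4, 4.7, 6.0, 1.4],
})

# Default remainder='drop': only the one-hot encoded columns survive
dropped = ColumnTransformer(
    transformers=[("cat", OneHotEncoder(), ["species"])]
).fit_transform(X_demo)

# remainder='passthrough': one-hot columns first, then the untouched columns
kept = ColumnTransformer(
    transformers=[("cat", OneHotEncoder(), ["species"])],
    remainder="passthrough",
).fit_transform(X_demo)

n_dropped = dropped.shape[1]  # 3 one-hot columns
n_kept = kept.shape[1]        # 3 one-hot columns + 2 numeric columns
print(n_dropped, n_kept)
```

So a three-column input ends up with five columns after passthrough, because the single categorical column expands into one column per category.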

How to get the feature names

Once everything has been fitted, you need to retrieve the feature names produced by the OHE and those of the remaining numeric columns.

For the OHE columns:

cat_features = gs.best_estimator_["preprocessor"].named_transformers_["cat"].get_feature_names()

For the numeric columns, you need to declare num_feats, containing all numeric features in the same order as in the original dataframe.
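One possible way to build num_feats without typing the names by hand is sketched below. The small X frame is a hypothetical stand-in for the real training data; the order matters because passthrough columns are appended in their original order.

```python
import pandas as pd

# Hypothetical stand-in for the X used in the question
X = pd.DataFrame({
    "sepal_width": [3.5, 3.2],
    "species": ["setosa", "versicolor"],
    "petal_length": [1.4, 4.7],
})
cat_feats = ["species"]

# Keep the original column order, skipping the categorical columns
num_feats = [col for col in X.columns if col not in cat_feats]
print(num_feats)
```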

Then simply do:

import numpy as np
feature_names = np.concatenate((cat_features, num_feats))

PS: This is a bit cumbersome. Future sklearn versions may improve it, but for now this is the process.
