Names for Feature Selection

I want to know the feature names in my RF model. I read here that the output from gs.best_estimator_.named_steps["stepname"].feature_importances_ would mirror the columns of my data. However, the length of gs.best_estimator_.... is 10 and I have 13 columns. Some columns were not important. From other answers around (answer1, ), I gather that I have to declare something in my pipeline. But I am confused about what to declare, since both answers deal with PCA, not RF.

Here is what I have so far.

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn import datasets

# use iris as example: build a DataFrame with a 'species' column
bunch = datasets.load_iris(as_frame=True)
iris = bunch.data.copy()
iris.columns = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
iris['species'] = bunch.target_names[bunch.target.to_numpy()]
X = iris.drop(['sepal_length'], axis=1)
y = iris.sepal_length
cat_feats = ['species']
X_train, X_test, y_train, y_test = \
        train_test_split(X, y, train_size=0.8, test_size=0.2, random_state=13)
# Pipeline
categorical_transformer = Pipeline(steps=[
    # on scikit-learn >= 1.2 the 'sparse' argument is named 'sparse_output'
    ('onehot', OneHotEncoder(handle_unknown='ignore', sparse=False))
])
# Bundle any preprocessing
preprocessor = ColumnTransformer(
    transformers=[
        ('cat', categorical_transformer, cat_feats)
    ])
rf = RandomForestRegressor(random_state=13)
mymodel = Pipeline(steps=[('preprocessor', preprocessor),
                          ('model', rf)])
# For this example, I used default values. In reality I do use a dictionary of parameters
gs = GridSearchCV(mymodel,
                  param_grid={},  # empty grid: default values only
                  n_jobs=-1,
                  cv=5)
gs.fit(X_train, y_train)

Why the lengths of the feature lists don't match

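Your feature lengths don't match because all non-categorical columns are dropped when you use the ColumnTransformer. By default, it only keeps the columns for which a transformation is specified. So if you don't want that to happen, you need to do this: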

preprocessor = ColumnTransformer(transformers=[('cat', OneHotEncoder(), cat_feats)],
                                 remainder='passthrough')

(I removed your categorical pipeline; it isn't needed here.)

Also keep in mind that applying OHE adds features, so the total number of features will be larger than what you started with.
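To see both effects side by side, here is a minimal sketch on a made-up four-row frame (the data and the variable names n_dropped and n_kept are invented for illustration, not taken from your setup). It counts the output columns with and without remainder='passthrough':

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Toy data, invented for illustration: 1 categorical + 2 numeric columns.
X = pd.DataFrame({
    "species": ["setosa", "virginica", "setosa", "versicolor"],
    "sepal_width": [3.5, 3.0, 3.2, 2.8],
    "petal_length": [1.4, 5.1, 1.3, 4.5],
})
cat_feats = ["species"]

# Default remainder='drop': only the one-hot columns survive.
dropped = ColumnTransformer(transformers=[("cat", OneHotEncoder(), cat_feats)])
n_dropped = dropped.fit_transform(X).shape[1]

# remainder='passthrough': the numeric columns are appended after the one-hot columns.
kept = ColumnTransformer(transformers=[("cat", OneHotEncoder(), cat_feats)],
                         remainder="passthrough")
n_kept = kept.fit_transform(X).shape[1]

print(n_dropped, n_kept)  # 3 5
```

Three species levels give three one-hot columns; with passthrough the two numeric columns are kept as well, so the frame goes from 3 input columns to 5 output columns.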

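How to get the feature names

Once everything is fitted, you need to retrieve the feature names for the OHE output and for the remaining numeric columns.

For the OHE columns: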

cat_features = gs.best_estimator_["preprocessor"].named_transformers_["cat"].get_feature_names()

For the numeric columns, you need to declare num_feats, listing all numeric features in the same order as in the original dataframe.

Then just do:

import numpy as np

feature_names = np.concatenate((cat_features, num_feats))

PS: This is a bit cumbersome, and future sklearn versions may improve it, but for now this is the process.