How to get feature importances with column names after randomized search and one-hot encoded data?

Question

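I wrote the code block below. After finding the best estimator, I want to inspect the model's feature importances, but I can't figure out how to match them to the correct column names.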

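import xgboost as xgb
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import RandomizedSearchCV
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

scaler = StandardScaler()
# In scikit-learn 1.2 and later, the sparse argument is named sparse_output.
ohe = OneHotEncoder(categories=unique_list, sparse=False)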

col_transformers = ColumnTransformer([
                          ("scaler_onestep", scaler, numerical_columns),
                          ("ohe_onestep", ohe, categorical_columns)])


param_grid = {
        'XGB__estimator__max_depth': [3, 5, 7, 10],
        'XGB__estimator__learning_rate': [0.01, 0.1],
        'XGB__estimator__n_estimators': [100]}

model = MultiOutputClassifier(xgb.XGBClassifier(objective="binary:logistic"))

#Define a pipeline
pipeline = Pipeline([("preprocessing", col_transformers), ("XGB", model)])

rs_clf = RandomizedSearchCV(pipeline, param_grid, n_iter=3,
                            n_jobs=-1, verbose=2, cv=2, scoring="accuracy", refit=True, random_state=42)

rs_clf.fit(X, y)

This gives me the feature importances for the first label.

rs_clf.best_estimator_.named_steps["XGB"].estimators_[0].feature_importances_

This gives me the categories.

rs_clf.best_estimator_.named_steps["preprocessing"].transformers[1][1].categories

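The result has 389 columns while X has 279, so I can't simply reuse X's column names. How can I do this for one-hot encoded data? How can I find the names of those 389 columns?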

Answer 1

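The get_feature_names method will help a lot here. StandardScaler does not currently support it; since xgboost is completely unaffected by feature scaling, I suggest dropping the scaler and replacing the numeric part of the ColumnTransformer with "passthrough". Then rs_clf.best_estimator_.named_steps["preprocessing"].get_feature_names() should give the features in the order in which they reach XGB. (In scikit-learn 1.0 and later the method is named get_feature_names_out, and StandardScaler supports it as well.)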


python

pandas

scikit-learn

one-hot-encoding