Display selected features after Gridsearch

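I am using GridSearchCV to perform feature selection (SelectKBest) for a linear regression. The results show that 10 features are selected (using .best_params_), but I am not sure how to display which features those are.

The code is pasted below. I am using a pipeline because the next models will also need hyperparameter selection. x_train is a DataFrame with 12 columns that I cannot share due to data restrictions.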

cv_folds = KFold(n_splits=5, shuffle=False)
steps = [('feature_selection', SelectKBest(mutual_info_regression, k=3)),
         ('regr', LinearRegression())]
pipe = Pipeline(steps)

search_space = [{'feature_selection__k': [1,2,3,4,5,6,7,8,9,10,11,12]}]

clf = GridSearchCV(pipe, search_space, scoring='neg_mean_squared_error', cv=5, verbose=0)
clf = clf.fit(x_train, y_train)

print(clf.best_params_)

You can access the information about the feature_selection step like this:

<GridSearch_model_variable>.best_estimator_.named_steps[<feature_selection_step>]

So, in your case, it would be:

print(clf.best_estimator_.named_steps['feature_selection'])
#Output: SelectKBest(k=8, score_func=<function mutual_info_regression at 0x13d37b430>)

Next, use the get_support function to get a boolean mask of the selected features:

print(clf.best_estimator_.named_steps['feature_selection'].get_support())
# Output: array([ True, False,  True, False,  True,  True,  True, False, False,
        True,  True, False,  True])
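
As a side note, get_support also accepts indices=True, which returns the integer positions of the selected columns instead of a boolean mask. The sketch below illustrates both forms on a small synthetic dataset; the data, the variable names and k=3 are assumptions made purely for illustration and are not taken from the question.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, mutual_info_regression

# Synthetic stand-in data (assumption): 6 features, 3 of them informative.
X_demo, y_demo = make_regression(
    n_samples=200, n_features=6, n_informative=3, random_state=0
)

# Fit a selector on its own, outside any pipeline, to keep the example small.
selector = SelectKBest(mutual_info_regression, k=3).fit(X_demo, y_demo)

mask = selector.get_support()                 # boolean array, one entry per input column
indices = selector.get_support(indices=True)  # integer positions of the kept columns

print(mask)
print(indices)
```

The integer form is handy when the input is a plain NumPy array with no column names to index into.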

Now apply this mask to the original columns:

data_columns = X.columns # List of columns in your dataset

# This is the original list of columns
print(data_columns)
# Output: ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX',
       'PTRATIO', 'B', 'LSTAT']

# Now print the selected columns
print(data_columns[clf.best_estimator_.named_steps['feature_selection'].get_support()])
# Output: ['CRIM', 'INDUS', 'NOX', 'RM', 'AGE', 'TAX', 'PTRATIO', 'LSTAT']

So you can see that only 8 of the 13 features were selected (k=8 was the best case on my data).

Here is the complete code for the Boston dataset:

import pandas as pd
from sklearn.datasets import load_boston
from sklearn.model_selection import KFold
from sklearn.feature_selection import SelectKBest, mutual_info_regression
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

boston_dataset = load_boston()
X = pd.DataFrame(boston_dataset.data, columns=boston_dataset.feature_names)
y = boston_dataset.target

cv_folds = KFold(n_splits=5, shuffle=False)
steps = [('feature_selection', SelectKBest(mutual_info_regression, k=3)),
         ('regr', LinearRegression())]

pipe = Pipeline(steps)

search_space = [{'feature_selection__k': [1,2,3,4,5,6,7,8,9,10,11,12]}]

clf = GridSearchCV(pipe, search_space, scoring='neg_mean_squared_error', cv=cv_folds, verbose=0)
clf = clf.fit(X, y)

print(clf.best_params_)

data_columns = X.columns
selected_features = data_columns[clf.best_estimator_.named_steps['feature_selection'].get_support()]

print(selected_features)
# Output : Index(['CRIM', 'INDUS', 'NOX', 'RM', 'AGE', 'TAX', 'PTRATIO', 'LSTAT'], dtype='object')
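
Note that load_boston was removed in scikit-learn 1.2, so the code above only runs on older versions. The sketch below repeats the same workflow on a synthetic dataset so it can be run on a current install. The data from make_regression, the column names x0..x11 and the noise level are assumptions for illustration, so the selected features will not match the Boston output shown above.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, mutual_info_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the real data: 12 columns, 4 of them informative (assumption).
X_arr, y_syn = make_regression(
    n_samples=300, n_features=12, n_informative=4, noise=10.0, random_state=0
)
X_syn = pd.DataFrame(X_arr, columns=[f"x{i}" for i in range(12)])

pipe_syn = Pipeline([
    ("feature_selection", SelectKBest(mutual_info_regression, k=3)),
    ("regr", LinearRegression()),
])

# Try every possible number of features, from 1 up to all 12.
search_space = {"feature_selection__k": list(range(1, 13))}
cv_folds = KFold(n_splits=5, shuffle=False)

search = GridSearchCV(
    pipe_syn, search_space, scoring="neg_mean_squared_error", cv=cv_folds
)
search.fit(X_syn, y_syn)

best_k = search.best_params_["feature_selection__k"]
selector = search.best_estimator_.named_steps["feature_selection"]
selected_features = X_syn.columns[selector.get_support()]

print(best_k)
print(list(selected_features))
```

The number of names printed always equals the best k, because best_estimator_ is the pipeline refitted on the full data with the winning parameters.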
