Names for Feature Selection
I would like to know the feature names in my RF model. I read here that the output from gs.best_estimator_.named_steps["stepname"].feature_importances_ would mirror the columns of my data. However, the length of gs.best_estimator_.... is 10 and I have 13 columns. Some columns were not important. From other answers around (answer1, ), I have to declare something in my pipeline. But I am confused about what to declare, because both answers deal with PCA, not RF.
Here is what I have so far.
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn import datasets
# use iris as example; build a data frame that has a 'species' column
data = datasets.load_iris(as_frame=True)
iris = data.frame.rename(columns=lambda c: c.replace(' (cm)', '').replace(' ', '_'))
iris['species'] = data.target_names[iris.pop('target').to_numpy()]
X = iris.drop(['sepal_length'], axis=1)
y = iris.sepal_length
cat_feats = ['species']
X_train, X_test, y_train, y_test = \
    train_test_split(X, y, train_size=0.8, test_size=0.2, random_state=13)
# Pipeline
categorical_transformer = Pipeline(steps=[
    # on scikit-learn >= 1.2 the argument is sparse_output=False
    ('onehot', OneHotEncoder(handle_unknown='ignore', sparse=False))
])
# Bundle any preprocessing
preprocessor = ColumnTransformer(
    transformers=[
        ('cat', categorical_transformer, cat_feats)
    ])
rf = RandomForestRegressor(random_state=13)
mymodel = Pipeline(steps=[('preprocessor', preprocessor),
                          ('model', rf)
                          ])
# For this example, I used default values (empty grid). In reality I do use a dictionary of parameters
gs = GridSearchCV(mymodel
                  , param_grid={}
                  , n_jobs=-1
                  , cv=5
                  )
gs.fit(X_train, y_train)
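Why the lengths of the feature lists do not match
Your feature lengths do not match because all non-categorical columns are dropped when you use ColumnTransformer. By default, it keeps only the columns for which a transformation is specified. So if you do not want that to happen, you need to do this: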
preprocessor = ColumnTransformer(transformers=[('cat', OneHotEncoder(), cat_feats)],
remainder='passthrough')
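(I removed your categorical pipeline; it is not needed here.)
Also keep in mind that applying OHE adds features, so the total number of features will be larger than the number you started with.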
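To make the effect of remainder concrete, here is a minimal sketch on a made-up toy data frame (the values and the names n_dropped and n_kept are only for illustration, not from the question). It compares the number of output columns with the default remainder='drop' and with remainder='passthrough'.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Toy frame with made-up values: one categorical and two numeric columns
X = pd.DataFrame({
    "species": ["setosa", "versicolor", "virginica", "setosa"],
    "sepal_width": [3.5, 3.2, 3.3, 3.0],
    "petal_length": [1.4, 4.7, 6.0, 1.4],
})
cat_feats = ["species"]

# Default behaviour: columns without a transformer are dropped
dropped = ColumnTransformer([("cat", OneHotEncoder(), cat_feats)])
# remainder='passthrough': untouched columns are appended after the encoded ones
kept = ColumnTransformer([("cat", OneHotEncoder(), cat_feats)],
                         remainder="passthrough")

n_dropped = dropped.fit_transform(X).shape[1]  # 3 one-hot columns only
n_kept = kept.fit_transform(X).shape[1]        # 3 one-hot + 2 numeric columns
print(n_dropped, n_kept)
```

With three species levels, the first transformer yields 3 columns and the second yields 5, which is more than the 3 columns of the input frame.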
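How to get the feature names
Once everything is fitted, you need to retrieve the feature names of the OHE output and of the remaining numeric columns.
For the OHE columns: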
# scikit-learn >= 1.0; older releases used get_feature_names() instead
cat_features = gs.best_estimator_["preprocessor"].named_transformers_["cat"].get_feature_names_out()
For the numeric columns, you need to declare num_feats, a list of all numeric features in the same order as in the original data frame.
Then just do:
import numpy as np
feature_names = np.concatenate((cat_features, num_feats))
PS: This is somewhat cumbersome and may improve in future sklearn versions, but for now this is the process.