How to extract feature names from sklearn pipeline transformers?
For reference:
- Python 3.8.3
- sklearn 1.0.2
I have a scikit-learn pipeline that formats some data for me, as described below.
I define my pipeline like this:
# Pipeline 1
cat_selector = make_column_selector(dtype_include=object)
num_selector = make_column_selector(dtype_include=np.number)
cat_linear_processor = OneHotEncoder(handle_unknown="ignore", drop='first', sparse=False)
num_linear_processor = make_pipeline(SimpleImputer(strategy="median", add_indicator=True), MinMaxScaler(feature_range=(-1,1)))
linear_preprocessor = make_column_transformer( (num_linear_processor, num_selector), (cat_linear_processor, cat_selector) )
model_params ={'alpha': 0.0013879181970625643,
'l1_ratio': 0.9634269882730605,
'fit_intercept': True,
'normalize': False,
'max_iter': 245.69684524349375,
'tol': 0.01855761485447601,
'positive': False,
'selection': 'random'}
model = ElasticNet(**model_params)
pipeline = make_pipeline(linear_preprocessor, model)
pipeline.steps
Which yields:
[('columntransformer',
ColumnTransformer(transformers=[('pipeline',
Pipeline(steps=[('simpleimputer',
SimpleImputer(add_indicator=True,
strategy='median')),
('minmaxscaler',
MinMaxScaler(feature_range=(-1,
1)))]),
<sklearn.compose._column_transformer.make_column_selector object at 0x0000029CA3231EE0>),
('onehotencoder',
OneHotEncoder(handle_unknown='ignore',
sparse=False),
<sklearn.compose._column_transformer.make_column_selector object at 0x0000029CA542F040>)])),
('elasticnet',
ElasticNet(alpha=0.0013879181970625643, l1_ratio=0.9634269882730605,
max_iter=245.69684524349375, normalize=False, selection='random',
tol=0.01855761485447601))]
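What I want to do is retrieve the feature names of the data the model was trained/tested on.
I have tried the approaches from a number of other questions.
However, none of those solutions worked. For example: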
[i for i in v.get_feature_names() for k, v in pipeline.named_steps.items() if hasattr(v,'get_feature_names')]
Yields:
----> 1 [i for i in v.get_feature_names() for k, v in pipeline.named_steps.items() if hasattr(v,'get_feature_names')]
NameError: name 'v' is not defined
I also tried:
pipeline[:-1].get_feature_names_out()
Yields:
AttributeError: Estimator simpleimputer does not provide get_feature_names_out. Did you mean to call pipeline[:-1].get_feature_names_out()?
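How can I retrieve the feature names after encoding from the current pipeline?
I think this post might help:
In short, the problem should just be the sklearn version. The PR I referenced in a post a few months ago seems to have just been merged, although there has been no new release since then. Installing the current development version of sklearn, scikit-learn 1.1.dev0, should solve the problem (at least it did for me).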
You can install the nightly builds like this:
pip install --pre --extra-index https://pypi.anaconda.org/scipy-wheels-nightly/simple scikit-learn -U
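To check whether an environment already has the fix, one could compare the installed version against 1.1, the development version mentioned above. This is only a minimal sketch: the helper name `is_at_least_1_1` is hypothetical, and it looks only at the leading major.minor part of the version string.

```python
import re


def is_at_least_1_1(version: str) -> bool:
    """Return True if the version string is 1.1 or later (1.1.dev0 counts)."""
    # Only the leading "major.minor" is parsed; suffixes such as ".dev0" are ignored.
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    major, minor = int(match.group(1)), int(match.group(2))
    return (major, minor) >= (1, 1)


question_version_ok = is_at_least_1_1("1.0.2")   # version used in the question
dev_version_ok = is_at_least_1_1("1.1.dev0")     # development version named above
print(question_version_ok, dev_version_ok)
```

In practice you would pass `sklearn.__version__` to the helper instead of a literal string.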
Here is an example on a toy dataset:
import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.linear_model import ElasticNet
X = pd.DataFrame({'city': ['London', 'London', 'Paris', 'Sallisaw', ''],
'title': ['His Last Bow', 'How Watson Learned the Trick', 'A Moveable Feast', 'The Grapes of Wrath', 'The Jungle'],
'expert_rating': [5, 3, 4, 5, 3],
'user_rating': [4, 5, 4, 2, 3]})
# Pipeline 1
cat_selector = make_column_selector(dtype_include=object)
num_selector = make_column_selector(dtype_include=np.number)
cat_linear_processor = OneHotEncoder(handle_unknown="ignore", drop='first', sparse=False)
num_linear_processor = make_pipeline(SimpleImputer(strategy="median", add_indicator=True), MinMaxScaler(feature_range=(-1,1)))
linear_preprocessor = make_column_transformer( (num_linear_processor, num_selector), (cat_linear_processor, cat_selector) )
model_params ={
'alpha': 0.0013879181970625643,
'l1_ratio': 0.9634269882730605,
'fit_intercept': True,
'normalize': False,
'max_iter': 245,
'tol': 0.01855761485447601,
'positive': False,
'selection': 'random'}
model = ElasticNet(**model_params)
pipeline = make_pipeline(linear_preprocessor, model)
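# Fit on every column except the last; the last column (user_rating) is the target
pipeline.fit(X.iloc[:, :-1], X.iloc[:, -1])
# Feature names produced by the preprocessing steps (everything except the final estimator)
pipeline[:-1].get_feature_names_out()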
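A side note on the NameError from the first attempt in the question: it is separate from the version issue. In a list comprehension the `for` clauses run left to right, so `v` has to be bound by the outer loop before `v.get_feature_names()` is used. Below is a minimal sketch of the reordered comprehension; `WithNames`, `WithoutNames` and `named_steps` are hypothetical stand-ins for `pipeline.named_steps`, not sklearn objects.

```python
# Hypothetical stand-ins for pipeline steps; only the clause order matters here.
class WithNames:
    def get_feature_names(self):
        return ["a", "b"]


class WithoutNames:
    pass


named_steps = {"encoder": WithNames(), "model": WithoutNames()}

# Outer loop first (binds v), then the filter, then the inner loop that uses v.
names = [
    name
    for step_name, v in named_steps.items()
    if hasattr(v, "get_feature_names")
    for name in v.get_feature_names()
]
print(names)  # ['a', 'b']
```

Reordering the clauses only removes the NameError. Whether each step of the real pipeline actually provides `get_feature_names` or `get_feature_names_out` still depends on the installed sklearn version.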