Random Forest feature_importances_: how to trace back to the source column?
Situation
I have a sklearn.ensemble.RandomForestRegressor model and want to find the most important feature.
MWE
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
import pandas as pd

df = pd.DataFrame({
    "ColA": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "ColB": ["CharA", "CharB", "CharC", "CharD", "CharE", "CharF", "CharG", "CharH", "CharI", "CharJ"],
    "ColC": [132, 1000, 5, 20, 165, 852, 403, 680, 481, 6],
    "ColD": [1, 2, 3, 4, 5, 6, 7, 1, 2, 3],
    "ColE": [2, 26, 5, 7, 1, 2, 3, 12, 65, 12]
})

num_attribs = ["ColA", "ColC"]
cat_attribs = ["ColB", "ColD", "ColE"]

# Scale the numeric columns, one-hot encode the categorical ones
full_pipeline = ColumnTransformer([
    ("num", MinMaxScaler(), num_attribs),
    ("cat", OneHotEncoder(), cat_attribs)
])

prep_df = full_pipeline.fit_transform(df)
prep_df.toarray()[0]
Output:
array([0. , 0.12763819, 1. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 1. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
1. , 0. , 0. , 0. , 0. ,
0. , 0. ])
Please note: this is only demo code. I know it would be easy if I left out toarray, but in my real data (~1 million rows, ~70 columns) the result comes in this format.
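Problem
I want to trace a feature back to its original column, but I am using a OneHotEncoder inside a ColumnTransformer. So there is no inverse_transform function, and I have more columns than before encoding.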
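To see where the extra columns come from, one can count the categories the fitted encoder learned for each source column. The following is a minimal sketch that rebuilds the demo frame from the MWE; the name `widths` is illustrative and not part of the original post.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

df = pd.DataFrame({
    "ColA": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "ColB": ["CharA", "CharB", "CharC", "CharD", "CharE",
             "CharF", "CharG", "CharH", "CharI", "CharJ"],
    "ColC": [132, 1000, 5, 20, 165, 852, 403, 680, 481, 6],
    "ColD": [1, 2, 3, 4, 5, 6, 7, 1, 2, 3],
    "ColE": [2, 26, 5, 7, 1, 2, 3, 12, 65, 12],
})
num_attribs = ["ColA", "ColC"]
cat_attribs = ["ColB", "ColD", "ColE"]

full_pipeline = ColumnTransformer([
    ("num", MinMaxScaler(), num_attribs),
    ("cat", OneHotEncoder(), cat_attribs),
])
prep_df = full_pipeline.fit_transform(df)

# A scaled numeric column stays a single column; a one-hot encoded column
# expands into one output column per distinct category seen during fit.
encoder = full_pipeline.named_transformers_["cat"]
widths = {col: 1 for col in num_attribs}
widths.update(
    {col: len(cats) for col, cats in zip(cat_attribs, encoder.categories_)}
)

print(widths)
print(sum(widths.values()), prep_df.shape[1])
```

For the demo data the five source columns expand to 27 transformed columns, which matches the length of the output row shown above.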
Question
If model.feature_importances_.argmax() says --> 15, how do I find out which source column that is?
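Answer
You can call get_feature_names_out() on each of the column transformer's fitted transformers and index into the resulting array. For your example, it would look like this: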
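# Index of the most important transformed feature
idx = model.feature_importances_.argmax()
# Feature names of all fitted transformers, concatenated in output order
cols = [col for t in full_pipeline.transformers_ for col in t[1].get_feature_names_out()]
name = cols[idx]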
Note that for one-hot encoded columns you get names of the form columnname_categoryname, which may be more informative than the column name alone.