XGBoost feature importance: how do I get the original variable names after encoding?
I am doing classification with XGBoost, following a guide from a DataCamp course. The data is processed as follows:
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Split df into features (all but the last column) and target (the last column)
X, y = df.iloc[:, :-1], df.iloc[:, -1]
# Create a boolean mask for categorical columns: check if X.dtypes == object
categorical_mask = (X.dtypes == object)
# Get list of categorical column names
categorical_columns = X.columns[categorical_mask].tolist()
# Create LabelEncoder object: le
le = LabelEncoder()
# Apply LabelEncoder to categorical columns (refit on each column in turn)
X[categorical_columns] = X[categorical_columns].apply(lambda x: le.fit_transform(x))
# Create OneHotEncoder: ohe
# (categorical_features was removed in scikit-learn 0.22, so this call needs an older release)
ohe = OneHotEncoder(categorical_features=categorical_mask, sparse=False)
# Apply OneHotEncoder to categorical columns - output is no longer a dataframe: X_encoded is a NumPy array
X_encoded = ohe.fit_transform(X)
# Wrapping the array in a DataFrame gives integer column labels, not the original names
testy = pd.DataFrame(X_encoded)
X_train, X_test, y_train, y_test = train_test_split(testy, y, test_size=0.2, random_state=123)
DM_train = xgb.DMatrix(X_train, label=y_train)
DM_test = xgb.DMatrix(X_test, label=y_test)
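I tuned the hyperparameters with a cross-validated grid search on X_train and y_train. I then fit the model with the tuned parameters and created a feature importance plot: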
model.fit(X_train,y_train)
xgb.plot_importance(model, importance_type = 'gain')
This is the output:
How can I map these features back to the original data? I am confused because I used both LabelEncoder() and OneHotEncoder().
Any help is much appreciated.
I switched to DictVectorizer, which solved the problem:
import pandas as pd
from sklearn.model_selection import train_test_split
# Import DictVectorizer
from sklearn.feature_extraction import DictVectorizer

X, y = df.iloc[:, :-1], df.iloc[:, -1]
# Convert X into a list of per-row dictionaries using .to_dict(): df_dict
df_dict = X.to_dict("records")
# Create the DictVectorizer object: dv
dv = DictVectorizer(sparse=False)
# Apply dv on df_dict: X_encoded
X_encoded = dv.fit_transform(df_dict)
X_encoded = pd.DataFrame(X_encoded)
X_train, X_test, y_train, y_test = train_test_split(X_encoded, y, test_size=0.2, random_state=123)
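Now fit the model and plot the feature importance as before (model.fit followed by xgb.plot_importance).
Finally, you have to look up the names: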
# Use pprint to make the vocabulary easier to read
import pprint
pprint.pprint(dv.vocabulary_)
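The vocabulary maps each feature name to its column index, so looking up a name from an index seen in the plot means inverting it. A minimal sketch on hypothetical toy records (not the original data):

```python
from sklearn.feature_extraction import DictVectorizer

# Hypothetical toy records standing in for X.to_dict("records")
records = [
    {"color": "red", "price": 1.0},
    {"color": "blue", "price": 2.5},
    {"color": "red", "price": 3.0},
]

dv = DictVectorizer(sparse=False)
X_encoded = dv.fit_transform(records)

# vocabulary_ is {feature name: column index}; invert it to {column index: feature name}
index_to_name = {idx: name for name, idx in dv.vocabulary_.items()}

# Feature names listed in column order
feature_names = [index_to_name[i] for i in range(len(index_to_name))]
print(feature_names)
```

The same ordered list is also available directly as dv.feature_names_, so the inversion is mainly useful for one-off lookups by index.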
If anyone knows how to use the vocabulary dictionary to look up the feature names and put them in the plot, I would appreciate your input.