使用 sklearn 恢复 StandardScaler().fit_transform() 的特征名称

Question

编辑自 a tutorial in Kaggle, I try to run the code below and data (available to download from here):

代码：

import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt  # for plotting facilities
from datetime import datetime, date
from sklearn.model_selection import TimeSeriesSplit, GridSearchCV
import xgboost as xgb
from sklearn.metrics import mean_squared_error, mean_absolute_error
import math
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("./data/Aquifer_Petrignano.csv")

df['Date'] = pd.to_datetime(df.Date, format = '%d/%m/%Y')
df = df[df.Rainfall_Bastia_Umbra.notna()].reset_index(drop=True)

df = df.interpolate(method ='ffill')
df = df[['Date', 'Rainfall_Bastia_Umbra', 'Depth_to_Groundwater_P24', 'Depth_to_Groundwater_P25', 'Temperature_Bastia_Umbra', 'Temperature_Petrignano', 'Volume_C10_Petrignano', 'Hydrometry_Fiume_Chiascio_Petrignano']].resample('7D', on='Date').mean().reset_index(drop=False)

X = df.drop(['Depth_to_Groundwater_P24','Depth_to_Groundwater_P25','Date'], axis=1)
y1 = df.Depth_to_Groundwater_P24
y2 = df.Depth_to_Groundwater_P25

scaler = StandardScaler()
X = scaler.fit_transform(X)

model = xgb.XGBRegressor()
param_search = {'max_depth': range(1, 2, 2),
                'min_child_weight': range(1, 2, 2),
                'n_estimators' : [1000],
                'learning_rate' : [0.1]}

tscv = TimeSeriesSplit(n_splits=2)
gsearch = GridSearchCV(estimator=model, cv=tscv,
                        param_grid=param_search)
gsearch.fit(X, y1)

xgb_grid = xgb.XGBRegressor(**gsearch.best_params_)
xgb_grid.fit(X, y1)

ax = xgb.plot_importance(xgb_grid)
ax.figure.tight_layout()
ax.figure.savefig('test.png')

y_val = y1[-80:]
X_val = X[-80:]

y_pred = xgb_grid.predict(X_val)
print(mean_absolute_error(y_val, y_pred))
print(math.sqrt(mean_squared_error(y_val, y_pred)))

我绘制了一个隐藏了原始特征名称的特征重要性图：

如果我注释掉这两行：

scaler = StandardScaler()
X = scaler.fit_transform(X)

我得到输出：

如何将 scaler.fit_transform() 用于 X 并获得包含原始特征名称的特征重要性图？

Answer 1

这背后的原因是 StandardScaler returns 一个 numpy.ndarray 的特征值（与 pandas.DataFrame.values 形状相同，但未归一化）并且您需要转换它返回到具有相同列名的 pandas.DataFrame。

这是您的代码中需要更改的部分。

scaler = StandardScaler()
X = pd.DataFrame(scaler.fit_transform(X), columns=X.columns)

使用 sklearn 恢复 StandardScaler().fit_transform() 的特征名称

Recovering features names of StandardScaler().fit_transform() with sklearn

machine-learning

python-3.x

scikit-learn

xgboost