Linear Regression - Get Feature Importance using MinMaxScaler() - Extremely large coefficients

Question

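I'm trying to get the feature importances for a regression model. I have 58 independent variables and one dependent variable. Most of the independent variables are numerical and some are binary.

First I used this: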

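from matplotlib import pyplot
from sklearn.linear_model import LinearRegression

# dataset is a pandas DataFrame holding the 58 features and the target 'y'
X = dataset.drop(['y'], axis=1)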
y = dataset[['y']]

# define the model
model = LinearRegression()
# fit the model
model.fit(X, y)
# get importance
importance = model.coef_[0]
print(model.coef_)
print(importance)
# summarize feature importance
for i,v in enumerate(importance):
    print('Feature: %0d, Score: %.5f' % (i,v))
# plot feature importance
pyplot.bar([x for x in range(len(importance))], importance)
pyplot.show()

This gives the following result: Feature Importance Plot

Then I used MinMaxScaler() to scale the data before fitting the model:

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
dataset[dataset.columns] = scaler.fit_transform(dataset[dataset.columns])
print(dataset)

X = dataset.drop(['y'], axis=1)
y = dataset[['y']]

# define the model
model = LinearRegression()
# fit the model
model.fit(X, y)
# get importance
importance = model.coef_[0]
print(model.coef_)
print(importance)
# summarize feature importance
for i,v in enumerate(importance):
    print('Feature: %0d, Score: %.5f' % (i,v))
# plot feature importance
pyplot.bar([x for x in range(len(importance))], importance)
pyplot.show()

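This results in the following plot: Feature Importance Plot after using MinMaxScaler

As you can see in the upper left, the axis is scaled by 1e11, which means the largest value is about negative 60 billion. What am I doing wrong here? And is this even the right way to use MinMaxScaler?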

Answer 1

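In regression analysis, the magnitude of a coefficient is not necessarily related to its importance. The most common criterion for judging the importance of independent variables in regression analysis is the p-value: a small p-value indicates a statistically significant variable, while a large p-value means the variable is not statistically significant. You should only use coefficient magnitude as a measure of feature importance when your model penalizes variables, that is, when the optimization problem includes an L1 or L2 penalty, as in lasso or ridge regression.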
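As a rough illustration of the penalized case, the sketch below fits a lasso on synthetic data. The data, the alpha value, and the decision to standardize the features first are all assumptions made for the example, not something taken from the question.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption for illustration): only the first two
# of five features actually drive y.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

# Put the features on a common scale so the penalty treats them alike.
X_std = StandardScaler().fit_transform(X)

# alpha=0.1 is an arbitrary penalty strength chosen for the example.
lasso = Lasso(alpha=0.1).fit(X_std, y)

# With an L1 penalty, coefficient magnitude can serve as an importance score.
lasso_importance = np.abs(lasso.coef_)
print(lasso_importance)
```

In this setup the two informative features should end up with clearly larger magnitudes than the three noise features, which the penalty shrinks toward zero.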

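sklearn does not report p-values. I recommend running the same regression with statsmodels.OLS. For other models that expose it, such as trees and tree ensembles, you should use feature_importances_ to determine the individual importance of each independent variable. Note that not every estimator has this attribute; scikit-learn's neural network models, for example, do not.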

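By using model.coef_ as a measure of feature importance, you are only taking the magnitude of the betas into account. If that is really what you are interested in, try numpy.abs(model.coef_[0]), because betas can also be negative.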
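For instance, ranking features by absolute coefficient could look like the sketch below. The data is synthetic and noise-free, chosen only so the ordering is easy to check by eye.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (an assumption for illustration) with a noise-free linear target.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = 1.0 * X[:, 0] - 4.0 * X[:, 1] + 0.5 * X[:, 2]

model = LinearRegression().fit(X, y)

# y is 1-D here, so coef_ is already 1-D and needs no [0] index.
importance = np.abs(model.coef_)

# Feature indices ordered from largest to smallest magnitude.
ranking = np.argsort(importance)[::-1]
print(ranking)
```

The feature with coefficient -4 ranks first despite its sign. In the question, y is a one-column DataFrame, so coef_ is 2-D and the [0] index is needed there.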

As for MinMaxScaler(), you are using it correctly. However, you are transforming the entire dataset, when you should only rescale the independent variables.

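from sklearn.preprocessing import MinMaxScaler

# Separate the features from the target before scaling
X = dataset.drop(['y'], axis=1)
y = dataset['y']
# Fit the scaler on the features only; y keeps its original units
scaler = MinMaxScaler()
X = scaler.fit_transform(X)
print(X)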

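By calling scaler.fit_transform(dataset[dataset.columns]), you rescaled all the columns in the dataset object, including the dependent variable. In fact, your code is equivalent to scaler.fit_transform(dataset), since you selected all the columns in dataset.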

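In general, you should only rescale your data if you suspect that outliers are affecting your estimator. Once the data is rescaled, the beta coefficients are no longer interpretable (or at least not as intuitive). This happens because a given beta no longer represents the change in the dependent variable caused by a marginal change in the corresponding independent variable, measured in its original units.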
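To make the loss of interpretability concrete, here is a small sketch on synthetic single-feature data (an assumption for illustration). It compares the coefficient before and after min-max scaling the feature.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import MinMaxScaler

# Synthetic data (an assumption for illustration): one feature on a wide scale.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1000.0, size=(100, 1))
y = 0.5 * X[:, 0] + rng.normal(size=100)

raw_coef = LinearRegression().fit(X, y).coef_[0]

scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)
scaled_coef = LinearRegression().fit(X_scaled, y).coef_[0]

# After min-max scaling, a one-unit step spans the feature's whole range,
# so the coefficient is the raw coefficient times (max - min).
feature_range = scaler.data_range_[0]
print(raw_coef, scaled_coef, raw_coef * feature_range)
```

The scaled coefficient describes the effect of moving from the smallest to the largest observed value of the feature, not the effect of a one-unit change.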

Finally, this should not be an issue, but to be safe, make sure the scaler is not changing your binary independent variables.
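One way to run that check, sketched on a small hypothetical feature matrix with one numeric column and one 0/1 column:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical feature matrix: column 0 is numeric, column 1 is a 0/1 flag.
X = np.array([[10.0, 0.0],
              [20.0, 1.0],
              [40.0, 1.0],
              [30.0, 0.0]])

X_scaled = MinMaxScaler().fit_transform(X)

# A 0/1 column already has min 0 and max 1, so min-max scaling leaves it as is.
binary_unchanged = np.array_equal(X_scaled[:, 1], X[:, 1])
print(binary_unchanged)
```

This holds only when the column is coded as 0/1 and both values occur in the data; a flag coded as 1/2, for example, would be remapped to 0/1.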
