How to find the importance of the features for a logistic regression model?

Question

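I have a binary prediction model trained with the logistic regression algorithm. I want to know which features (predictors) are more important for the decision of the positive or the negative class. I know there is the coef_ attribute from the scikit-learn package, but I don't know whether it is enough to determine importance. Another question is how to evaluate the coef_ values in terms of their importance for the negative and positive classes. I have also read about standardized regression coefficients, but I don't know what they are.

Say there are features such as tumor size, tumor weight, etc. used to decide whether a test case is malignant or not malignant. I want to know which of these features are more important for the malignant and the not-malignant prediction. Does that make sense?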

Answer 1

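To get an idea of the "influence" of a given parameter in a linear classification model (logistic regression being one of those), one of the simplest options is to consider the magnitude of its coefficient multiplied by the standard deviation of the corresponding parameter in the data.

Consider this example: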

import numpy as np
from sklearn.linear_model import LogisticRegression

x1 = np.random.randn(100)
x2 = 4*np.random.randn(100)
x3 = 0.5*np.random.randn(100)
# Add independent noise to every sample (randn() with no argument draws a single shared scalar)
y = (3 + x1 + x2 + x3 + 0.2*np.random.randn(100)) > 0
X = np.column_stack([x1, x2, x3])

m = LogisticRegression()
m.fit(X, y)

# The estimated coefficients will all be around 1:
print(m.coef_)

# Those values, however, will show that the second parameter
# is more influential
print(np.std(X, 0)*m.coef_)

An alternative way to get a similar result is to examine the coefficients of the model fit on standardized parameters:

m.fit(X / np.std(X, 0), y)
print(m.coef_)
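
As a minimal sketch of the same idea, the scaling can be folded into a pipeline and the standardized coefficients ranked by absolute value. This is only an illustration under stated assumptions: the fixed seed, the sample size of 500, and the feature names x1, x2, x3 are made up here, and StandardScaler also centers the data, which the snippet above does not do (centering is expected to affect the intercept rather than the slopes). The sign of each coefficient is read relative to the class listed second in classes_.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)  # fixed seed: an assumption, for reproducibility only
n = 500                         # sample size chosen for illustration
X = np.column_stack([
    rng.normal(size=n),         # x1, standard deviation about 1
    4 * rng.normal(size=n),     # x2, standard deviation about 4
    0.5 * rng.normal(size=n),   # x3, standard deviation about 0.5
])
y = (3 + X.sum(axis=1) + 0.2 * rng.normal(size=n)) > 0

# Scale every feature to unit variance, then fit the classifier.
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X, y)

clf = pipe.named_steps["logisticregression"]
coefs = clf.coef_[0]            # one row of coefficients for a binary problem
names = ["x1", "x2", "x3"]      # hypothetical feature names

# Rank features by the magnitude of their standardized coefficient.
order = np.argsort(-np.abs(coefs))
for i in order:
    direction = "raises" if coefs[i] > 0 else "lowers"
    print(f"{names[i]}: {coefs[i]:+.3f} ({direction} the log-odds of class {clf.classes_[1]})")
```

A positive coefficient pushes predictions toward classes_[1] (here True) and a negative one toward classes_[0]; the magnitude, after standardization, gives a rough ranking of influence.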

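Note that this is the most basic approach, and a number of other techniques exist for finding feature importance or parameter influence (using p-values, bootstrap scores, various "discriminative indices", etc.).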
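
One possible reading of "bootstrap scores" is to refit the model on resampled data and look at how much each standardized coefficient varies. The following is a rough sketch of that idea, not a definitive procedure: the seed, the sample size, the 200 resamples, the 95% percentile interval, and the feature names are all assumptions made for illustration, and the features are standardized once up front rather than inside every resample.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)  # fixed seed: an assumption, for reproducibility only
n = 300
X = np.column_stack([
    rng.normal(size=n),
    4 * rng.normal(size=n),
    0.5 * rng.normal(size=n),
])
y = (3 + X.sum(axis=1) + 0.2 * rng.normal(size=n)) > 0
Xs = X / X.std(axis=0)          # standardized once, a simplification

n_boot = 200                    # number of resamples, chosen arbitrarily
boot_coefs = []
for _ in range(n_boot):
    idx = rng.integers(0, n, size=n)   # draw rows with replacement
    if np.unique(y[idx]).size < 2:
        continue                       # skip resamples that contain a single class
    model = LogisticRegression().fit(Xs[idx], y[idx])
    boot_coefs.append(model.coef_[0])
boot_coefs = np.array(boot_coefs)

# Percentile interval of each coefficient across the resamples.
lo, hi = np.percentile(boot_coefs, [2.5, 97.5], axis=0)
for name, low, high in zip(["x1", "x2", "x3"], lo, hi):  # hypothetical names
    print(f"{name}: 95% bootstrap interval [{low:.2f}, {high:.2f}]")
```

A coefficient whose interval stays well away from zero is more plausibly a stable contributor than one whose interval straddles zero.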

I am pretty sure you would get more interesting answers at https://stats.stackexchange.com/.

python

machine-learning

scikit-learn

logistic-regression