Variables that best discriminate groups based on the glmnet package

Question

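Ultimately, I want to use elastic net regression to find a set of proteins that best discriminates three groups (low, medium, high).

Here is a reproducible example: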

library(glmnet)

tempcv <- cv.glmnet(x=as.matrix(iris[,-5]), y=iris[,5], family="multinomial", 
                    nfolds=20, alpha=0.5)
coefsMin <- coef(tempcv, s="lambda.min")

This gives me:

$setosa
5 x 1 sparse Matrix of class "dgCMatrix"
                     1
(Intercept)  15.119192
Sepal.Length -1.897589
Sepal.Width   5.455627
Petal.Length -2.807969
Petal.Width  -5.942061

$versicolor
5 x 1 sparse Matrix of class "dgCMatrix"
                     1
(Intercept)   4.795799
Sepal.Length  1.726752
Sepal.Width   .       
Petal.Length -1.160588
Petal.Width  -1.978123

$virginica
5 x 1 sparse Matrix of class "dgCMatrix"
                      1
(Intercept)  -19.914991
Sepal.Length   .       
Sepal.Width   -3.925362
Petal.Length   4.536932
Petal.Width    9.236506

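In this case, going by the absolute value of each coefficient, can I interpret the result like this?
The two variables that best distinguish "setosa" from the other two groups ("versicolor" and "virginica") are Sepal.Width (5.46) and Petal.Width (-5.94).

If this is wrong, how can I select the variables/features that best discriminate the groups?

Thanks a lot!!!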

Answer 1

With glmnet, the coefficients you get back are on the same scale as the inputs. From the vignette:

Note that for family = "gaussian", glmnet standardizes y to have unit variance before computing its lambda sequence (and then unstandardizes the resulting coefficients).

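In your example, the independent variables are not scaled, so the magnitude of each coefficient depends on the scale of its variable. For example, the Sepal.Width coefficient means that each one-unit increase in Sepal.Width raises the log-odds by 5.46. But you can see that the variables span very different ranges: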

apply(iris[,1:4],2,range)
     Sepal.Length Sepal.Width Petal.Length Petal.Width
[1,]          4.3         2.0          1.0         0.1
[2,]          7.9         4.4          6.9         2.5
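
To make the dependence on scale concrete, here is a minimal sketch (not part of the original answer) that converts each coefficient from "per unit" to "per standard deviation" by multiplying it by the standard deviation of its predictor. The names `fit`, `raw`, `sds` and `per_sd` and the seed are illustrative assumptions; the fitted values will vary with the random fold assignment.

```r
library(glmnet)

set.seed(1)  # cv.glmnet assigns folds at random; the seed is an arbitrary choice
x <- as.matrix(iris[, -5])
fit <- cv.glmnet(x = x, y = iris[, 5], family = "multinomial",
                 nfolds = 20, alpha = 0.5)

# coef() returns one sparse 5 x 1 matrix per class; drop the intercept row
# and bind the classes into a 4 x 3 matrix (rows = variables, columns = classes)
coefs <- coef(fit, s = "lambda.min")
raw <- sapply(coefs, function(m) as.matrix(m)[-1, 1])

# Multiply every row by the standard deviation of its predictor, so each
# entry is the change in the linear predictor per one-SD change
sds <- apply(x, 2, sd)
per_sd <- raw * sds
round(per_sd, 2)
```

After this rescaling the columns of `per_sd` can be compared within a class, which is not the case for the raw coefficients.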

Your interpretation only holds if you scale the independent variables before applying the lasso.
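
As a rough sketch of what that could look like (an illustration under assumptions, not the answerer's own code), the predictors can be passed through `scale()` before fitting, and the absolute coefficients then ranked within each class. The names `x_scaled`, `fit_scaled` and `ranked` are made up for this example, and `standardize = FALSE` is set only because the inputs are already scaled by hand.

```r
library(glmnet)

set.seed(1)  # arbitrary seed; fold assignment in cv.glmnet is random
x_scaled <- scale(as.matrix(iris[, -5]))  # centre to mean 0, scale to SD 1
fit_scaled <- cv.glmnet(x = x_scaled, y = iris[, 5], family = "multinomial",
                        nfolds = 20, alpha = 0.5, standardize = FALSE)

coefs <- coef(fit_scaled, s = "lambda.min")

# Per class: absolute coefficients without the intercept, largest first
ranked <- lapply(coefs, function(m) {
  b <- as.matrix(m)[-1, 1]
  sort(abs(b), decreasing = TRUE)
})
ranked$setosa
```

Variables shrunk to zero for a class appear at the bottom of that class's ranking with a value of 0.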

One option is to use vip to infer the variable importance; you can see this for more examples.

r

lasso-regression

glmnet