Using glmnet to find optimized model for a given number of predictors

Question

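I am trying to use LASSO for something slightly different from what it was originally designed for. I have a test with 22 different tasks, which are averaged to give a final score. I want to see which combination of a limited number of tasks best predicts the overall score, with the aim of creating a short form of the test.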

I am running the lasso using glmnet, and it runs as expected. I can then easily find the model at a given lambda using

coef(cvfit, s = s)

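However, I was wondering whether it is possible to specify n, the number of predictors with non-zero coefficients, rather than the penalty parameter?

I have set up a very inefficient way of doing this, shown below, by extracting the models from a grid of test lambdas, but I was wondering whether there is a more efficient way to do it.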

n_grid <- 20000
lambdas <- seq_len(n_grid) / n_grid ## Grid of lambdas to test
nvar <- integer(n_grid)
coeffs <- vector("list", n_grid)

for (j in seq_len(n_grid)) {

  coeffs[[j]] <- coef(cvfit, s = lambdas[j]) ## Get coefficient list at given lambda

  ## Count number of variables with non-zero coeff and subtract one because intercept is always non-zero
  nvar[j] <- sum(as.vector(coeffs[[j]]) != 0) - 1

}

getlambda <- function(numvar = 4) {

  ## lambdas is increasing, so the first match is the smallest lambda
  ## which resulted in the given number of non-zero coefficients
  idx <- min(which(nvar == numvar))

  coeffs[[idx]]

}

Answer 1

You can use rowSums().

library(dplyr)
(boston <- MASS::Boston %>% as_tibble())
#> # A tibble: 506 x 14
#>       crim    zn indus  chas   nox    rm   age   dis   rad   tax ptratio
#>  *   <dbl> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <int> <dbl>   <dbl>
#>  1 0.00632  18    2.31     0 0.538  6.58  65.2  4.09     1   296    15.3
#>  2 0.0273    0    7.07     0 0.469  6.42  78.9  4.97     2   242    17.8
#>  3 0.0273    0    7.07     0 0.469  7.18  61.1  4.97     2   242    17.8
#>  4 0.0324    0    2.18     0 0.458  7.00  45.8  6.06     3   222    18.7
#>  5 0.0690    0    2.18     0 0.458  7.15  54.2  6.06     3   222    18.7
#>  6 0.0298    0    2.18     0 0.458  6.43  58.7  6.06     3   222    18.7
#>  7 0.0883   12.5  7.87     0 0.524  6.01  66.6  5.56     5   311    15.2
#>  8 0.145    12.5  7.87     0 0.524  6.17  96.1  5.95     5   311    15.2
#>  9 0.211    12.5  7.87     0 0.524  5.63 100    6.08     5   311    15.2
#> 10 0.170    12.5  7.87     0 0.524  6.00  85.9  6.59     5   311    15.2
#> # ... with 496 more rows, and 3 more variables: black <dbl>, lstat <dbl>,
#> #   medv <dbl>

For the above data set (Boston housing), consider the model medv ~ . (medv regressed on all other variables).

library(glmnet)
tr_x <- model.matrix(medv ~ ., data = boston)[,-1]
tr_y <- boston$medv
cvfit <- glmnet(tr_x, tr_y)

head(t(coef(cvfit)))
    #> 6 x 14 sparse Matrix of class "dgCMatrix"
    #>    [[ suppressing 14 column names '(Intercept)', 'crim', 'zn' ... ]]
    #>                                                        
    #> s0 22.53281 . . . . . .         . . . . . .  .         
    #> s1 23.60072 . . . . . .         . . . . . . -0.08439977
    #> s2 23.67264 . . . . . 0.1278413 . . . . . . -0.15358093
    #> s3 21.44649 . . . . . 0.5694424 . . . . . . -0.19698136
    #> s4 19.42057 . . . . . 0.9714620 . . . . . . -0.23654740
    #> s5 17.57464 . . . . . 1.3377669 . . . . . . -0.27259852

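I think you have already done this part.

Note

It can be more convenient to transpose the coefficient matrix so that each variable becomes a column.
For t(coef(cvfit)), rowSums(t(coef(cvfit)) != 0) counts the number of non-zero elements in each row, i.e. for each lambda step (this count includes the intercept).
Next, we match numvar against these row sums and extract the coefficient values.

From s0 to s5, lambda decreases: the lambda at s0 is larger than the one at s5, so s0 is penalized more.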

head(cvfit$lambda)
#> [1] 6.777654 6.175546 5.626927 5.127046 4.671574 4.256564
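
To make the link between lambda and the number of selected variables concrete, here is a minimal sketch that puts both side by side for every step of the path. The names fit and path are just illustrative, and as.matrix() is used only so that base rowSums() works regardless of how the sparse matrix methods dispatch.

```r
library(glmnet)

x <- model.matrix(medv ~ ., data = MASS::Boston)[, -1]
y <- MASS::Boston$medv
fit <- glmnet(x, y)

# Transposed coefficients: one row per lambda step, one column per term
beta <- t(as.matrix(coef(fit)))

# One row per lambda step: the penalty and the number of non-zero
# coefficients, with the intercept column dropped
path <- data.frame(
  lambda    = fit$lambda,
  n_nonzero = rowSums(beta[, -1, drop = FALSE] != 0)
)
head(path)
```

As lambda shrinks down the rows, the count of non-zero coefficients tends to grow, although it is not guaranteed to hit every integer along the way.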

Subset coef with numvar

Based on these facts,

get_nparam <- function(mod, numvar) {
  beta <- coef(mod)
  non_zero <- rowSums(t(beta)[,-1] != 0) # ignore intercept
  min_lam <- which(non_zero == numvar) # numvar non-zero coef
  t(beta)[dplyr::last(min_lam),] # last index = smallest lambda
}

With this function, you get

get_nparam(cvfit, 4)
#>  (Intercept)         crim           zn        indus         chas 
#> 15.468034114  0.000000000  0.000000000  0.000000000  0.000000000 
#>          nox           rm          age          dis          rad 
#>  0.000000000  3.816165372  0.000000000  0.000000000  0.000000000 
#>          tax      ptratio        black        lstat 
#>  0.000000000 -0.606026131  0.001518042 -0.495954410

rm, ptratio, black, and lstat are non-zero, while the others are zero.
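
One caveat: the default lambda path does not necessarily contain a step with exactly numvar non-zero coefficients, in which case the function above has nothing to return. A hedged sketch of a guarded variant follows; get_nparam_safe is a hypothetical name of mine, not something from glmnet.

```r
library(glmnet)

x <- model.matrix(medv ~ ., data = MASS::Boston)[, -1]
y <- MASS::Boston$medv
fit <- glmnet(x, y)

# Hypothetical helper: same idea as get_nparam(), but it fails loudly
# when no step on the path has exactly numvar non-zero coefficients.
get_nparam_safe <- function(mod, numvar) {
  beta <- t(as.matrix(coef(mod)))                      # rows = lambda steps
  non_zero <- rowSums(beta[, -1, drop = FALSE] != 0)   # ignore intercept
  hits <- which(non_zero == numvar)
  if (length(hits) == 0) {
    stop("no lambda on the fitted path gives exactly ", numvar,
         " non-zero coefficients")
  }
  beta[max(hits), ]                                    # last index = smallest lambda
}

get_nparam_safe(fit, 4)
```

If the requested count is missing, refitting with a denser lambda grid (for example a larger nlambda) may produce it, though that is not guaranteed.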

Answer 2

After working with Blender's solution above, I realized that there is an even easier way to do this.

Using the Boston data set from the example:

library(dplyr)
library(glmnet)

(boston <- MASS::Boston %>% as_tibble())

tr_x <- model.matrix(medv ~ ., data = boston)[,-1]
tr_y <- boston$medv
cvfit <- glmnet(tr_x, tr_y)

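The cvfit object already contains all the components we need to find the answer for a given number of variables. df is the number of degrees of freedom, which for the lasso is the number of non-zero coefficients (excluding the intercept), i.e. the number of variables we are interested in. lambda is the lambda for each model.

So we can create a simple function that returns the best model for a given number of variables.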
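
As a quick way to see these two components together, the sketch below tabulates the smallest lambda for each value of df. The names fit and smallest_lambda are illustrative only.

```r
library(glmnet)

x <- model.matrix(medv ~ ., data = MASS::Boston)[, -1]
y <- MASS::Boston$medv
fit <- glmnet(x, y)

# df and lambda are parallel vectors, one entry per step of the path
stopifnot(length(fit$df) == length(fit$lambda))

# Smallest lambda observed for each number of non-zero coefficients;
# the names of the result are the df values that occur on the path
smallest_lambda <- tapply(fit$lambda, fit$df, min)
smallest_lambda
```

Any variable count that does not appear in names(smallest_lambda) is simply not reachable on this particular path.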

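get_nparam <- function(mod, numvar) {

  # use the model passed in (not the global cvfit): smallest lambda
  # on the path with exactly numvar non-zero coefficients
  coef(mod, s = min(mod$lambda[mod$df == numvar]))

}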

get_nparam(cvfit, 4)

#14 x 1 sparse Matrix of class "dgCMatrix"
#                       1
#(Intercept) 15.468034114
#crim         .          
#zn           .          
#indus        .          
#chas         .          
#nox          .          
#rm           3.816165372
#age          .          
#dis          .          
#rad          .          
#tax          .          
#ptratio     -0.606026131
#black        0.001518042
#lstat       -0.495954410
#
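
For building a short form of a test, the names of the selected predictors are often all that is needed. Here is a minimal sketch along the same lines; selected_vars is a hypothetical helper, and it returns an empty character vector when no step on the path has exactly numvar variables.

```r
library(glmnet)

x <- model.matrix(medv ~ ., data = MASS::Boston)[, -1]
y <- MASS::Boston$medv
fit <- glmnet(x, y)

# Hypothetical helper: names of the predictors kept at the smallest
# lambda that has exactly numvar non-zero coefficients
selected_vars <- function(mod, numvar) {
  lam <- mod$lambda[mod$df == numvar]
  if (length(lam) == 0) return(character(0))   # count not on the path
  beta <- as.matrix(coef(mod, s = min(lam)))   # (p + 1) x 1 matrix
  setdiff(rownames(beta)[beta[, 1] != 0], "(Intercept)")
}

selected_vars(fit, 4)
```

For numvar = 4 this should name the same four variables that are non-zero in the output above.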

Thanks again to Blender for the different solution, which set me on this path.
