glmnet 的公式接口

Formula interface for glmnet

在过去的几个月里,我参与了很多项目,在这些项目中我使用了 glmnet 包来拟合弹性网络模型。它很棒,但与大多数 R 建模函数相比,界面相当简陋。特别是,您必须给出响应向量和预测矩阵,而不是指定公式和数据框。您还会失去常规界面提供的许多生活质量方面的东西,例如对因素的合理(?)处理、缺失值、将变量放入正确的顺序等。

所以我通常最终会编写自己的代码来重新创建 formula/data 框架界面。由于客户保密问题,我最终也留下了这段代码,不得不为下一个项目重新编写。我想我还是硬着头皮创建一个实际的包来做这件事。但是,在我这样做之前有几个问题:

嗯,好像没有预建的公式界面,所以我继续自己制作。您可以从 Github 下载它:https://github.com/Hong-Revo/glmnetUtils

或者在 R 中,使用 devtools::install_github:

install.packages("devtools")
library(devtools)
install_github("hong-revo/glmnetUtils")
library(glmnetUtils)

来自自述文件:

Some quality-of-life functions to streamline the process of fitting elastic net models with glmnet, specifically:

  • glmnet.formula provides a formula/data frame interface to glmnet.
  • cv.glmnet.formula does a similar thing for cv.glmnet.
  • Methods for predict and coef for both the above.
  • A function cvAlpha.glmnet to choose both the alpha and lambda parameters via cross-validation, following the approach described in the help page for cv.glmnet. Optionally does the cross-validation in parallel.
  • Methods for plot, predict and coef for the above.

顺便说一下,在写以上内容时,我想我意识到了为什么以前没有人这样做过。 R 处理模型框架和模型矩阵的核心是一个 terms 对象,它包括一个矩阵,每个变量一行,每个主效应和交互作用一列。实际上,这(至少)大致是一个 p x p 矩阵,其中 p 是数字模型中的变量。当 p 为 16000 时,这在当今宽数据中很常见,生成的矩阵大小约为 1 GB。

不过,我在使用这些对象时(还)没有遇到任何问题。如果它成为一个主要问题,我会看看是否可以找到解决方法。


2016 年 10 月更新

我已将更新推送到存储库,以解决上述问题以及与因素相关的问题。来自文档:

There are two ways in which glmnetUtils can generate a model matrix out of a formula and data frame. The first is to use the standard R machinery comprising model.frame and model.matrix; and the second is to build the matrix one variable at a time. These options are discussed and contrasted below.

Using model.frame

This is the simpler option, and the one that is most compatible with other R modelling functions. The model.frame function takes a formula and data frame and returns a model frame: a data frame with special information attached that lets R make sense of the terms in the formula. For example, if a formula includes an interaction term, the model frame will specify which columns in the data relate to the interaction, and how they should be treated. Similarly, if the formula includes expressions like exp(x) or I(x^2) on the RHS, model.frame will evaluate these expressions and include them in the output.

The major disadvantage of using model.frame is that it generates a terms object, which encodes how variables and interactions are organised. One of the attributes of this object is a matrix with one row per variable, and one column per main effect and interaction. At minimum, this is (approximately) a p x p square matrix where p is the number of main effects in the model. For wide datasets with p > 10000, this matrix can approach or exceed a gigabyte in size. Even if there is enough memory to store such an object, generating the model matrix can take a significant amount of time.

Another issue with the standard R approach is the treatment of factors. Normally, model.matrix will turn an N-level factor into an indicator matrix with N-1 columns, with one column being dropped. This is necessary for unregularised models as fit with lm and glm, since the full set of N columns is linearly dependent. With the usual treatment contrasts, the interpretation is that the dropped column represents a baseline level, while the coefficients for the other columns represent the difference in the response relative to the baseline.

This may not be appropriate for a regularised model as fit with glmnet. The regularisation procedure shrinks the coefficients towards zero, which forces the estimated differences from the baseline to be smaller. But this only makes sense if the baseline level was chosen beforehand, or is otherwise meaningful as a default; otherwise it is effectively making the levels more similar to an arbitrarily chosen level.

Manually building the model matrix

To deal with the problems above, glmnetUtils by default will avoid using model.frame, instead building up the model matrix term-by-term. This avoids the memory cost of creating a terms object, and can be noticeably faster than the standard approach. It will also include one column in the model matrix for all levels in a factor; that is, no baseline level is assumed. In this situation, the coefficients represent differences from the overall mean response, and shrinking them to zero is meaningful (usually).

The main downside of not using model.frame is that the formula can only be relatively simple. At the moment, only straightforward formulas like y ~ x1 + x2 + ... + x_p are handled by the code, where the x's are columns already present in the data. Interaction terms and computed expressions are not supported. Where possible, you should compute such expressions beforehand.


2017 年 4 月更新

几经波折,终于on CRAN.