Automatic variable selection - linear regression model

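In the MWE below, I have a dataset with 70 potential predictors to explain my variable price1. I would like to run a univariate analysis on all of the variables, but the glmulti package says I have too many predictors. How can a univariate analysis have too many predictors?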

*I could do this with a loop/apply, but I am looking for something more elaborate. This similar question here did not solve the problem either.

test <- read.csv(url("https://raw.githubusercontent.com/vincentarelbundock/Rdatasets/master/csv/Ecdat/Car.csv"))
library(glmulti)
glmulti.lm.out <- glmulti(data  = test, price1 ~ .,
                          level = 1,
                          method = "h",
                          maxK = 1,
                          confsetsize = 10,
                          fitfunction = "lm")

Error
Warning message:
In glmulti(y = "price1", data = test, level = 1, maxK = 1, method = "h",  :
  !Too many predictors.

This question is better suited to Cross Validated, but here's my two cents. Running an exhaustive search to find the best variables to include in a model is computationally heavy and gets out of hand quickly. Consider what you're asking the computer to do:

When you run an exhaustive search, the computer builds a model for every possible combination of variables. For models of size one, that's not too bad, because there are only 70 of them. But even for two-variable models, the computer has to fit n!/(r!(n-r)!) = 70!/(2!(68)!) = 2415 different models. Things spiral out of control from there.
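To make the growth concrete, here is a small sketch that counts the candidate models with base R's choose(). The variable names are only illustrative, and the cut-off at size 5 is arbitrary.

```r
# Number of candidate predictors, as in the question.
n_predictors <- 70

# How many distinct models exist for each model size from 1 to 5.
models_by_size <- choose(n_predictors, 1:5)
names(models_by_size) <- paste0("size_", 1:5)
print(models_by_size)

# Total number of subsets of the predictors (including the empty model)
# that a fully exhaustive search would have to consider.
total_models <- 2^n_predictors
print(format(total_models, scientific = TRUE))
```

The count for size 2 is the 2415 worked out above; the total over all sizes is 2^70, which is why an exhaustive search is not realistic here.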

As a workaround, I'll point you to the leaps package, which has the regsubsets function. With it you can run either forward or backward subset selection and find the most important variables in a stepwise manner. After running each, you may be able to toss out the variables that both runs omit and rerun your model with fewer predictors using glmulti, but no promises.

# Read the data and drop the first column (the row index)
test.data <- read.csv(url("https://raw.githubusercontent.com/vincentarelbundock/Rdatasets/master/csv/Ecdat/Car.csv"))[, 2:71]
library(leaps)

# Forward stepwise selection, keeping the single best model of each size
big_subset_model <- regsubsets(x = price1 ~ ., data = test.data, nbest = 1,
                               method = "forward", really.big = TRUE, nvmax = 70)
sum.model <- summary(big_subset_model)
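As a rough sketch of how the summary object could be used to shortlist variables, the example below runs the same kind of forward selection on the built-in mtcars data (a stand-in so the snippet runs without a download) and reads the selected variables out of the summary. Picking the model size by lowest BIC is my assumption here, not something the approach above requires; adjusted R-squared or Cp would work the same way.

```r
library(leaps)

# Stand-in data: mtcars ships with R; mpg plays the role of price1.
forward_model <- regsubsets(mpg ~ ., data = mtcars, nbest = 1,
                            method = "forward", nvmax = 10)
forward_summary <- summary(forward_model)

# Assumed criterion: choose the model size with the lowest BIC.
best_size <- which.min(forward_summary$bic)

# $which is a logical matrix: one row per model size, one column per term.
in_model <- forward_summary$which[best_size, ]
selected <- setdiff(names(in_model)[in_model], "(Intercept)")
print(selected)
```

The same steps with method = "backward" give a second shortlist, and the variables that appear in neither are candidates to drop before trying glmulti again.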

A simple solution for the univariate analysis using lapply.

test <- read.csv(url("https://raw.githubusercontent.com/vincentarelbundock/Rdatasets/master/csv/Ecdat/Car.csv"))

# Fit a one-predictor linear model and return its summary
reg <- function(indep_var, dep_var, data_source) {
  formula <- as.formula(paste(dep_var, "~", indep_var))
  res     <- lm(formula, data = data_source)
  summary(res)
}

# Use every column except the response itself as a predictor
predictors <- setdiff(colnames(test), "price1")
lapply(predictors, FUN = reg, dep_var = "price1", data_source = test)
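If a ranked table is more useful than a list of summaries, the results can be collected into a data frame. This is only a sketch: it uses the built-in mtcars data with mpg as the response so that it runs offline, the helper name univariate_fit is made up for illustration, and ranking by R-squared is one possible choice among several.

```r
# Hypothetical helper: fit one single-predictor model and keep a few statistics.
univariate_fit <- function(indep_var, dep_var, data_source) {
  fit <- lm(reformulate(indep_var, response = dep_var), data = data_source)
  fit_summary <- summary(fit)
  data.frame(predictor     = indep_var,
             r_squared     = fit_summary$r.squared,
             adj_r_squared = fit_summary$adj.r.squared,
             stringsAsFactors = FALSE)
}

# Every column except the response is tried as the single predictor.
predictors <- setdiff(colnames(mtcars), "mpg")
results <- do.call(rbind, lapply(predictors, univariate_fit,
                                 dep_var = "mpg", data_source = mtcars))

# Rank the univariate models from best to worst fit and show the top 10.
results <- results[order(-results$r_squared), ]
print(head(results, 10))
```

For the Car data, swapping in test for mtcars and "price1" for "mpg" should give the ten best single-predictor models, which is roughly what confsetsize = 10 was asking glmulti for.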