Automatic variable selection – Regression linear model
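In the MWE below, I have a dataset with 70 potential predictors to explain my variable price1. I would like to run a univariate analysis on all of the variables, but the package glmulti says I have too many predictors. How can a univariate analysis have too many predictors?
*I could do this with a loop/apply, but I am looking for something more elaborate. This similar question here did not solve the problem either.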
test <- read.csv(url("https://raw.githubusercontent.com/vincentarelbundock/Rdatasets/master/csv/Ecdat/Car.csv"))
library(glmulti)
glmulti.lm.out <- glmulti(data = test, price1 ~ .,
level = 1,
method = "h",
maxK = 1,
confsetsize = 10,
fitfunction = "lm")
Error
Warning message:
In glmulti(y = "price1", data = test, level = 1, maxK = 1, method = "h", :
!Too many predictors.
This question is more geared for CrossValidated, but here's my two cents. Running an exhaustive search to find the best variables to include in a model is very computationally heavy and gets out of hand really quickly. Consider what you're asking the computer to do:
When you're running an exhaustive search, the computer is building a model for every possible combination of variables. For a model of size one, that's not too bad because that's only 70 models. But even for a two-variable model, the computer has to run n!/(r!(n-r)!) = 70!/(2!(68)!) = 2415 different models. Things spiral out of control from there.
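To make the growth concrete, here is a small base R sketch of the count above. It assumes 70 candidate predictors, as in the question; the variable names are just for illustration.

```r
# Number of candidate predictors, as stated in the question
n_predictors <- 70

# Number of distinct models with exactly k predictors is choose(n, k)
sizes <- 1:5
models_per_size <- choose(n_predictors, sizes)
print(data.frame(size = sizes, models = models_per_size))

# Every non-empty subset of the predictors, of any size: 2^n - 1
total_models <- 2^n_predictors - 1
print(format(total_models, scientific = TRUE, digits = 3))
```

The size-two row reproduces the 2415 figure, and the total over all sizes is on the order of 1e21 models.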
As a workaround, I'll point you to the leaps package, which has the regsubsets function. With it, you can run either a forward or a backward subset selection and find the most important variables in a stepwise manner. After running each, you may be able to toss out the variables that are omitted from each and run your model with fewer predictors using glmulti, but no promises.
test.data <- read.csv(url("https://raw.githubusercontent.com/vincentarelbundock/Rdatasets/master/csv/Ecdat/Car.csv"))[, 2:71]
library(leaps)
big_subset_model <- regsubsets(x = price1 ~ ., data = test.data, nbest = 1,
method = "forward", really.big = TRUE, nvmax = 70)
sum.model <- summary(big_subset_model)
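As a rough sketch of what you can do with such a summary object, the snippet below runs the same kind of forward search on the built-in mtcars data (a stand-in so it runs without a download; mpg plays the role of price1) and pulls out the variables kept at the size with the lowest BIC. Picking the size by BIC is just one possible choice, not something the approach requires.

```r
library(leaps)

# Stand-in data: mtcars ships with R and has 10 candidate predictors for mpg
fit <- regsubsets(mpg ~ ., data = mtcars, nbest = 1,
                  method = "forward", nvmax = 10)
s <- summary(fit)

# s$which is a logical matrix: one row per model size, one column per term
print(s$which)

# Model size with the lowest BIC (adjr2 and cp are also in the summary)
best_size <- which.min(s$bic)

# Coefficients of the selected model, intercept included
best_coef <- coef(fit, best_size)
print(best_coef)

# Names of the predictors kept, which could be passed on to a smaller search
kept <- setdiff(names(best_coef), "(Intercept)")
print(kept)
```

The same calls should apply to big_subset_model and sum.model, with the caveat that factor columns show up as dummy-variable names rather than the original column names.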
A simple solution for the univariate analysis using lapply.
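test <- read.csv(url("https://raw.githubusercontent.com/vincentarelbundock/Rdatasets/master/csv/Ecdat/Car.csv"))
# Fit dep_var against a single predictor and return the model summary
reg <- function(indep_var, dep_var, data_source) {
  formula <- as.formula(paste(dep_var, "~", indep_var))
  res <- lm(formula, data = data_source)
  summary(res)
}
# Skip the response itself so that price1 is not regressed on price1
predictors <- setdiff(colnames(test), "price1")
lapply(predictors, FUN = reg, dep_var = "price1", data_source = test)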
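If a list of 70 printed summaries is too much to read, one option is to collect a few numbers per model into a single table and sort it. The sketch below does that on the built-in mtcars data (a stand-in, with mpg in place of price1) so it runs on its own. The helper name fit_one and the choice of R-squared, AIC and the overall F-test p-value are illustrative assumptions, not part of the answer above.

```r
# Stand-in data: mtcars ships with R; mpg plays the role of price1
dat <- mtcars
response <- "mpg"
predictors <- setdiff(colnames(dat), response)

# Fit one single-predictor lm and keep a few summary numbers (hypothetical helper)
fit_one <- function(pred) {
  fit <- lm(reformulate(pred, response = response), data = dat)
  s <- summary(fit)
  f <- s$fstatistic
  data.frame(
    predictor = pred,
    r_squared = s$r.squared,
    aic = AIC(fit),
    # p-value of the overall F test, which also works for factor predictors
    p_value = unname(pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE)),
    stringsAsFactors = FALSE
  )
}

# One row per predictor, best (lowest AIC) first
univariate <- do.call(rbind, lapply(predictors, fit_one))
univariate <- univariate[order(univariate$aic), ]
print(univariate, row.names = FALSE)
```

For the Car data, swapping in dat <- test and response <- "price1" should give the equivalent table. Keep in mind that ranking by p-value across 70 separate fits is a screening step, not a substitute for a proper selection procedure.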