Automate regression with specific dependent and independent variables

MVE: Let this be the dataset:

data <- data.frame(year = rep(seq(1966,2015,1), 8), 
               county = c(rep('prva', 50), rep('druga', 50), rep('treća', 50), rep('četvrta', 50),
                          rep('peta', 50), rep('šesta', 50), rep('sedma', 50), rep('osma', 50)),
               crime1 = runif(400), crime2 = runif(400), crime3 = runif(400), 
               uvar1 = runif(400), uvar2 = runif(400), uvar3 = runif(400),
               var1 = runif(400), var2 = runif(400), var3 = runif(400), var4 = runif(400), var5 = runif(400))

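Suppose crime1, crime2 and crime3 are the specific dependent variables, uvar1, uvar2 and uvar3 are the specific independent variables, and var1, var2, etc. are other covariates. What I want to do is automate the regressions.

That is, I want to get the results of this code: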

plm(log(crime1) ~ log(uvar1) + log(var1) + log(var2) + log(var3) + log(var4), model = 'within', effect = 'twoways', data = data)

plm(log(crime2) ~ log(uvar2) + log(var1) + log(var2) + log(var3) + log(var4), model = 'within', effect = 'twoways', data = data)

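and so on, but without having to write 20 lines of code for every model I estimate.

From looking at similar questions, this is as far as I got: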

crime <- c('crime1', 'crime2', 'crime3')
plm.results <- lapply(data[, crime], function(y) plm(y ~ var1 + var2 + var3 + var4, 
                                                     model = 'within', effect ='twoways', data = data))

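This certainly helps with my dependent variables, but I don't know how to include the specific independent variable in each of these estimates. To clarify again: I want uvar1 to be in the first regression but not in the others, and so on.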

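The formula function is useful when creating multiple sets of models. You can build the variants with a combination of paste0, formula and lapply, iterating over the indices 1 to 3.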

library(plm)

# remember to call set.seed when sampling from distributions
set.seed(123)

# regDF is assumed to be the question's data frame (`data`), rebuilt after set.seed(123)
regDF = data

# a helper function to create "log(var)" from "var"
fn_appendLog = function(x) {
  paste0("log(", x, ")")
}

modelList = lapply(1:3, function(x) {

  # shared covariates: every column whose name starts with "v"
  indepVars2 = Reduce(function(x, y) paste(x, y, sep = "+"),
                      lapply(colnames(regDF)[grepl("^v", colnames(regDF))], fn_appendLog))

  #> indepVars2
  #[1] "log(var1)+log(var2)+log(var3)+log(var4)+log(var5)"

  # the specific independent variable for this index
  indepVars1 = fn_appendLog(paste0("uvar", x))

  # the specific dependent variable for this index
  depVar = fn_appendLog(paste0("crime", x))

  formulaVar = formula(paste0(depVar, " ~ ", indepVars1, "+", indepVars2))

  #> formulaVar
  #log(crime1) ~ log(uvar1) + log(var1) + log(var2) + log(var3) +  log(var4) + log(var5)

  modelObj = plm(formulaVar, model = 'within', effect = 'twoways', data = regDF)

})
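
To see which formula each index produces without fitting any model, the formula-building part can be run on its own in base R. This is only a sketch: the names covars and formulaList are made up for illustration, and the covariates are hard-coded as var1 to var5 to mirror what the "^v" pattern selects above.

```r
fn_appendLog <- function(x) paste0("log(", x, ")")

# Shared covariates (hard-coded here; assumed to match the columns starting with "v")
covars <- paste(fn_appendLog(paste0("var", 1:5)), collapse = "+")

# One formula per index: log(crime_i) ~ log(uvar_i) + shared covariates
formulaList <- lapply(1:3, function(i) {
  as.formula(paste0(fn_appendLog(paste0("crime", i)), " ~ ",
                    fn_appendLog(paste0("uvar", i)), "+", covars))
})
names(formulaList) <- paste0("crime", 1:3)

# all.vars lists the variables each formula refers to
print(lapply(formulaList, all.vars))
```

Each element should pair crime_i with uvar_i only, which is the behaviour asked for in the question.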

Summary:

summary(modelList[[1]])

#> summary(modelList[[1]])
#Twoways effects Within Model
#
#Call:
#plm(formula = formulaVar, data = regDF, effect = "twoways", model = "within")
#
#Balanced Panel: n=50, T=8, N=400
#
#Residuals :
#   Min. 1st Qu.  Median 3rd Qu.    Max. 
# -5.730  -0.396   0.116   0.599   1.520 
#
#Coefficients :
#             Estimate Std. Error t-value Pr(>|t|)
#log(uvar1)  0.0393871  0.0490891  0.8024   0.4229
#log(var1)  -0.0369356  0.0541029 -0.6827   0.4953
#log(var2)  -0.0455269  0.0543664 -0.8374   0.4030
#log(var3)   0.0150516  0.0520347  0.2893   0.7726
#log(var4)  -0.0034534  0.0441506 -0.0782   0.9377
#log(var5)  -0.0109038  0.0527446 -0.2067   0.8363
#
#Total Sum of Squares:    302.23
#Residual Sum of Squares: 300.6
#R-Squared:      0.0053896
#Adj. R-Squared: 0.0045407
#F-statistic: 0.304357 on 6 and 337 DF, p-value: 0.93448

Explanation:

There are two types of independent variables: the first is the specific one, uvar1, and the others are var1...varN.

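1) colnames(regDF)[grepl("^v",colnames(regDF))] gives us the list of all variables in regDF whose names match the pattern of starting with the letter 'v'. The caret marks the start of the string, and $ would mark the end of the string. The output at this stage is c("var1","var2",...,"var5")

2) We need the log variants of this vector of variables, so we pass them via lapply to the function fn_appendLog, which gives the list output list("log(var1)","log(var2)",...,"log(var5)")

3) Next, we need to turn these variables into log(var1)+log(var2)...+log(var5)

4) For this we use the function Reduce with the function paste(x,y,sep="+"), which combines the result so far with the next element of the list above, using "+" as the separator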

   step1 = (log(var1)+log(var2))
   step2 = (log(var1)+log(var2)) + log(var3)
   step3 = (log(var1)+log(var2)+log(var3))+ log(var4) and so on

5) The function Reduce applies this function along the list and aggregates the output into a single vector, giving the final output log(var1)+log(var2)+log(var3)+log(var4)+log(var5)
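
Steps 1) to 5) can be tried in base R without plm. In the sketch below, cols is a made-up stand-in for colnames(regDF), and accumulate = TRUE is passed to Reduce only so that the intermediate strings are visible. paste with collapse = "+" is shown as a shorter way to get the same string.

```r
# Stand-in for colnames(regDF); the names follow the question's data frame
cols <- c("year", "county", "crime1", "uvar1", "var1", "var2", "var3", "var4", "var5")

fn_appendLog <- function(x) paste0("log(", x, ")")

# Step 1: keep the names that start with "v" ("uvar1" starts with "u", so it is dropped)
vars <- cols[grepl("^v", cols)]

# Step 2: wrap each name in log()
logVars <- lapply(vars, fn_appendLog)

# Steps 3-5: fold the list into one string; accumulate = TRUE keeps every intermediate result
steps <- Reduce(function(x, y) paste(x, y, sep = "+"), logVars, accumulate = TRUE)
print(steps)

# The last accumulated value is what Reduce returns without accumulate = TRUE
rhs <- steps[[length(steps)]]

# A shorter equivalent of the Reduce call
rhs2 <- paste(fn_appendLog(vars), collapse = "+")
```

Both rhs and rhs2 hold the right-hand side that is later pasted into the formula.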

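This may seem daunting at first, but as you use these functions often and explore the examples, they will become second nature in no time. The best way to understand a function, say lapply, is to read its documentation (?lapply) from start to finish, run the listed examples, modify the arguments and get familiar with it. I hope this sheds some light on your query.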