使用来自大型数据帧的单个预测变量在单个响应的 R 中执行线性模型,并为每一列重复
Performing a linear model in R of a single response with a single predictor from a large dataframe and repeat for each column
从标题看可能不是很清楚,但我想做的是:
我有一个数据框 df,比如说,200 列,前 80 列是响应变量(y1,y2,y3,...),其余 120 列是预测变量(x1, x2, x3, ...).
我想为每对计算一个线性模型 – lm(yi ~ xi, data = df)
。
我在网上浏览过的许多问题和解决方案要么有一个固定的响应,要么有许多预测变量,或者相反,使用 lapply()
及其相关函数。
熟悉它的人能指出我正确的步骤吗?
使用tidyverse
library(tidyverse)
library(broom)
df <- mtcars
y <- names(df)[1:3]
x <- names(df)[4:7]
result <- expand_grid(x, y) %>%
rowwise() %>%
mutate(frm = list(reformulate(x, y)),
model = list(lm(frm, data = df)))
result$model <- purrr::set_names(result$model, nm = paste0(result$y, " ~ ", result$x))
result$model[1:2]
#> $`mpg ~ hp`
#>
#> Call:
#> lm(formula = frm, data = df)
#>
#> Coefficients:
#> (Intercept) hp
#> 30.09886 -0.06823
#>
#>
#> $`cyl ~ hp`
#>
#> Call:
#> lm(formula = frm, data = df)
#>
#> Coefficients:
#> (Intercept) hp
#> 3.00680 0.02168
map_df(result$model, tidy)
#> # A tibble: 24 x 5
#> term estimate std.error statistic p.value
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 (Intercept) 30.1 1.63 18.4 6.64e-18
#> 2 hp -0.0682 0.0101 -6.74 1.79e- 7
#> 3 (Intercept) 3.01 0.425 7.07 7.41e- 8
#> 4 hp 0.0217 0.00264 8.23 3.48e- 9
#> 5 (Intercept) 21.0 32.6 0.644 5.25e- 1
#> 6 hp 1.43 0.202 7.08 7.14e- 8
#> 7 (Intercept) -7.52 5.48 -1.37 1.80e- 1
#> 8 drat 7.68 1.51 5.10 1.78e- 5
#> 9 (Intercept) 14.6 1.58 9.22 2.93e-10
#> 10 drat -2.34 0.436 -5.37 8.24e- 6
#> # ... with 14 more rows
map_df(result$model, glance)
#> # A tibble: 12 x 12
#> r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 0.602 0.589 3.86 45.5 1.79e- 7 1 -87.6 181. 186.
#> 2 0.693 0.683 1.01 67.7 3.48e- 9 1 -44.6 95.1 99.5
#> 3 0.626 0.613 77.1 50.1 7.14e- 8 1 -183. 373. 377.
#> 4 0.464 0.446 4.49 26.0 1.78e- 5 1 -92.4 191. 195.
#> 5 0.490 0.473 1.30 28.8 8.24e- 6 1 -52.7 111. 116.
#> 6 0.504 0.488 88.7 30.5 5.28e- 6 1 -188. 382. 386.
#> 7 0.753 0.745 3.05 91.4 1.29e-10 1 -80.0 166. 170.
#> 8 0.612 0.599 1.13 47.4 1.22e- 7 1 -48.3 103. 107.
#> 9 0.789 0.781 57.9 112. 1.22e-11 1 -174. 355. 359.
#> 10 0.175 0.148 5.56 6.38 1.71e- 2 1 -99.3 205. 209.
#> 11 0.350 0.328 1.46 16.1 3.66e- 4 1 -56.6 119. 124.
#> 12 0.188 0.161 114. 6.95 1.31e- 2 1 -196. 398. 402.
#> # ... with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
由 reprex package (v0.3.0)
于 2020-12-11 创建
从标题看可能不是很清楚,但我想做的是:
我有一个数据框 df,比如说,200 列,前 80 列是响应变量(y1,y2,y3,...),其余 120 列是预测变量(x1, x2, x3, ...).
我想为每对计算一个线性模型 –
lm(yi ~ xi, data = df)
。我在网上浏览过的许多问题和解决方案要么有一个固定的响应,要么有许多预测变量,或者相反,使用
lapply()
及其相关函数。
熟悉它的人能指出我正确的步骤吗?
使用tidyverse
library(tidyverse)
library(broom)
df <- mtcars
y <- names(df)[1:3]
x <- names(df)[4:7]
result <- expand_grid(x, y) %>%
rowwise() %>%
mutate(frm = list(reformulate(x, y)),
model = list(lm(frm, data = df)))
result$model <- purrr::set_names(result$model, nm = paste0(result$y, " ~ ", result$x))
result$model[1:2]
#> $`mpg ~ hp`
#>
#> Call:
#> lm(formula = frm, data = df)
#>
#> Coefficients:
#> (Intercept) hp
#> 30.09886 -0.06823
#>
#>
#> $`cyl ~ hp`
#>
#> Call:
#> lm(formula = frm, data = df)
#>
#> Coefficients:
#> (Intercept) hp
#> 3.00680 0.02168
map_df(result$model, tidy)
#> # A tibble: 24 x 5
#> term estimate std.error statistic p.value
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 (Intercept) 30.1 1.63 18.4 6.64e-18
#> 2 hp -0.0682 0.0101 -6.74 1.79e- 7
#> 3 (Intercept) 3.01 0.425 7.07 7.41e- 8
#> 4 hp 0.0217 0.00264 8.23 3.48e- 9
#> 5 (Intercept) 21.0 32.6 0.644 5.25e- 1
#> 6 hp 1.43 0.202 7.08 7.14e- 8
#> 7 (Intercept) -7.52 5.48 -1.37 1.80e- 1
#> 8 drat 7.68 1.51 5.10 1.78e- 5
#> 9 (Intercept) 14.6 1.58 9.22 2.93e-10
#> 10 drat -2.34 0.436 -5.37 8.24e- 6
#> # ... with 14 more rows
map_df(result$model, glance)
#> # A tibble: 12 x 12
#> r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 0.602 0.589 3.86 45.5 1.79e- 7 1 -87.6 181. 186.
#> 2 0.693 0.683 1.01 67.7 3.48e- 9 1 -44.6 95.1 99.5
#> 3 0.626 0.613 77.1 50.1 7.14e- 8 1 -183. 373. 377.
#> 4 0.464 0.446 4.49 26.0 1.78e- 5 1 -92.4 191. 195.
#> 5 0.490 0.473 1.30 28.8 8.24e- 6 1 -52.7 111. 116.
#> 6 0.504 0.488 88.7 30.5 5.28e- 6 1 -188. 382. 386.
#> 7 0.753 0.745 3.05 91.4 1.29e-10 1 -80.0 166. 170.
#> 8 0.612 0.599 1.13 47.4 1.22e- 7 1 -48.3 103. 107.
#> 9 0.789 0.781 57.9 112. 1.22e-11 1 -174. 355. 359.
#> 10 0.175 0.148 5.56 6.38 1.71e- 2 1 -99.3 205. 209.
#> 11 0.350 0.328 1.46 16.1 3.66e- 4 1 -56.6 119. 124.
#> 12 0.188 0.161 114. 6.95 1.31e- 2 1 -196. 398. 402.
#> # ... with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
由 reprex package (v0.3.0)
于 2020-12-11 创建