tbl_uvregression (gtsummary) 中许多结果的单变量回归自动化,return 格式化结果在 R 中

Automate univariate regressions for many outcomes in tbl_uvregression (gtsummary), return formatted results in R

我想使用 tbl_uvregression 函数(gtsummary 包,R),因为它可以创建保持协变量或结果常数的单变量回归模型。

在我的例子中,对于每个结果,我需要一个格式良好的 table 单变量回归结果,其中包含数据框中的每个变量,除了结果 variable.This 如果我将我的数据框子集化为仅包含一个结果和感兴趣的协变量,然后将其传递给 tbl_uvregression 函数。

但是,我需要帮助来弄清楚如何自动执行此过程,因为我有很多结果变量,并且对于每个结果变量,我想使用同一组协变量生成一个 table 单变量回归 -但不包括其他结果变量 - 并标记 tables 以便跟踪哪个 table 属于哪个结果变量。

我该怎么做?

# Libraries
library(gtsummary)
library(tidyverse)

# Data as well as a few artificial variables
data("iris")
my_iris <- as.data.frame(iris)

my_iris$out1 <- sample(c(0,1), 150, replace = TRUE)
my_iris$out2 <- sample(c(0,1), 150, replace = TRUE)
my_iris$out3 <- sample(c(0,1), 150, replace = TRUE)

# Extra variables below to simulate that the dataframe has extra covariates, 
# hence need to select those of interest.
my_iris$x1 <- sample(c(1:12), 150, replace = TRUE)
my_iris$x2 <- sample(c(50:100), 150, replace = TRUE)
my_iris$x3 <- sample(c(18:100), 150, replace = TRUE)


# List of outcome(*outcome*) and predictor(*preds*) variables I need to run univariate logistic regressions for.
outcome <- c("out1", "out2", "out3") # have a long list, but this is sufficient for demo
preds <- c("Species", "Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width") # same here


# To produce a nicely formatted table for a single outcome I can do:
my_iris %>% 
    dplyr::select(outcome[1], all_of(preds)) %>% 
    tbl_uvregression(method = glm,
                     y = outcome[1],
                     method.args = list(family = binomial),
                     exponentiate = TRUE) %>%
    bold_labels() %>% modify_caption(paste("Univariate Regression Model with", outcome[1], "as Outcome", sep = " "))

# How to automate production of above table for multiple outcomes?


我会使用 lapply 循环遍历这样的结果:

library(gtsummary)
library(tidyverse)

# Data as well as a few artificial variables
data("iris")
my_iris <- as.data.frame(iris)

my_iris$out1 <- sample(c(0,1), 150, replace = TRUE)
my_iris$out2 <- sample(c(0,1), 150, replace = TRUE)
my_iris$out3 <- sample(c(0,1), 150, replace = TRUE)

# Extra variables below to simulate that the dataframe has extra covariates, 
# hence need to select those of interest.
my_iris$x1 <- sample(c(1:12), 150, replace = TRUE)
my_iris$x2 <- sample(c(50:100), 150, replace = TRUE)
my_iris$x3 <- sample(c(18:100), 150, replace = TRUE)


# List of outcome(*outcome*) and predictor(*preds*) variables I need to run univariate logistic regressions for.
outcome <- c("out1", "out2", "out3") # have a long list, but this is sufficient for demo
preds <- c("Species", "Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width") # same here


# To produce a nicely formatted table for a single outcome I can do:
lapply(outcome, function(x){ 
my_iris %>% 
  dplyr::select(!!x, all_of(preds)) %>% 
  tbl_uvregression(method = glm,
                   y = !!x,
                   method.args = list(family = binomial),
                   exponentiate = TRUE) %>%
  bold_labels() %>% modify_caption(paste("Univariate Regression Model with", x, "as Outcome", sep = " "))
})

我建议使用几乎相同的方法(for 循环),而不使用 函数 并返回存储 tbl_uvregression 对象的列表。 两种方法的性能非常相似。

library(gtsummary)
library(microbenchmark)


### Alternative method using a for loop and a list to store tables ####



  mod.list<-list()
  loop.list<-list()
  
  for (i in seq_along(outcome)){
      
     loop.list[[i]]<-microbenchmark::microbenchmark( loop=                              
    mod.list[[i]]<-my_iris %>% 
      tbl_uvregression(method = glm,
                       y = substitute(i,list(i=as.name(outcome[i]))),
                       include=all_of(preds),
                       method.args = list(family = binomial),
                       exponentiate = TRUE) %>%
      bold_labels() %>% 
      modify_caption(paste("Univariate Regression Model with", outcome[i], "as Outcome", sep = " ")),
    times=5,unit='s')
  }
  loop.list
  
  
[[1]]
Unit: seconds
 expr      min      lq     mean   median       uq      max neval
 loop 5.323629 5.40431 6.014162 5.519318 5.642695 8.180858     5

[[2]]
Unit: seconds
 expr      min       lq     mean   median       uq      max neval
 loop 5.717601 5.848664 6.077736 6.056062 6.265348 6.501003     5

[[3]]
Unit: seconds
 expr      min       lq     mean  median       uq     max neval
 loop 6.034187 6.038607 6.180724 6.04633 6.257016 6.52748     5

注意:当然可以删除存储微基准测试结果的脚本。



### Kevin´s method: using lapply ####

  
  lapply(outcome, function(x){ 
    microbenchmark::microbenchmark(loop=
                                     my_iris %>% 
                                     dplyr::select(!!x, all_of(preds)) %>% 
                                     tbl_uvregression(method = glm,
                                                      y = !!x,
                                                      method.args = list(family = binomial),
                                                      exponentiate = TRUE) %>%
                                     bold_labels() %>% modify_caption(paste("Univariate Regression Model with", x, "as Outcome", sep = " ")),
                                   times=5,unit='s')
  }
  )

[[1]]
Unit: seconds
 expr      min       lq     mean   median       uq      max neval
 loop 5.389512 6.727198 7.112398 7.076198 7.845806 8.523274     5

[[2]]
Unit: seconds
 expr     min       lq     mean   median       uq      max neval
 loop 6.43759 6.444641 6.511864 6.467782 6.523884 6.685421     5

[[3]]
Unit: seconds
 expr      min       lq     mean   median       uq      max neval
 loop 6.469109 6.562612 6.907325 6.764478 6.856771 7.883655     5

警告: 列表中的 tbl_uvregression 表格稍后可以自定义是很棒的,但也有缺点。 tbl_uvregression 对象相当大(即使在使用 tbl_butcher 之后),所以如果您有很多预测变量(即 52)和许多结果(即 21),该过程可能需要 40 分钟(每个模型大约两分钟,考虑到几乎准备好发布表,实际上并没有太多)并且列表太大而无法将它们保留在工作区图像中。老实说,我正在使用 survey::svyglm,但我不认为这是花了这么长时间的原因。

因此,如果您 运行 上面的代码,正如通常建议的那样(只有源代码是真实的,感谢 Julia Silge answering to Problems saving workspace in R),它渲染 rmd 文件需要花费太多时间(至少在配备 4core i7 1.8GHz 处理器的笔记本电脑中),但是如果要存储列表,这是不可能的,因为它是 1.2GB 大小(至少在配备 16GB 处理器的笔记本电脑中)内存)。

我还没有找到加快这个过程的方法,但在我看来 gtsummary::tbl_uvregression 仍然值得这样做。