如何有效地运行 R 中的许多逻辑回归并跳过抛出错误的方程式？

Question

作为此的延续，我想一次运行许多逻辑回归方程，然后注意一组是否与参考组有显着差异。此解决方案有效，但仅在我不缺少值时才有效。由于我的数据有 100 个方程，它必然会有缺失值，所以与其让这个解决方案在遇到错误时失败，不如我如何对其进行编程以跳过引发错误的实例？

这是一个修改后的数据集，缺少个案：

library(dplyr)
library(lubridate)
library(broom)

test <- tibble(major = as.factor(c(rep(c("undeclared", "computer science", "english"), 2), "undeclared")),
               time = ymd(c(rep("'2021-01-01", 3), rep("'2020-01-01", 3), rep("'2019-01-01", 1))),
               admit = c(500, 1000, 450, 800, 300, 100, 1000),
               reject = c(1000, 300, 1000, 210, 100, 900, 1500)) %>%
  mutate(total = rowSums(test[ , c("admit", "reject")], na.rm=TRUE)) %>%
  mutate(accept_rate = admit/total)

这是在所有情况下都有效的解决方案（参见数据集），但是当它遇到缺少情况的 2019 分组时，它会失败：

library(dplyr)
library(lubridate)
library(broom)
library(tidyr)
library(purrr)

test %>% 
  # create year column
  mutate(year = year(time), 
         major = relevel(major, "undeclared")) %>% 
  
  # nest by year
  nest(data = -year) %>% 
  
  # compute regression
  mutate(reg = map(data, ~glm(accept_rate ~ major, data = ., 
                              family = binomial, weights = total, na.action = na.exclude)), 
         
         # use broom::tidy to make a tibble out of model object
         reg_tidy = map(reg, tidy)) %>% 
  
  # get data and regression results back to tibble form
  unnest(c(data, reg_tidy)) %>% 
  filter(term != "(Intercept)") %>%
  
  # create the significant yes/no column
  mutate(significant = ifelse(p.value < 0.05, "Yes", "No")) %>% 
  
  # remove the unnecessary columns
  select(-c(term, estimate, std.error, statistic, p.value, reg))

我也尝试使用自定义函数包装解决方案，但我也无法让它工作。最后，如果它产生类似的输出并且能够抵抗这些错误，我也愿意接受其他解决方案的想法。

Answer 1

purrr::safely 允许处理错误。为了将 glm 调用包装在 purrr::safely 中，我使用了辅助函数 glm_safe。 glm_safe returns 包含两个元素的列表，result 和 error。

没有错误时，result包含模型对象，而element是NULL。如果出现错误，错误消息将存储在 error 中，并且 result 为 NULL。

要在您的管道中使用结果，我们必须提取 result 可以通过 transpose(reg)$result 实现的元素。

library(dplyr)
library(lubridate)
library(broom)
library(tidyr)
library(purrr)

test <- tibble(
  major = as.factor(c(rep(c("undeclared", "computer science", "english"), 2), "undeclared")),
  time = ymd(c(rep("'2021-01-01", 3), rep("'2020-01-01", 3), rep("'2019-01-01", 1))),
  admit = c(500, 1000, 450, 800, 300, 100, 1000),
  reject = c(1000, 300, 1000, 210, 100, 900, 1500)
)

test <- test %>%
  mutate(total = rowSums(test[, c("admit", "reject")], na.rm = TRUE)) %>%
  mutate(accept_rate = admit / total)

glm_safe <- purrr::safely(
  function(x) {
    glm(accept_rate ~ major,
      data = x,
      family = binomial, weights = total, na.action = na.exclude
    )
  }
)

test %>%
  # create year column
  mutate(
    year = year(time),
    major = relevel(major, "undeclared")
  ) %>%
  # nest by year
  nest(data = -year) %>%
  # compute regression
  mutate(reg = map(data, glm_safe),
         reg = transpose(reg)$result) |> 
  mutate(reg_tidy = map(reg, tidy)) %>%
  # get data and regression results back to tibble form
  unnest(c(data, reg_tidy)) %>%
  filter(term != "(Intercept)") %>%
  # create the significant yes/no column
  mutate(significant = ifelse(p.value < 0.05, "Yes", "No")) %>%
  # remove the unnecessary columns
  select(-c(term, estimate, std.error, statistic, p.value, reg))
#> # A tibble: 4 × 8
#>    year major            time       admit reject total accept_rate significant
#>   <dbl> <fct>            <date>     <dbl>  <dbl> <dbl>       <dbl> <chr>      
#> 1  2021 computer science 2021-01-01  1000    300  1300       0.769 Yes        
#> 2  2021 english          2021-01-01   450   1000  1450       0.310 No         
#> 3  2020 computer science 2020-01-01   300    100   400       0.75  No         
#> 4  2020 english          2020-01-01   100    900  1000       0.1   Yes

Answer 2

要忽略错误，请使用此函数：

get_model <- function(df) {
  tryCatch(
    glm(accept_rate ~ major, data = df, family = binomial, weights = total, na.action = na.exclude),
    error = function(e) NULL, warning=function(w) NULL)
}

在你调用的地方使用它 mutate(reg=map()...):

  # compute regression
  mutate(reg = map(data, get_model), reg_tidy = map(reg, tidy))

输出：

# A tibble: 4 x 8
   year major            time       admit reject total accept_rate significant
  <dbl> <fct>            <date>     <dbl>  <dbl> <dbl>       <dbl> <chr>      
1  2021 computer science 2021-01-01  1000    300  1300       0.769 Yes        
2  2021 english          2021-01-01   450   1000  1450       0.310 No         
3  2020 computer science 2020-01-01   300    100   400       0.75  No         
4  2020 english          2020-01-01   100    900  1000       0.1   Yes

如何有效地 运行 R 中的许多逻辑回归并跳过抛出错误的方程式？

How to efficiently run many logistic regressions in R and skip over equations that throw errors?

r

glm

tidyr

broom

purrr

如何有效地运行 R 中的许多逻辑回归并跳过抛出错误的方程式？