如何有效地 运行 R 中的许多逻辑回归并跳过抛出错误的方程式?
How to efficiently run many logistic regressions in R and skip over equations that throw errors?
作为此 的延续,我想一次 运行 许多逻辑回归方程,然后注意一组是否与参考组有显着差异。此解决方案有效,但仅在我不缺少值时才有效。由于我的数据有 100 个方程,它必然会有缺失值,所以与其让这个解决方案在遇到错误时失败,不如我如何对其进行编程以跳过引发错误的实例?
这是一个修改后的数据集,缺少个案:
library(dplyr)
library(lubridate)
library(broom)
test <- tibble(major = as.factor(c(rep(c("undeclared", "computer science", "english"), 2), "undeclared")),
time = ymd(c(rep("'2021-01-01", 3), rep("'2020-01-01", 3), rep("'2019-01-01", 1))),
admit = c(500, 1000, 450, 800, 300, 100, 1000),
reject = c(1000, 300, 1000, 210, 100, 900, 1500)) %>%
mutate(total = rowSums(test[ , c("admit", "reject")], na.rm=TRUE)) %>%
mutate(accept_rate = admit/total)
这是在所有情况下都有效的解决方案(参见数据集 ),但是当它遇到缺少情况的 2019 分组时,它会失败:
library(dplyr)
library(lubridate)
library(broom)
library(tidyr)
library(purrr)
test %>%
# create year column
mutate(year = year(time),
major = relevel(major, "undeclared")) %>%
# nest by year
nest(data = -year) %>%
# compute regression
mutate(reg = map(data, ~glm(accept_rate ~ major, data = .,
family = binomial, weights = total, na.action = na.exclude)),
# use broom::tidy to make a tibble out of model object
reg_tidy = map(reg, tidy)) %>%
# get data and regression results back to tibble form
unnest(c(data, reg_tidy)) %>%
filter(term != "(Intercept)") %>%
# create the significant yes/no column
mutate(significant = ifelse(p.value < 0.05, "Yes", "No")) %>%
# remove the unnecessary columns
select(-c(term, estimate, std.error, statistic, p.value, reg))
我也尝试使用自定义函数 包装解决方案,但我也无法让它工作。最后,如果它产生类似的输出并且能够抵抗这些错误,我也愿意接受其他解决方案的想法。
purrr::safely
允许处理错误。为了将 glm
调用包装在 purrr::safely
中,我使用了辅助函数 glm_safe
。 glm_safe
returns 包含两个元素的列表,result
和 error
。
没有错误时,result
包含模型对象,而element
是NULL
。如果出现错误,错误消息将存储在 error
中,并且 result
为 NULL
。
要在您的管道中使用结果,我们必须提取 result
可以通过 transpose(reg)$result
实现的元素。
library(dplyr)
library(lubridate)
library(broom)
library(tidyr)
library(purrr)
test <- tibble(
major = as.factor(c(rep(c("undeclared", "computer science", "english"), 2), "undeclared")),
time = ymd(c(rep("'2021-01-01", 3), rep("'2020-01-01", 3), rep("'2019-01-01", 1))),
admit = c(500, 1000, 450, 800, 300, 100, 1000),
reject = c(1000, 300, 1000, 210, 100, 900, 1500)
)
test <- test %>%
mutate(total = rowSums(test[, c("admit", "reject")], na.rm = TRUE)) %>%
mutate(accept_rate = admit / total)
glm_safe <- purrr::safely(
function(x) {
glm(accept_rate ~ major,
data = x,
family = binomial, weights = total, na.action = na.exclude
)
}
)
test %>%
# create year column
mutate(
year = year(time),
major = relevel(major, "undeclared")
) %>%
# nest by year
nest(data = -year) %>%
# compute regression
mutate(reg = map(data, glm_safe),
reg = transpose(reg)$result) |>
mutate(reg_tidy = map(reg, tidy)) %>%
# get data and regression results back to tibble form
unnest(c(data, reg_tidy)) %>%
filter(term != "(Intercept)") %>%
# create the significant yes/no column
mutate(significant = ifelse(p.value < 0.05, "Yes", "No")) %>%
# remove the unnecessary columns
select(-c(term, estimate, std.error, statistic, p.value, reg))
#> # A tibble: 4 × 8
#> year major time admit reject total accept_rate significant
#> <dbl> <fct> <date> <dbl> <dbl> <dbl> <dbl> <chr>
#> 1 2021 computer science 2021-01-01 1000 300 1300 0.769 Yes
#> 2 2021 english 2021-01-01 450 1000 1450 0.310 No
#> 3 2020 computer science 2020-01-01 300 100 400 0.75 No
#> 4 2020 english 2020-01-01 100 900 1000 0.1 Yes
要忽略错误,请使用此函数:
get_model <- function(df) {
tryCatch(
glm(accept_rate ~ major, data = df, family = binomial, weights = total, na.action = na.exclude),
error = function(e) NULL, warning=function(w) NULL)
}
在你调用的地方使用它 mutate(reg=map()...)
:
# compute regression
mutate(reg = map(data, get_model), reg_tidy = map(reg, tidy))
输出:
# A tibble: 4 x 8
year major time admit reject total accept_rate significant
<dbl> <fct> <date> <dbl> <dbl> <dbl> <dbl> <chr>
1 2021 computer science 2021-01-01 1000 300 1300 0.769 Yes
2 2021 english 2021-01-01 450 1000 1450 0.310 No
3 2020 computer science 2020-01-01 300 100 400 0.75 No
4 2020 english 2020-01-01 100 900 1000 0.1 Yes
作为此
这是一个修改后的数据集,缺少个案:
library(dplyr)
library(lubridate)
library(broom)
test <- tibble(major = as.factor(c(rep(c("undeclared", "computer science", "english"), 2), "undeclared")),
time = ymd(c(rep("'2021-01-01", 3), rep("'2020-01-01", 3), rep("'2019-01-01", 1))),
admit = c(500, 1000, 450, 800, 300, 100, 1000),
reject = c(1000, 300, 1000, 210, 100, 900, 1500)) %>%
mutate(total = rowSums(test[ , c("admit", "reject")], na.rm=TRUE)) %>%
mutate(accept_rate = admit/total)
这是在所有情况下都有效的解决方案(参见数据集
library(dplyr)
library(lubridate)
library(broom)
library(tidyr)
library(purrr)
test %>%
# create year column
mutate(year = year(time),
major = relevel(major, "undeclared")) %>%
# nest by year
nest(data = -year) %>%
# compute regression
mutate(reg = map(data, ~glm(accept_rate ~ major, data = .,
family = binomial, weights = total, na.action = na.exclude)),
# use broom::tidy to make a tibble out of model object
reg_tidy = map(reg, tidy)) %>%
# get data and regression results back to tibble form
unnest(c(data, reg_tidy)) %>%
filter(term != "(Intercept)") %>%
# create the significant yes/no column
mutate(significant = ifelse(p.value < 0.05, "Yes", "No")) %>%
# remove the unnecessary columns
select(-c(term, estimate, std.error, statistic, p.value, reg))
我也尝试使用自定义函数
purrr::safely
允许处理错误。为了将 glm
调用包装在 purrr::safely
中,我使用了辅助函数 glm_safe
。 glm_safe
returns 包含两个元素的列表,result
和 error
。
没有错误时,result
包含模型对象,而element
是NULL
。如果出现错误,错误消息将存储在 error
中,并且 result
为 NULL
。
要在您的管道中使用结果,我们必须提取 result
可以通过 transpose(reg)$result
实现的元素。
library(dplyr)
library(lubridate)
library(broom)
library(tidyr)
library(purrr)
test <- tibble(
major = as.factor(c(rep(c("undeclared", "computer science", "english"), 2), "undeclared")),
time = ymd(c(rep("'2021-01-01", 3), rep("'2020-01-01", 3), rep("'2019-01-01", 1))),
admit = c(500, 1000, 450, 800, 300, 100, 1000),
reject = c(1000, 300, 1000, 210, 100, 900, 1500)
)
test <- test %>%
mutate(total = rowSums(test[, c("admit", "reject")], na.rm = TRUE)) %>%
mutate(accept_rate = admit / total)
glm_safe <- purrr::safely(
function(x) {
glm(accept_rate ~ major,
data = x,
family = binomial, weights = total, na.action = na.exclude
)
}
)
test %>%
# create year column
mutate(
year = year(time),
major = relevel(major, "undeclared")
) %>%
# nest by year
nest(data = -year) %>%
# compute regression
mutate(reg = map(data, glm_safe),
reg = transpose(reg)$result) |>
mutate(reg_tidy = map(reg, tidy)) %>%
# get data and regression results back to tibble form
unnest(c(data, reg_tidy)) %>%
filter(term != "(Intercept)") %>%
# create the significant yes/no column
mutate(significant = ifelse(p.value < 0.05, "Yes", "No")) %>%
# remove the unnecessary columns
select(-c(term, estimate, std.error, statistic, p.value, reg))
#> # A tibble: 4 × 8
#> year major time admit reject total accept_rate significant
#> <dbl> <fct> <date> <dbl> <dbl> <dbl> <dbl> <chr>
#> 1 2021 computer science 2021-01-01 1000 300 1300 0.769 Yes
#> 2 2021 english 2021-01-01 450 1000 1450 0.310 No
#> 3 2020 computer science 2020-01-01 300 100 400 0.75 No
#> 4 2020 english 2020-01-01 100 900 1000 0.1 Yes
要忽略错误,请使用此函数:
get_model <- function(df) {
tryCatch(
glm(accept_rate ~ major, data = df, family = binomial, weights = total, na.action = na.exclude),
error = function(e) NULL, warning=function(w) NULL)
}
在你调用的地方使用它 mutate(reg=map()...)
:
# compute regression
mutate(reg = map(data, get_model), reg_tidy = map(reg, tidy))
输出:
# A tibble: 4 x 8
year major time admit reject total accept_rate significant
<dbl> <fct> <date> <dbl> <dbl> <dbl> <dbl> <chr>
1 2021 computer science 2021-01-01 1000 300 1300 0.769 Yes
2 2021 english 2021-01-01 450 1000 1450 0.310 No
3 2020 computer science 2020-01-01 300 100 400 0.75 No
4 2020 english 2020-01-01 100 900 1000 0.1 Yes