How to efficiently run many logistic regressions in R and determine which groups are significantly different?
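I have a longitudinal dataset containing admissions data for people with different declared majors. At each time point (2021, 2020, etc.), I want to see whether the acceptance rate of undeclared individuals differs significantly (in either direction) from that of individuals in each declared major.
I will eventually feed these results into a plot that shows an asterisk wherever a group differs significantly, so I am wondering whether there is an efficient way to run these logistic regressions and end up with a column in my dataset that indicates whether each group is significantly different from the undeclared students at the same time point.
To illustrate, here is a test dataset: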
library(dplyr)
library(lubridate)
test <- tibble(major = as.factor(rep(c("undeclared", "computer science", "english"), 3)),
               time = ymd(c(rep("2021-01-01", 3), rep("2020-01-01", 3), rep("2019-01-01", 3))),
               admit = c(500, 1000, 450, 800, 300, 100, 1000, 400, 150),
               reject = c(1000, 300, 1000, 210, 100, 900, 1500, 350, 1200)) %>%
  # total applicants per row, computed from the columns of the tibble being built
  mutate(total = admit + reject,
         accept_rate = admit / total)
This is how I would run each regression by hand (which I would rather not do):
test$major <- relevel(test$major , ref = "undeclared")
just_2021 <- test %>%
filter(time == '2021-01-01')
m_2021 <- glm(accept_rate ~ major, data = just_2021, weights = total, family = binomial)
summary(m_2021) #english not sig diff from undeclared; CS is sig diff from undeclared
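For reference, the same per-year model can also be written with a two-column response of successes and failures, which avoids the separate `weights` argument. The sketch below rebuilds the test data so it runs on its own; the object names `m_counts` and `m_props` are made up for illustration.

```r
library(dplyr)
library(lubridate)

# Same test data as in the question
test <- tibble(
  major  = factor(rep(c("undeclared", "computer science", "english"), 3)),
  time   = ymd(rep(c("2021-01-01", "2020-01-01", "2019-01-01"), each = 3)),
  admit  = c(500, 1000, 450, 800, 300, 100, 1000, 400, 150),
  reject = c(1000, 300, 1000, 210, 100, 900, 1500, 350, 1200)
) %>%
  mutate(total = admit + reject,
         accept_rate = admit / total,
         major = relevel(major, ref = "undeclared"))

just_2021 <- filter(test, time == ymd("2021-01-01"))

# Response given as counts: cbind(successes, failures)
m_counts <- glm(cbind(admit, reject) ~ major,
                data = just_2021, family = binomial)

# Response given as a proportion, weighted by the number of trials
m_props <- glm(accept_rate ~ major,
               data = just_2021, weights = total, family = binomial)

# Both specifications should give the same coefficient estimates
all.equal(coef(m_counts), coef(m_props))
```

Either form compares each declared major against the reference level `undeclared`, so the choice is mostly a matter of which columns are more convenient to carry around.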
Finally, this is what I would like my dataset to look like:
library(dplyr)
library(lubridate)
answer <- tibble(major = as.factor(rep(c("undeclared", "computer science", "english"), 3)),
                 time = ymd(c(rep("2021-01-01", 3), rep("2020-01-01", 3), rep("2019-01-01", 3))),
                 admit = c(500, 1000, 450, 800, 300, 100, 1000, 400, 150),
                 reject = c(1000, 300, 1000, 210, 100, 900, 1500, 350, 1200)) %>%
  # total applicants per row, computed from the columns of the tibble being built
  mutate(total = admit + reject,
         accept_rate = admit / total) %>%
  # one indicator column per year; NA for undeclared rows and for rows from other years
  mutate(dif_than_undeclared_2021 = c(NA_character_, "Yes", "No", rep(NA_character_, 6)),
         dif_than_undeclared_2020 = c(rep(NA_character_, 4), "Yes", "Yes", rep(NA_character_, 3)),
         dif_than_undeclared_2019 = c(rep(NA_character_, 7), "Yes", "Yes"))
answer
I know purrr can help with iteration, but I don't know whether it applies in this case. Any help would be greatly appreciated!
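library(broom)
library(tidyr)
library(dplyr)
library(purrr)      # map()
library(lubridate)  # year()
test %>%
  # create year column and make "undeclared" the reference level
  mutate(year = year(time),
         major = relevel(major, "undeclared")) %>%
  # nest by year: one row per year, with that year's rows in a list column
  nest(data = -year) %>%
  # fit one logistic regression per year
  mutate(reg = map(data, ~glm(accept_rate ~ major, data = .,
                              family = binomial, weights = total)),
         # use broom::tidy to make a tibble out of each model object
         reg_tidy = map(reg, tidy)) %>%
  # get data and regression results back to tibble form;
  # rows are paired by position, which relies on each year having one row
  # per major in the same order as the model terms (undeclared first)
  unnest(c(data, reg_tidy)) %>%
  # the intercept is paired with the undeclared row, so this drops the reference group
  filter(term != "(Intercept)") %>%
  # create the significant yes/no column
  mutate(significant = ifelse(p.value < 0.05, "Yes", "No")) %>%
  # remove the unnecessary columns
  select(-c(term, estimate, std.error, statistic, p.value, reg))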
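Because the pipeline above pairs data rows with model terms by position, a variant that matches them by name may be safer when rows are not sorted the same way in every year. The following is a minimal sketch of that idea, not the only way to do it: it strips the `major` prefix from each term and joins the result back onto the data. The names `results` and `labelled` are made up for illustration, and the counts form `cbind(admit, reject)` is used in place of the weighted proportion.

```r
library(dplyr)
library(tidyr)
library(purrr)
library(broom)
library(lubridate)

# Same test data as in the question
test <- tibble(
  major  = factor(rep(c("undeclared", "computer science", "english"), 3)),
  time   = ymd(rep(c("2021-01-01", "2020-01-01", "2019-01-01"), each = 3)),
  admit  = c(500, 1000, 450, 800, 300, 100, 1000, 400, 150),
  reject = c(1000, 300, 1000, 210, 100, 900, 1500, 350, 1200)
)

# One regression per year, reduced to a year/major/significant lookup table
results <- test %>%
  mutate(year = year(time),
         major = relevel(major, ref = "undeclared")) %>%
  nest(data = -year) %>%
  mutate(reg_tidy = map(data, ~ tidy(glm(cbind(admit, reject) ~ major,
                                         data = .x, family = binomial)))) %>%
  select(year, reg_tidy) %>%
  unnest(reg_tidy) %>%
  filter(term != "(Intercept)") %>%
  # terms look like "majorenglish"; drop the prefix to recover the level name
  mutate(major = sub("^major", "", term),
         significant = if_else(p.value < 0.05, "Yes", "No")) %>%
  select(year, major, significant)

# Join by year and major; undeclared rows have no match and stay NA
labelled <- test %>%
  mutate(year = year(time),
         major = as.character(major)) %>%
  left_join(results, by = c("year", "major"))

labelled
```

This keeps the undeclared rows in the output (with `NA` in `significant`), which is closer to the layout requested in the question and convenient when the table is later used to place asterisks on a plot.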