How to get multiple predictions rather than a focal prediction from a multinomial regression model (i.e., split by a factor variable)

I want to run a multinomial regression to get the mean frequency of each response option, split by a factor (gender: male/female).

Background

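I want to compare 4 types of cheese to gauge how popular each one is. There are 4 options: cheddar, mozzarella, gouda, and brie. I went out and asked 200 people which cheese they like best. Each person could choose only one of the 4 types. I also collected some demographic information: gender, age, and weight.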

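Having collected the data, I want to see how popular each cheese is (summing to 100%). Since I want to control for gender, age, and weight, I figured multinomial regression is appropriate here.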

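But I would also like to know how the results differ between males and females, so I want to include gender as a factor in my model. How can I generate a double prediction from my (multinomial) model, so that I get predicted values for females and males separately and can compare the two gender levels?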

Data

library(truncnorm)
library(tidyverse)

set.seed(999)

cheese_df <-
  tibble(
    age = round(rtruncnorm(
      n = 200,
      a = 20,
      b = 80,
      mean = 25,
      sd = 25.09
    )),
    cheese_response = as_factor(sample(
      c("cheddar", "mozzarella", "gouda", "brie"),
      size = 200,
      replace = TRUE
    )),
    gender = sample(c(0, 1), size = 200, replace = TRUE),
    weight = rtruncnorm(
      n = 200,
      a = 40,
      b = 120,
      mean = 70,
      sd = 25.09
    )
  )


> cheese_df

## # A tibble: 200 x 4
##      age cheese_response gender weight
##    <dbl> <fct>            <dbl>  <dbl>
##  1    45 cheddar              0   62.2
##  2    32 cheddar              0   45.0
##  3    58 cheddar              1   87.6
##  4    28 brie                 0   68.8
##  5    49 gouda                0   88.2
##  6    29 brie                 1   74.5
##  7    49 cheddar              0   74.0
##  8    27 gouda                1   90.3
##  9    28 brie                 0   56.5
## 10    48 mozzarella           0   72.9
## # ... with 190 more rows

If I only wanted to run a multinomial regression and control for age, gender, and weight, **without splitting by gender**, I could do this:

library(nnet)
library(effects)


fit <- nnet::multinom(cheese_response ~ age + gender + weight, data = cheese_df)

average_person_for_control <-
  c(
    age = 50,
    gender = 0.5,
    weight = 75
  )

prediction <-
  effects::Effect("age",
                  fit,
                  given.values = average_person_for_control,
                  xlevels = list(age =
                                   c(45, 90)))


proportions_for_plot <-
  data.frame(prediction$prob, prediction$lower.prob, prediction$upper.prob) %>% 
  slice(1) %>%
  pivot_longer(., cols = everything(), 
               names_to = c(".value", "response"), 
               names_pattern = "(.*)\\.(.*$)") %>%
  rename("lower_ci" = "L.prob",
         "upper_ci" = "U.prob",
         "estimate" = "prob")


ggplot(proportions_for_plot, aes(x = reorder(response, -estimate), y = estimate)) +
  geom_bar(stat = "identity", width = 0.7, fill = "darkgreen") +
  geom_errorbar(aes(ymin = lower_ci, ymax = upper_ci),
                width = 0.2) +
  geom_text(aes(label = paste0(100*round(estimate,2), "%")),
            vjust = 1.6, 
            color = "white", size = 3) +
  xlab("cheese type") +
  ylab("proportion of people choosing this type")

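However, I'm interested in generating the same bar plot, except with the bars split into males and females.

This is the kind of plot I'd like to get

(ignore the values in this demo)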

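One way would be to subset the data by gender, run the same model on each subset, generate two bar plots, and combine them. However, I'd like to incorporate gender into the model as a factor and only then output the split bar plot. This is already partly handled, because gender is part of the model: fit <- nnet::multinom(cheese_response ~ age + gender + weight, data = cheese_df).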

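Still, I'm stuck on splitting the predictions by gender so they can be compared side by side in a bar plot. This is because effects::Effect() accepts only a single vector in its given.values argument. Otherwise, I would have done something like the following to feed the predictions (as I would when using predict):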

control_by_gender <-
  expand.grid(
    age = 50,
    weight = 75,
    gender = c(0, 1)
  )

> control_by_gender

##   age weight gender
## 1  50     75      0
## 2  50     75      1
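
For reference, here is a minimal sketch of the predict() route mentioned above. It assumes a simulated stand-in data set (demo_df and demo_fit are hypothetical names, built with base R so the snippet runs on its own). predict() on a multinom fit accepts a multi-row newdata and returns one row of probabilities per row, but it does not return the confidence intervals needed for the error bars.

```r
library(nnet)

set.seed(999)

# Hypothetical stand-in for cheese_df, simulated with base R only
demo_df <- data.frame(
  age = round(runif(200, min = 20, max = 80)),
  weight = runif(200, min = 40, max = 120),
  gender = sample(c(0, 1), size = 200, replace = TRUE),
  cheese_response = factor(sample(
    c("cheddar", "mozzarella", "gouda", "brie"),
    size = 200,
    replace = TRUE
  ))
)

# Same model formula as in the question; trace = FALSE silences the fit log
demo_fit <- multinom(cheese_response ~ age + gender + weight,
                     data = demo_df, trace = FALSE)

control_by_gender <- expand.grid(age = 50, weight = 75, gender = c(0, 1))

# type = "probs" gives a matrix: one row per newdata row, one column per cheese
probs_by_gender <- cbind(
  control_by_gender,
  predict(demo_fit, newdata = control_by_gender, type = "probs")
)

probs_by_gender
```

Each row holds the predicted share of every cheese for one gender level, so the four probability columns sum to 1 within a row.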

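Any idea how to get such multiple (rather than focal) predictions when working with a multinomial model object like the one above? My end goal is a bar plot split by gender, like the demo above. I've been using effects::Effect() to generate predictions, but I'm open to any alternative that can do the multiple-prediction trick.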

Why not just lapply over the levels in the effects::Effect call:


prediction <- do.call(rbind,lapply(0:1, function(x) {
    eff <- effects::Effect("age",
                  fit,
                  given.values =c(age = 50,
                        weight = 75,
                        gender = x),
                  xlevels = list(age =c(45, 90)))
    data.frame(level=x, eff$prob, eff$lower.prob, eff$upper.prob) %>% slice(1)
    }))


proportions_for_plot <-
  prediction %>% 
  pivot_longer(., cols = !level, 
               names_to = c(".value", "response"), 
               names_pattern = "(.*)\\.(.*$)") %>%
  rename("lower_ci" = "L.prob",
         "upper_ci" = "U.prob",
         "estimate" = "prob")


ggplot(proportions_for_plot, aes(x = as.factor(response), y = estimate, fill=factor(level))) +
  geom_bar(stat = "identity", width = 0.7,position="dodge") +
  geom_errorbar(aes(ymin = lower_ci, ymax = upper_ci), position=position_dodge(.9),
                width = 0.2) +
  geom_text(aes(label = paste0(100*round(estimate,2), "%")),
            vjust = 1.6, 
            color = "white", size = 3, position=position_dodge(.9)) +
  xlab("cheese type") +
  ylab("proportion of people choosing this type")

This answer uses the same intuition as @Abdessabour Mtk's, just with purrr::map and some refactoring:

make_eff_df <- function(gender, fit) {
  Effect("age", fit, xlevels = list(age = c(45, 90)),
         given.values = c(age = 50, weight = 75, gender = gender)) %>%
    as_tibble() %>%
    mutate(gender = gender) %>%
    select(gender, matches("[a-z\\.]?prob")) %>%
    slice(1)
}


map_dfr(0:1, make_eff_df, fit) %>% 
  pivot_longer(-gender, names_to = c(".value", "response"), 
               names_pattern = "(.+)\\.(.+$)") %>%
  rename(lower_ci = "L.prob", upper_ci = "U.prob", estimate = "prob") %>%
  mutate(across(1:2, as.factor)) %>%
  ggplot(aes(x = reorder(response, -estimate), y = estimate, fill = gender)) +
  geom_bar(stat = "identity", width = 0.7, position = position_dodge(.9)) +
  geom_errorbar(aes(ymin = lower_ci, ymax = upper_ci), 
                position = position_dodge(.9),
                width = 0.2) +
  geom_text(aes(label = scales::percent(estimate, accuracy = 1)),
            vjust = 1.6, color = "white", size = 3, position=position_dodge(.9)) +
  labs(x = "cheese type",
       y = "proportion of people choosing this type")