Using group_by and summarise_all to create dummy indicators for a categorical variable

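I want to generate dummy indicators for each id for a given categorical variable, fruit. When I use summarise_all with a custom function, I get the warning shown below. I also tried summarise_all(any), which warned me about coercing double to logical. Is there an efficient or more up-to-date way to do this? Many thanks!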

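library(dplyr)  # provides %>%, bind_cols(), select(), group_by(), summarise_all()

fruit = c("apple", "banana", "orange", "pear",
          "strawberry", "blueberry", "durian",
          "grape", "pineapple")
# 14 rows: ids "a", "b", "c" with 3, 5 and 6 randomly sampled fruits
df_sample = data.frame(id = c(rep("a", 3), rep("b", 5), rep("c", 6)),
                       fruit = c(sample(fruit, replace = TRUE, size = 3),
                                 sample(fruit, replace = TRUE, size = 5),
                                 sample(fruit, replace = TRUE, size = 6)))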

fruit_indicator = 
  model.matrix(~ -1 + fruit, df_sample) %>%
  as.data.frame() %>%
  bind_cols(df_sample) %>%
  select(-fruit) %>%
  group_by(id) %>%
  summarise_all(funs(ifelse(any(. > 0), 1, 0)))


# Warning message:
#   `funs()` is deprecated as of dplyr 0.8.0.
# Please use a list of either functions or lambdas: 
#   
#   # Simple named list: 
#   list(mean = mean, median = median)
# 
#   # Auto named with `tibble::lst()`: 
#   tibble::lst(mean, median)
# 
#   # Using lambdas
#   list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))

You can use across, which is available in dplyr 1.0.0 or later.

library(dplyr)

model.matrix(~ -1 + fruit, df_sample) %>%
  as.data.frame() %>%
  bind_cols(df_sample) %>%
  select(-fruit) %>%
  group_by(id) %>%
  summarise(across(everything(), ~ as.integer(any(.x > 0))))

#  id    fruitapple fruitbanana fruitdurian fruitgrape fruitpear
#* <chr>      <int>       <int>       <int>      <int>     <int>
#1 a              0           1           1          0         1
#2 b              1           0           0          1         0
#3 c              0           1           0          1         1
# … with 1 more variable: fruitpineapple <int>
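
Because df_sample is built with sample(), the output above changes from run to run. The following is a minimal sketch of the same across() approach on a small fixed data frame, so the result can be checked by hand. The names df_fixed, indicator and counts are hypothetical and not part of the original question; the cross-check with base R's table() is only there to confirm the indicators.

```r
library(dplyr)

# Hypothetical, deterministic sample data (no random sampling)
df_fixed <- data.frame(
  id    = c("a", "a", "a", "b", "b", "c"),
  fruit = c("apple", "pear", "apple", "banana", "pear", "grape")
)

# One-hot encode fruit, then collapse to one row per id:
# a column is 1 if the id has that fruit at least once, else 0
indicator <- model.matrix(~ -1 + fruit, df_fixed) %>%
  as.data.frame() %>%
  bind_cols(df_fixed) %>%
  select(-fruit) %>%
  group_by(id) %>%
  summarise(across(everything(), ~ as.integer(any(.x > 0))))

# Cross-check: table() counts each id/fruit pair; a count > 0 means "present"
counts <- table(df_fixed$id, df_fixed$fruit)
same_as_table <- all(as.matrix(indicator[, -1]) == (unclass(counts) > 0))
```

If you would rather keep summarise_all, replacing funs(...) with a formula lambda such as summarise_all(~ as.integer(any(. > 0))) should also avoid the deprecation warning, although the scoped verbs are superseded by across().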