How to use fct_lump() to get the top n levels by group and put the rest in 'other'?

Question

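I am trying to find the top 3 factor levels within each group, based on an aggregating variable, and lump the remaining factor levels into "other" for each group. Normally I would use fct_lump_n for this, but I can't figure out how to make it work within each group. Here is an example where I want to form groups based on the x variable, order the y values by the value of z, pick the first 3 y values, and then lump the rest of y into "other":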

library(tidyverse)

set.seed(50)
df <- tibble(x = factor(sample(letters[18:20], 100, replace = TRUE)),
             y = factor(sample(letters[1:10], 100, replace = TRUE)),
             z = sample(100, 100, replace = TRUE))

I tried this:

df %>%
  group_by(x) %>%
  arrange(desc(z), .by_group = TRUE) %>%
  slice_head(n = 3)

which returns this:

# A tibble: 9 x 3
# Groups:   x [3]
  x     y         z
  <fct> <fct> <int>
1 r     i        95
2 r     c        92
3 r     a        88
4 s     g        94
5 s     g        92
6 s     f        92
7 t     j       100
8 t     d        93
9 t     i        81

This is basically what I want, but I am missing an 'other' row within each of r, s, and t that collects the z values not yet counted.

Can I use fct_lump_n for this? Or slice_head combined with lumping the excluded rows into "other"?

Answer 1

Tried with R 4.0.0 and tidyverse 1.3.0:

library(tidyverse)

set.seed(50)
df <- tibble(x = factor(sample(letters[18:20], 100, replace = TRUE)),
             y = factor(sample(letters[1:10], 100, replace = TRUE)),
             z = sample(100, 100, replace = TRUE))

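df %>%
  group_by(x) %>%
  arrange(desc(z)) %>%
  # a = rank of each row within its x group, largest z first
  mutate(a = row_number(-z)) %>%
  # rows ranked below the top 3 are relabelled "Other" in both y and a
  mutate(y = case_when(a > 3 ~ "Other", TRUE ~ as.character(y))) %>%
  mutate(a = case_when(a > 3 ~ "Other", TRUE ~ as.character(a))) %>%
  # grouping by a as well keeps repeated y values in the top 3 on separate rows
  group_by(x, y, a) %>%
  summarize(z = sum(z)) %>%
  arrange(x, a) %>%
  select(-a)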

Output:

# A tibble: 12 x 3
# Groups:   x, y [11]
   x     y         z
   <fct> <chr> <int>
 1 r     b        92
 2 r     j        89
 3 r     g        83
 4 r     Other   749
 5 s     i        93
 6 s     h        93
 7 s     i        84
 8 s     Other  1583
 9 t     a        99
10 t     b        98
11 t     i        95
12 t     Other  1508

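Note: the variable a is used alongside y to compensate for the fact that y is sampled with replacement (see rows 5 and 7 of the output). If I did not use a, rows 5 and 7 of the output would have their z values summed into one row. Also note that I tried to solve the problem as posed, but I left y as character because I assume those "Other" values are not meant to be one and the same factor level.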
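Since the question asks specifically about fct_lump_n, here is a minimal sketch of a per-group variant. It rests on a different reading of the question than the code above: it keeps the 3 levels of y with the largest total z within each x, rather than the 3 individual rows with the largest z. The name lumped and the choice of ties.method = "first" (so that ties cannot keep more than 3 levels) are illustrative assumptions, not part of the original answer.

```r
library(dplyr)
library(forcats)

set.seed(50)
df <- tibble(x = factor(sample(letters[18:20], 100, replace = TRUE)),
             y = factor(sample(letters[1:10], 100, replace = TRUE)),
             z = sample(100, 100, replace = TRUE))

lumped <- df %>%
  group_by(x) %>%
  # Within each x, keep the 3 levels of y with the largest summed z (w = z)
  # and relabel every other level as "Other".
  # as.character() avoids combining factors whose levels differ between groups.
  mutate(y = as.character(fct_lump_n(y, n = 3, w = z,
                                     other_level = "Other",
                                     ties.method = "first"))) %>%
  # One row per kept level plus one "Other" row in each x group
  group_by(x, y) %>%
  summarise(z = sum(z)) %>%
  ungroup()

lumped
```

Because the weights are summed per level, repeated y values inside a group collapse into a single row here, which is the opposite of what the helper column a achieves in the code above. Which behaviour is wanted depends on whether "top 3" refers to rows or to factor levels.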
