R lead lag function summarize within group and calculate percent

This is what my data frame looks like.

Here is its dput() output:

structure(list(tier_1 = c("Organic Search", "Organic Search", 
"Organic Search", "Organic Search", "Organic Search", "Organic Search", 
"Organic Search", "Organic Search", "Organic Search", "Organic Search", 
"Organic Social", "Organic Social", "Organic Social", "Organic Social", 
"Organic Social", "Organic Social", "Organic Social", "Paid Search", 
"Paid Search", "Paid Search", "Paid Search", "Paid Search", "Paid Search", 
"Paid Search", "Paid Search", "Paid Search", "Paid Social", "Paid Social", 
"Paid Social", "Paid Social", "Paid Social", "Paid Social", "Paid Social", 
"Paid Social", "Paid Social"), sequence_number = c(1L, 2L, 3L, 
4L, 5L, 6L, 7L, 8L, 9L, 10L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 1L, 
2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 
9L), count_of_sequence_numbers = c(1176L, 460L, 119L, 41L, 21L, 
5L, 8L, 6L, 2L, 1L, 133L, 52L, 11L, 2L, 2L, 1L, 1L, 7516L, 1090L, 
284L, 90L, 36L, 21L, 12L, 6L, 2L, 1979L, 674L, 99L, 30L, 11L, 
2L, 3L, 2L, 1L), percent = c(0.637744034707158, 0.249457700650759, 
0.0645336225596529, 0.022234273318872, 0.0113882863340564, 0.0027114967462039, 
0.00433839479392625, 0.00325379609544469, 0.00108459869848156, 
0.000542299349240781, 0.655172413793103, 0.25615763546798, 0.0541871921182266, 
0.00985221674876847, 0.00985221674876847, 0.00492610837438424, 
0.00492610837438424, 0.827662151745402, 0.120030833608633, 0.0312740887567449, 
0.00991080277502478, 0.00396432111000991, 0.00231252064750578, 
0.0013214403700033, 0.000660720185001652, 0.000220240061667217, 
0.704019921736037, 0.23977232301672, 0.0352187833511206, 0.0106723585912487, 
0.00391319815012451, 0.000711490572749911, 0.00106723585912487, 
0.000711490572749911, 0.000355745286374956)), row.names = c(NA, 
-35L), groups = structure(list(tier_1 = c("Organic Search", "Organic Social", 
"Paid Search", "Paid Social"), .rows = structure(list(1:10, 11:17, 
    18:26, 27:35), ptype = integer(0), class = c("vctrs_list_of", 
"vctrs_vctr", "list"))), row.names = c(NA, -4L), class = c("tbl_df", 
"tbl", "data.frame"), .drop = TRUE), class = c("grouped_df", 
"tbl_df", "tbl", "data.frame"))

df <- df %>% 
  group_by(tier_1, sequence_number) %>%
  summarize(count_of_sequence_numbers = length(sequence_number)) %>%
  mutate(percent = count_of_sequence_numbers / sum(count_of_sequence_numbers)) %>%
  filter(sequence_number <= 10)

I was able to produce the percent column with the code above, specifically the count / sum(count) part.

However, I do have a problem: the percent is not correct. For sequence_number = 2, the value in count_of_sequence_numbers should be subtracted from the value in count_of_sequence_numbers for sequence_number = 1 (within the same category). For sequence_number = 3, the value in count_of_sequence_numbers should be subtracted from count_of_sequence_numbers at both sequence_number = 1 and sequence_number = 2.

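What I mean is that I really need a count per sequence number that excludes the higher ones: for sequence_number = 1 it excludes 2-10, for 2 it excludes 3-10, and so on. The 1176 value should actually be 1176 - 460 - 119 - 41 - 21 - 5 - 8 - 6 - 2 - 1, and the 460 value should be 460 - 119 - 41 - 21 - 5 - 8 - 6 - 2 - 1. The percent should then be calculated from those adjusted counts.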
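To make the arithmetic above concrete, here is a minimal base R sketch for the Organic Search group only. The names counts, later_total and adjusted are illustrative and do not appear in the original code.

```r
# Counts for the "Organic Search" group, ordered by sequence_number 1..10
# (copied from the dput() output above).
counts <- c(1176, 460, 119, 41, 21, 5, 8, 6, 2, 1)

# For each row, the sum of all counts with a larger sequence number:
# a reverse cumulative sum, minus the row's own count.
later_total <- rev(cumsum(rev(counts))) - counts

# Adjusted count = own count minus everything that comes later.
adjusted <- counts - later_total

adjusted[1:2]
# [1] 513 257
```

Note that with this data some adjusted counts come out negative (for example 5 - 17 = -12 at sequence_number = 6), because the later counts do not always add up to less than the current one.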

I tried the lead function, but I don't think it is the right approach. :/ That -1175 number in particular makes me nervous.

df <- df %>%
    group_by(tier_1) %>%
    arrange(tier_1, sequence_number) %>%
    mutate(diff = count_of_sequence_numbers - lead(count_of_sequence_numbers, default = first(count_of_sequence_numbers)))

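If I change it to lead(count_of_sequence_numbers, default = 0), I get better behavior, but it still isn't what I want, which is to subtract from each value the sum of all the other values in the same group that have a larger sequence number.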
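The -1175 can be traced to the default argument. This is a small sketch using the Organic Search counts; diff_first and diff_zero are illustrative names. The last row of a group has no next value, so lead() substitutes the default there, and in either case only the single next row is subtracted rather than all later rows.

```r
library(dplyr)

# Organic Search counts, ordered by sequence_number 1..10.
counts <- c(1176L, 460L, 119L, 41L, 21L, 5L, 8L, 6L, 2L, 1L)

# default = first(counts): the last row is compared with 1176,
# which gives 1 - 1176 = -1175.
diff_first <- counts - lead(counts, default = first(counts))

# default = 0: the last row simply keeps its own count (1 - 0 = 1).
diff_zero <- counts - lead(counts, default = 0)

diff_first[10]
# [1] -1175
diff_zero[10]
# [1] 1
```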

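Is this the output you're looking for? Sorting each group by descending sequence_number lets a lagged cumulative sum give the total of all counts with a larger sequence number.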

df %>%
  arrange(tier_1, -sequence_number) %>%
  group_by(tier_1) %>%   # already grouped this way, only including for clarity
  mutate(cuml = cumsum(lag(count_of_sequence_numbers, default = 0)),
         diff = count_of_sequence_numbers - cuml) %>%
  ungroup()


## A tibble: 35 x 6
#   tier_1         sequence_number count_of_sequence_numbers  percent  cuml  diff
#   <chr>                    <int>                     <int>    <dbl> <dbl> <dbl>
# 1 Organic Search              10                         1 0.000542     0     1
# 2 Organic Search               9                         2 0.00108      1     1
# 3 Organic Search               8                         6 0.00325      3     3
# 4 Organic Search               7                         8 0.00434      9    -1
# 5 Organic Search               6                         5 0.00271     17   -12
# 6 Organic Search               5                        21 0.0114      22    -1
# 7 Organic Search               4                        41 0.0222      43    -2
# 8 Organic Search               3                       119 0.0645      84    35
# 9 Organic Search               2                       460 0.249      203   257
#10 Organic Search               1                      1176 0.638      663   513
## … with 25 more rows
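The question also asks for a percent based on the adjusted counts. The sketch below is one possible reading, assuming the percent is each row's diff divided by the group's total diff. That definition and the column name percent_adjusted are assumptions, not something the question confirms. It uses a two-group subset of the data so that it runs on its own.

```r
library(dplyr)

# Two of the four groups from the question, rebuilt by hand.
df <- tibble(
  tier_1 = rep(c("Organic Search", "Organic Social"), times = c(10, 7)),
  sequence_number = c(1:10, 1:7),
  count_of_sequence_numbers = c(1176L, 460L, 119L, 41L, 21L, 5L, 8L, 6L, 2L, 1L,
                                133L, 52L, 11L, 2L, 2L, 1L, 1L)
)

result <- df %>%
  arrange(tier_1, desc(sequence_number)) %>%
  group_by(tier_1) %>%
  mutate(
    # running total of all counts with a larger sequence number
    cuml = cumsum(lag(count_of_sequence_numbers, default = 0)),
    diff = count_of_sequence_numbers - cuml,
    # assumption: share of the group's adjusted total
    percent_adjusted = diff / sum(diff)
  ) %>%
  ungroup() %>%
  arrange(tier_1, sequence_number)   # restore ascending order for reading
```

Because some diff values are negative in this data, some of these percentages are negative as well. If that is not acceptable, the definition of the adjusted count probably needs revisiting before computing a percent.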