如何创建按 2 个不同列分组的累计和

How to create a cumulative sum that is grouped by 2 different columns

假设我有以下数据框,初始化为 df:

ID     date     value
A      01/2012  1
A      03/2012  2
A      05/2012  4
A      07/2012  3
A      09/2012  7
A      11/2012  1
A      01/2013  2
A      03/2013  8
A      05/2013  13
A      07/2013  2
A      09/2013  5
A      11/2013  2
B      01/2012  3
B      03/2012  9
B      05/2012  1
B      07/2012  0
B      09/2012  12
B      11/2012  3
B      01/2013  1
B      03/2013  4
B      05/2013  3
B      07/2013  3
B      09/2013  1
B      11/2013  1

其中日期变量的格式为 mm/yyyy。我正在寻找一种在此数据框中创建附加列的方法,该列提供按 ID 和年份分组的值列的累计总和。例如,下面的输出是我想要的:

ID     date     value  cumsum
A      01/2012  1      1
A      03/2012  2      3
A      05/2012  4      7
A      07/2012  3      10
A      09/2012  7      17
A      11/2012  1      18
A      01/2013  2      2
A      03/2013  8      10
A      05/2013  13     23
A      07/2013  2      25
A      09/2013  5      30
A      11/2013  2      32
B      01/2012  3      3
B      03/2012  9      12
B      05/2012  1      13
B      07/2012  0      13
B      09/2012  12     25
B      11/2012  3      28
B      01/2013  1      1
B      03/2013  4      5
B      05/2013  3      8
B      07/2013  3      11
B      09/2013  1      12
B      11/2013  1      13

如您所见,每年都会重置总和,每个 ID 也会重置。本质上,我不确定如何创建一个由 2 列而不是 1 列分组的累计和。任何帮助将不胜感激

你可以这样做:

df %>%
  group_by(ID, year = substr(date, 4,8)) %>%
  mutate(cumsum = cumsum(value))

# A tibble: 24 x 5
# Groups:   ID, year [4]
   ID    date    value year  cumsum
   <chr> <chr>   <int> <chr>  <int>
 1 A     01/2012     1 2012       1
 2 A     03/2012     2 2012       3
 3 A     05/2012     4 2012       7
 4 A     07/2012     3 2012      10
 5 A     09/2012     7 2012      17
 6 A     11/2012     1 2012      18
 7 A     01/2013     2 2013       2
 8 A     03/2013     8 2013      10
 9 A     05/2013    13 2013      23
10 A     07/2013     2 2013      25

我们转换为 Date class 并提取 year 部分以创建分组列,然后再进行累加和

library(dplyr)
library(lubridate)
df1 %>% 
    group_by(ID, year = year(my(date))) %>%
     mutate(cumsum = cumsum(value)) %>% 
     ungroup %>% 
     select(-year)
# A tibble: 24 x 4
   ID    date    value cumsum
   <chr> <chr>   <int>  <int>
 1 A     01/2012     1      1
 2 A     03/2012     2      3
 3 A     05/2012     4      7
 4 A     07/2012     3     10
 5 A     09/2012     7     17
 6 A     11/2012     1     18
 7 A     01/2013     2      2
 8 A     03/2013     8     10
 9 A     05/2013    13     23
10 A     07/2013     2     25
# … with 14 more rows

数据

df1 <- structure(list(ID = c("A", "A", "A", "A", "A", "A", "A", "A", 
"A", "A", "A", "A", "B", "B", "B", "B", "B", "B", "B", "B", "B", 
"B", "B", "B"), date = c("01/2012", "03/2012", "05/2012", "07/2012", 
"09/2012", "11/2012", "01/2013", "03/2013", "05/2013", "07/2013", 
"09/2013", "11/2013", "01/2012", "03/2012", "05/2012", "07/2012", 
"09/2012", "11/2012", "01/2013", "03/2013", "05/2013", "07/2013", 
"09/2013", "11/2013"), value = c(1L, 2L, 4L, 3L, 7L, 1L, 2L, 
8L, 13L, 2L, 5L, 2L, 3L, 9L, 1L, 0L, 12L, 3L, 1L, 4L, 3L, 3L, 
1L, 1L)), class = "data.frame", row.names = c(NA, -24L))

可选方案:

library(data.tabe)
setDT(df)[,cumsum:=cumsum(value),by=.(ID,substr(date, 4,8))][]

为了完成这里是一个基本的 R 选项 -

transform(df, cumsum = ave(value, ID, sub('.*/', '', date), FUN = cumsum))

#   ID    date value cumsum
#1   A 01/2012     1      1
#2   A 03/2012     2      3
#3   A 05/2012     4      7
#4   A 07/2012     3     10
#5   A 09/2012     7     17
#6   A 11/2012     1     18
#7   A 01/2013     2      2
#8   A 03/2013     8     10
#9   A 05/2013    13     23
#10  A 07/2013     2     25
#11  A 09/2013     5     30
#12  A 11/2013     2     32
#13  B 01/2012     3      3
#14  B 03/2012     9     12
#15  B 05/2012     1     13
#16  B 07/2012     0     13
#17  B 09/2012    12     25
#18  B 11/2012     3     28
#19  B 01/2013     1      1
#20  B 03/2013     4      5
#21  B 05/2013     3      8
#22  B 07/2013     3     11
#23  B 09/2013     1     12
#24  B 11/2013     1     13