cumsum in grouped data with dplyr

Question

I have a data frame df (it can be downloaded here) of company registrations that looks like this:

    Provider.ID        Local.Authority month year entry exit total
1  1-102642676           Warwickshire    10 2010     2    0     2
2  1-102642676                   Bury    10 2010     1    0     1
3  1-102642676                   Kent    10 2010     1    0     1
4  1-102642676                  Essex    10 2010     1    0     1
5  1-102642676                Lambeth    10 2010     2    0     2
6  1-102642676            East Sussex    10 2010     5    0     5
7  1-102642676       Bristol, City of    10 2010     1    0     1
8  1-102642676              Liverpool    10 2010     1    0     1
9  1-102642676                 Merton    10 2010     1    0     1
10 1-102642676          Cheshire East    10 2010     2    0     2
11 1-102642676               Knowsley    10 2010     1    0     1
12 1-102642676        North Yorkshire    10 2010     1    0     1
13 1-102642676   Kingston upon Thames    10 2010     1    0     1
14 1-102642676               Lewisham    10 2010     1    0     1
15 1-102642676              Wiltshire    10 2010     1    0     1
16 1-102642676              Hampshire    10 2010     1    0     1
17 1-102642676             Wandsworth    10 2010     1    0     1
18 1-102642676                  Brent    10 2010     1    0     1
19 1-102642676            West Sussex    10 2010     1    0     1
20 1-102642676 Windsor and Maidenhead    10 2010     1    0     1
21 1-102642676                  Luton    10 2010     1    0     1
22 1-102642676                Enfield    10 2010     1    0     1
23 1-102642676               Somerset    10 2010     1    0     1
24 1-102642676         Cambridgeshire    10 2010     1    0     1
25 1-102642676             Hillingdon    10 2010     1    0     1
26 1-102642676               Havering    10 2010     1    0     1
27 1-102642676               Solihull    10 2010     1    0     1
28 1-102642676                 Bexley    10 2010     1    0     1
29 1-102642676               Sandwell    10 2010     1    0     1
30 1-102642676            Southampton    10 2010     1    0     1
31 1-102642676               Trafford    10 2010     1    0     1
32 1-102642676                 Newham    10 2010     1    0     1
33 1-102642676         West Berkshire    10 2010     1    0     1
34 1-102642676                Reading    10 2010     1    0     1
35 1-102642676             Hartlepool    10 2010     1    0     1
36 1-102642676              Hampshire     3 2011     1    0     1
37 1-102642676                   Kent     9 2011     0    1    -1
38 1-102642676        North Yorkshire    12 2011     0    1    -1
39 1-102642676         North Somerset    12 2012     2    0     2
40 1-102642676                   Kent    10 2014     1    0     1
41 1-102642676               Somerset     1 2016     0    1    -1

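My goal is to create a variable holding the cumulative sum of the last column (total) for each Local.Authority and each year. total is simply the difference between entry and exit. I tried to do this with dplyr as follows: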

library(dplyr)
 df.1 = df %>% group_by(Local.Authority, year) %>%
  mutate(cum.total = cumsum(total)) %>%
  arrange(year, month, Local.Authority)

which produces the following (wrong) result:

> df.1
Source: local data frame [41 x 8]
Groups: Local.Authority, year [41]

   Provider.ID  Local.Authority month  year entry  exit total cum.total
        <fctr>           <fctr> <int> <int> <int> <int> <int>     <int>
1  1-102642676           Bexley    10  2010     1     0     1        35
2  1-102642676            Brent    10  2010     1     0     1        25
3  1-102642676 Bristol, City of    10  2010     1     0     1        13
4  1-102642676             Bury    10  2010     1     0     1         3
5  1-102642676   Cambridgeshire    10  2010     1     0     1        31
6  1-102642676    Cheshire East    10  2010     2     0     2        17
7  1-102642676      East Sussex    10  2010     5     0     5        12
8  1-102642676          Enfield    10  2010     1     0     1        29
9  1-102642676            Essex    10  2010     1     0     1         5
10 1-102642676        Hampshire    10  2010     1     0     1        23
..         ...              ...   ...   ...   ...   ...   ...       ...

I confirmed that these results are wrong by checking a level of Local.Authority that appears in several years (e.g. Kent):

> check = df.1 %>% filter(Local.Authority == "Kent")
> check
Source: local data frame [3 x 8]
Groups: Local.Authority, year [3]

  Provider.ID Local.Authority month  year entry  exit total cum.total
       <fctr>          <fctr> <int> <int> <int> <int> <int>     <int>
1 1-102642676            Kent    10  2010     1     0     1         4
2 1-102642676            Kent     9  2011     0     1    -1        42
3 1-102642676            Kent    10  2014     1     0     1        44

whereas it should be:

  Provider.ID Local.Authority month  year entry  exit total cum.total
       <fctr>          <fctr> <int> <int> <int> <int> <int>     <int>
1 1-102642676            Kent    10  2010     1     0     1         1
2 1-102642676            Kent     9  2011     0     1    -1         0
3 1-102642676            Kent    10  2014     1     0     1         1

Does anyone know what could be going on for cumsum to return these results? Many thanks.

Answer 1

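When you group by Local.Authority and year, each Kent row ends up in a group of its own, so cumsum just returns total unchanged: 1, -1, 1. It is better to group by Local.Authority only, so that cumsum runs over all of an authority's total values and gives 1, 0, 1: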

 df <- df %>%
      group_by(Local.Authority) %>%
      mutate(cum.to = cumsum(total))

    > df
    Source: local data frame [3 x 8]
    Groups: Local.Authority [1]

      Provider.ID Local.Authority month  year entry  exit total cum.to
            <chr>           <chr> <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl>
    1 1-102642676            Kent    10  2010     1     0     1      1
    2 1-102642676            Kent     9  2011     0     1    -1      0
    3 1-102642676            Kent    10  2014     1     0     1      1
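
One thing the grouped cumsum depends on is row order: within each group the running sum follows the rows as they currently appear. The Kent rows in the question already happen to be chronological, but if they were not, sorting before the cumsum would be needed. A minimal sketch, assuming a small made-up subset of the data (the row order below is deliberately shuffled and is not the order in the original file):

```r
library(dplyr)

# Illustrative rows only, deliberately out of chronological order
df <- data.frame(
  Local.Authority = c("Kent", "Kent", "Kent", "Hampshire", "Hampshire"),
  month = c(10, 10, 9, 3, 10),
  year  = c(2014, 2010, 2011, 2011, 2010),
  total = c(1, 1, -1, 1, 1)
)

running <- df %>%
  arrange(Local.Authority, year, month) %>%  # sort first: cumsum follows row order
  group_by(Local.Authority) %>%              # one running total per authority
  mutate(cum.total = cumsum(total)) %>%
  ungroup()                                  # drop grouping so later steps are not affected

print(running)
```

Calling arrange() after mutate() instead only changes how the rows are displayed; the sums themselves would still have been accumulated in the original row order.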

Answer 2

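I found the solution to my problem. After restarting my R session, I got the result I wanted by grouping by Local.Authority only and then arranging: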

> df.1 = df %>% group_by(Local.Authority) %>%
+ mutate(cum.total = cumsum(total)) %>%
+ arrange(year, month, Local.Authority)
> df.1
Source: local data frame [41 x 8]
Groups: Local.Authority [36]

   Provider.ID  Local.Authority month  year entry  exit total cum.total
        <fctr>           <fctr> <int> <int> <int> <int> <int>     <int>
1  1-102642676           Bexley    10  2010     1     0     1         1
2  1-102642676            Brent    10  2010     1     0     1         1
3  1-102642676 Bristol, City of    10  2010     1     0     1         1
4  1-102642676             Bury    10  2010     1     0     1         1
5  1-102642676   Cambridgeshire    10  2010     1     0     1         1
6  1-102642676    Cheshire East    10  2010     2     0     2         2
7  1-102642676      East Sussex    10  2010     5     0     5         5
8  1-102642676          Enfield    10  2010     1     0     1         1
9  1-102642676            Essex    10  2010     1     0     1         1
10 1-102642676        Hampshire    10  2010     1     0     1         1

Checking "Kent", it now produces the expected result:

> check = df.1 %>% filter(Local.Authority == "Kent")
> check
Source: local data frame [3 x 8]
Groups: Local.Authority [1]

  Provider.ID Local.Authority month  year entry  exit total cum.total
       <fctr>          <fctr> <int> <int> <int> <int> <int>     <int>
1 1-102642676            Kent    10  2010     1     0     1         1
2 1-102642676            Kent     9  2011     0     1    -1         0
3 1-102642676            Kent    10  2014     1     0     1         1
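
A note on why the restart may have mattered. The wrong values in the question (Bury 3, Kent 4, Essex 5) are exactly what a cumsum over the whole data frame in its original row order gives, so the grouping seems to have been ignored altogether. Grouping by Local.Authority and year would instead have returned total unchanged, since the printed header shows 41 groups for 41 rows. One known way for grouping to be ignored is plyr being attached after dplyr, so that plyr's mutate masks dplyr's. The thread does not say whether that happened here, so treat it as a guess. The sketch below uses the first four rows of the data plus the two later Kent rows (a subset chosen for illustration) and namespace-qualified calls as a guard against masking:

```r
library(dplyr)

# Rows 1-4 of the question's data plus the two later Kent rows (subset for illustration)
df <- data.frame(
  Local.Authority = c("Warwickshire", "Bury", "Kent", "Essex", "Kent", "Kent"),
  month = c(10, 10, 10, 10, 9, 10),
  year  = c(2010, 2010, 2010, 2010, 2011, 2014),
  total = c(2, 1, 1, 1, -1, 1)
)

# No grouping at all: the sum runs down the whole frame
ungrouped <- df %>% dplyr::mutate(cum.total = cumsum(total))

# Grouped by both columns: every group here has one row, so cum.total equals total
by_both <- df %>%
  dplyr::group_by(Local.Authority, year) %>%
  dplyr::mutate(cum.total = cumsum(total)) %>%
  dplyr::ungroup()

# Grouped by Local.Authority only: the running total the question asks for
by_la <- df %>%
  dplyr::group_by(Local.Authority) %>%
  dplyr::mutate(cum.total = cumsum(total)) %>%
  dplyr::ungroup()

ungrouped$cum.total[1:4]                           # 2 3 4 5
by_la$cum.total[by_la$Local.Authority == "Kent"]   # 1 0 1
```

Writing dplyr::mutate explicitly means the dplyr version is used regardless of which packages were attached later.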
