cumsum in grouped data with dplyr
I have a data frame df (which can be downloaded here) of company registrations that looks like this:
Provider.ID Local.Authority month year entry exit total
1 1-102642676 Warwickshire 10 2010 2 0 2
2 1-102642676 Bury 10 2010 1 0 1
3 1-102642676 Kent 10 2010 1 0 1
4 1-102642676 Essex 10 2010 1 0 1
5 1-102642676 Lambeth 10 2010 2 0 2
6 1-102642676 East Sussex 10 2010 5 0 5
7 1-102642676 Bristol, City of 10 2010 1 0 1
8 1-102642676 Liverpool 10 2010 1 0 1
9 1-102642676 Merton 10 2010 1 0 1
10 1-102642676 Cheshire East 10 2010 2 0 2
11 1-102642676 Knowsley 10 2010 1 0 1
12 1-102642676 North Yorkshire 10 2010 1 0 1
13 1-102642676 Kingston upon Thames 10 2010 1 0 1
14 1-102642676 Lewisham 10 2010 1 0 1
15 1-102642676 Wiltshire 10 2010 1 0 1
16 1-102642676 Hampshire 10 2010 1 0 1
17 1-102642676 Wandsworth 10 2010 1 0 1
18 1-102642676 Brent 10 2010 1 0 1
19 1-102642676 West Sussex 10 2010 1 0 1
20 1-102642676 Windsor and Maidenhead 10 2010 1 0 1
21 1-102642676 Luton 10 2010 1 0 1
22 1-102642676 Enfield 10 2010 1 0 1
23 1-102642676 Somerset 10 2010 1 0 1
24 1-102642676 Cambridgeshire 10 2010 1 0 1
25 1-102642676 Hillingdon 10 2010 1 0 1
26 1-102642676 Havering 10 2010 1 0 1
27 1-102642676 Solihull 10 2010 1 0 1
28 1-102642676 Bexley 10 2010 1 0 1
29 1-102642676 Sandwell 10 2010 1 0 1
30 1-102642676 Southampton 10 2010 1 0 1
31 1-102642676 Trafford 10 2010 1 0 1
32 1-102642676 Newham 10 2010 1 0 1
33 1-102642676 West Berkshire 10 2010 1 0 1
34 1-102642676 Reading 10 2010 1 0 1
35 1-102642676 Hartlepool 10 2010 1 0 1
36 1-102642676 Hampshire 3 2011 1 0 1
37 1-102642676 Kent 9 2011 0 1 -1
38 1-102642676 North Yorkshire 12 2011 0 1 -1
39 1-102642676 North Somerset 12 2012 2 0 2
40 1-102642676 Kent 10 2014 1 0 1
41 1-102642676 Somerset 1 2016 0 1 -1
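My goal is to create a variable that reflects the cumulative sum of the last variable (total) for each Local.Authority and each year. total is simply the difference between entry and exit. I tried to do this with dplyr as follows: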
library(dplyr)
df.1 = df %>% group_by(Local.Authority, year) %>%
mutate(cum.total = cumsum(total)) %>%
arrange(year, month, Local.Authority)
This produces the following (wrong) result:
> df.1
Source: local data frame [41 x 8]
Groups: Local.Authority, year [41]
Provider.ID Local.Authority month year entry exit total cum.total
<fctr> <fctr> <int> <int> <int> <int> <int> <int>
1 1-102642676 Bexley 10 2010 1 0 1 35
2 1-102642676 Brent 10 2010 1 0 1 25
3 1-102642676 Bristol, City of 10 2010 1 0 1 13
4 1-102642676 Bury 10 2010 1 0 1 3
5 1-102642676 Cambridgeshire 10 2010 1 0 1 31
6 1-102642676 Cheshire East 10 2010 2 0 2 17
7 1-102642676 East Sussex 10 2010 5 0 5 12
8 1-102642676 Enfield 10 2010 1 0 1 29
9 1-102642676 Essex 10 2010 1 0 1 5
10 1-102642676 Hampshire 10 2010 1 0 1 23
.. ... ... ... ... ... ... ... ...
I confirmed these results by checking levels of the variable Local.Authority that occur in different years (e.g. Kent):
> check = df.1 %>% filter(Local.Authority == "Kent")
> check
Source: local data frame [3 x 8]
Groups: Local.Authority, year [3]
Provider.ID Local.Authority month year entry exit total cum.total
<fctr> <fctr> <int> <int> <int> <int> <int> <int>
1 1-102642676 Kent 10 2010 1 0 1 4
2 1-102642676 Kent 9 2011 0 1 -1 42
3 1-102642676 Kent 10 2014 1 0 1 44
Whereas it should be:
Provider.ID Local.Authority month year entry exit total cum.total
<fctr> <fctr> <int> <int> <int> <int> <int> <int>
1 1-102642676 Kent 10 2010 1 0 1 1
2 1-102642676 Kent 9 2011 0 1 -1 0
3 1-102642676 Kent 10 2014 1 0 1 1
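Does anyone know what might be causing cumsum to give these results? Many thanks.
When you group by Local.Authority and year, each group contains a single row, so cumsum simply returns total itself (1, -1, 1). It is better to group by Local.Authority only, so that cumsum runs over the totals across years and gives 1, 0, 1: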
df <- df %>%
group_by(Local.Authority) %>%
mutate(cum.to = cumsum(total))
> df
Source: local data frame [3 x 8]
Groups: Local.Authority [1]
Provider.ID Local.Authority month year entry exit total cum.to
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1-102642676 Kent 10 2010 1 0 1 1
2 1-102642676 Kent 9 2011 0 1 -1 0
3 1-102642676 Kent 10 2014 1 0 1 1
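To make the difference between the two groupings concrete, here is a minimal sketch. It rebuilds only the three Kent rows by hand (the data frame name kent and the result names by_year and by_authority are made up for this illustration) and assumes dplyr is installed.

```r
library(dplyr)

# The three Kent rows from the question, rebuilt by hand for illustration.
kent <- data.frame(
  Local.Authority = "Kent",
  month = c(10, 9, 10),
  year  = c(2010, 2011, 2014),
  total = c(1, -1, 1)
)

# Grouping by authority and year: every group holds a single row,
# so the cumulative sum is just `total` itself.
by_year <- kent %>%
  group_by(Local.Authority, year) %>%
  mutate(cum.total = cumsum(total)) %>%
  ungroup()

# Grouping by authority only: the sum runs across years.
by_authority <- kent %>%
  group_by(Local.Authority) %>%
  mutate(cum.total = cumsum(total)) %>%
  ungroup()

by_year$cum.total       # 1 -1 1
by_authority$cum.total  # 1 0 1
```

Neither grouping gives the 4, 42, 44 shown in the question, so those values must have come from something other than the choice of grouping columns.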
I found the solution to my problem. I restarted my session and got the result I wanted by grouping only by Local.Authority and then arranging:
> df.1 = df %>% group_by(Local.Authority) %>%
+ mutate(cum.total = cumsum(total)) %>%
+ arrange(year, month, Local.Authority)
> df.1
Source: local data frame [41 x 8]
Groups: Local.Authority [36]
Provider.ID Local.Authority month year entry exit total cum.total
<fctr> <fctr> <int> <int> <int> <int> <int> <int>
1 1-102642676 Bexley 10 2010 1 0 1 1
2 1-102642676 Brent 10 2010 1 0 1 1
3 1-102642676 Bristol, City of 10 2010 1 0 1 1
4 1-102642676 Bury 10 2010 1 0 1 1
5 1-102642676 Cambridgeshire 10 2010 1 0 1 1
6 1-102642676 Cheshire East 10 2010 2 0 2 2
7 1-102642676 East Sussex 10 2010 5 0 5 5
8 1-102642676 Enfield 10 2010 1 0 1 1
9 1-102642676 Essex 10 2010 1 0 1 1
10 1-102642676 Hampshire 10 2010 1 0 1 1
Checking "Kent" now gives the expected result:
> check = df.1 %>% filter(Local.Authority == "Kent")
> check
Source: local data frame [3 x 8]
Groups: Local.Authority [1]
Provider.ID Local.Authority month year entry exit total cum.total
<fctr> <fctr> <int> <int> <int> <int> <int> <int>
1 1-102642676 Kent 10 2010 1 0 1 1
2 1-102642676 Kent 9 2011 0 1 -1 0
3 1-102642676 Kent 10 2014 1 0 1 1
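A note on why restarting may have helped. The wrong values (Bury 3, Kent 4, Essex 5, and later Kent 42 and 44) match a running total over the whole data frame in its original row order, which suggests the grouping was being ignored. One common cause, which I cannot confirm for this session, is plyr being loaded after dplyr so that plyr's mutate masks dplyr's. The sketch below is a defensive variant under that assumption: it calls dplyr::mutate explicitly and sorts before cumsum, since cumsum follows row order within each group. The data frame df_demo is made up for illustration and is deliberately not in chronological order.

```r
library(dplyr)

# Small made-up example, deliberately out of chronological order.
df_demo <- data.frame(
  Local.Authority = c("Kent", "Kent", "Kent", "Bury"),
  month = c(10, 10, 9, 10),
  year  = c(2014, 2010, 2011, 2010),
  total = c(1, 1, -1, 1)
)

res <- df_demo %>%
  arrange(year, month) %>%                        # sort first so cumsum follows time
  group_by(Local.Authority) %>%
  dplyr::mutate(cum.total = cumsum(total)) %>%    # explicit namespace avoids masking
  ungroup()

# Running total for Kent in chronological order
res$cum.total[res$Local.Authority == "Kent"]
```

In the question's data the rows already happen to be in chronological order, so arranging after mutate gives the same numbers; sorting first just removes that dependency.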