Moving averages with multiple GroupBy
Here is a small part of my data:
Team <- rep(c("ind", "sas", "ind", "sas"),c(4,8,2,4))
Player <- c("Paul George", "David West", "Roy Hibbert",
"Paul George", "Tim Duncan", "Manuel Ginobili",
"Tony Parker", "Boris Diaw","Danny Green",
"Kawhi Leonard", "Matt Bonner", "Patty Mills",
"George Hill", "C.J.Miles","Tim Duncan",
"Manuel Ginobili", "Tony Parker", "Boris Diaw")
Team_PTS <- c(101,101,101,98,105,105,105,105,
105,105,105,105,98,98,89,89,89,128)
Date <- as.Date(c("2015-05-14", "2015-05-14", "2015-05-14",
"2015-05-16","2015-05-15", "2015-05-15", "2015-05-15",
"2015-05-15","2015-05-15", "2015-05-15", "2015-05-15",
"2015-05-15","2015-05-16","2015-05-16","2015-05-29",
"2015-05-29","2015-05-29","2015-06-03"))
Team_Gamenumber <- rep(c(1,2,1,2,2,3),c(3,1,8,2,3,1))
df <- data.frame(Team,Player,Team_PTS,Date, Team_Gamenumber)
df
Team Player Team_PTS Date Team_Gamenumber Desired_output
1 ind Paul George 101 2015-05-14 1 101
2 ind David West 101 2015-05-14 1 101
3 ind Roy Hibbert 101 2015-05-14 1 101
4 ind Paul George 98 2015-05-16 2 99.5
5 sas Tim Duncan 105 2015-05-15 1 105
6 sas Manuel Ginobili 105 2015-05-15 1 105
7 sas Tony Parker 105 2015-05-15 1 105
8 sas Boris Diaw 105 2015-05-15 1 105
9 sas Danny Green 105 2015-05-15 1 105
10 sas Kawhi Leonard 105 2015-05-15 1 105
11 sas Matt Bonner 105 2015-05-15 1 105
12 sas Patty Mills 105 2015-05-15 1 105
13 ind George Hill 98 2015-05-16 2 99.5
14 ind C.J.Miles 98 2015-05-16 2 99.5
15 sas Tim Duncan 89 2015-05-29 2 97
16 sas Manuel Ginobili 89 2015-05-29 2 97
17 sas Tony Parker 89 2015-05-29 2 97
18 sas Boris Diaw 128 2015-06-03 3 107.33
The desired output variable is the moving (cumulative) average of each team's points (sas and ind in this example).
I tried:
library(dplyr)
df %>% group_by(Team) %>%
mutate(cumavg_PTS = cumsum(Team_PTS) / seq_along(Team_PTS))
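However, this produces the wrong output because the data is organized by player, with one row per player per game. For example, Boris Diaw missed game 2 with sas but played in game 3.
Also, I don't think cumsum is the right approach here, because the average is affected by the number of players in each game.
107.33 is the average of sas's first 3 games: (105 + 89 + 128) / 3.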
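To make the discrepancy concrete, here is a minimal sketch using only the sas rows from the data above (the vector name sas_pts is just for illustration). It compares the row-wise cumulative mean from the attempt with the per-game mean that is actually wanted:

```r
# Team_PTS for the sas rows, in their original order:
# 8 player rows for game 1, 3 for game 2, 1 for game 3.
sas_pts <- c(rep(105, 8), rep(89, 3), 128)

# Row-wise cumulative mean, as in the attempt: weighted by players per game.
naive <- cumsum(sas_pts) / seq_along(sas_pts)

# Per-game mean: each game counted once.
per_game <- mean(c(105, 89, 128))

print(naive[length(naive)])  # last row of the naive version
print(per_game)              # desired value for game 3
```

The last naive value is 1235 / 12, roughly 102.92, because game 1 is counted eight times, whereas the desired value is 322 / 3, roughly 107.33.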
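Your Team_PTS column is redundant in the sense that it holds the points the whole Team scored in game Team_Gamenumber, yet the data.frame contains one row per player per game (that the player took part in). Every record for a given Team and Team_Gamenumber therefore carries the same Team_PTS value.
So you can aggregate() the original df on Team and Team_Gamenumber, taking the first element of the redundant Team_PTS vector for each group, since all values in a group are identical anyway. As part of this aggregate() call I also work around Team_PTS values being stored as strings, which the data.frame() call converts to factors (the default in R before 4.0.0). The simplest way I know to do this is to coerce the factor to actual strings and then to numeric. If Team_PTS is already numeric, as in the data.frame() call below, the conversion is harmless.
The aggregated table can then be given a Desired_Output column by grouping on Team and applying the cumsum(x)/seq_along(x) formula. This result can then be merged with df to produce the desired result.
Also note that I manually reordered output to match your expected output, so that we can easily verify by eye that it matches.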
df <- data.frame(Team=rep(c('ind','sas','ind','sas'),c(4,8,2,4)),Player=c('Paul George','David West','Roy Hibbert','Paul George','Tim Duncan','Manuel Ginobili','Tony Parker','Boris Diaw','Danny Green','Kawhi Leonard','Matt Bonner','Patty Mills','George Hill','C.J.Miles','Tim Duncan','Manuel Ginobili','Tony Parker','Boris Diaw'),Team_PTS=c(101,101,101,98,105,105,105,105,105,105,105,105,98,98,89,89,89,128),Date=as.Date(c('2015-05-14','2015-05-14','2015-05-14','2015-05-16','2015-05-15','2015-05-15','2015-05-15','2015-05-15','2015-05-15','2015-05-15','2015-05-15','2015-05-15','2015-05-16','2015-05-16','2015-05-29','2015-05-29','2015-05-29','2015-06-03')),Team_Gamenumber=rep(c(1,2,1,2,2,3),c(3,1,8,2,3,1)));
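# One row per (Team, Team_Gamenumber), keeping the first (redundant) Team_PTS value
agg <- aggregate(cbind(Team_PTS = as.double(as.character(Team_PTS))) ~ Team + Team_Gamenumber, df, `[`, 1)
# Cumulative mean of the per-game points within each team
agg <- transform(agg, Desired_Output = ave(Team_PTS, Team, FUN = function(x) cumsum(x) / seq_along(x)))
# Merge the per-game averages back onto the player-level rows
output <- merge(df, agg)[, c(names(df), 'Desired_Output')]
# Reorder rows by hand to match the expected output
output[c(1:4, 9, 10, 7, 8, 13, 14, 11, 12, 5, 6, 16:18, 15), ]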
## Team Player Team_PTS Date Team_Gamenumber Desired_Output
## 1 ind Paul George 101 2015-05-14 1 101.0000
## 2 ind David West 101 2015-05-14 1 101.0000
## 3 ind Roy Hibbert 101 2015-05-14 1 101.0000
## 4 ind Paul George 98 2015-05-16 2 99.5000
## 9 sas Tim Duncan 105 2015-05-15 1 105.0000
## 10 sas Manuel Ginobili 105 2015-05-15 1 105.0000
## 7 sas Tony Parker 105 2015-05-15 1 105.0000
## 8 sas Boris Diaw 105 2015-05-15 1 105.0000
## 13 sas Danny Green 105 2015-05-15 1 105.0000
## 14 sas Kawhi Leonard 105 2015-05-15 1 105.0000
## 11 sas Matt Bonner 105 2015-05-15 1 105.0000
## 12 sas Patty Mills 105 2015-05-15 1 105.0000
## 5 ind George Hill 98 2015-05-16 2 99.5000
## 6 ind C.J.Miles 98 2015-05-16 2 99.5000
## 16 sas Tim Duncan 89 2015-05-29 2 97.0000
## 17 sas Manuel Ginobili 89 2015-05-29 2 97.0000
## 18 sas Tony Parker 89 2015-05-29 2 97.0000
## 15 sas Boris Diaw 128 2015-06-03 3 107.3333
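The three steps described above (aggregate, cumulative mean per team, merge back) can also be followed on a smaller table. This is only a sketch on a hypothetical reduced data frame called small, with the explicit sort added as a precaution so the cumulative mean does not depend on how aggregate() orders its rows:

```r
# Hypothetical reduced data: one row per player per game, Team_PTS repeated within a game.
small <- data.frame(
  Team            = c("ind", "ind", "sas", "sas", "sas", "sas"),
  Team_Gamenumber = c(1, 2, 1, 1, 2, 3),
  Team_PTS        = c(101, 98, 105, 105, 89, 128)
)

# Step 1: collapse to one row per (Team, Team_Gamenumber), keeping the first Team_PTS.
per_game <- aggregate(Team_PTS ~ Team + Team_Gamenumber, small, `[`, 1)

# Step 2: make the game order explicit, then take the cumulative mean within each team.
per_game <- per_game[order(per_game$Team, per_game$Team_Gamenumber), ]
per_game$cum_avg <- ave(per_game$Team_PTS, per_game$Team,
                        FUN = function(x) cumsum(x) / seq_along(x))

# Step 3: merge the per-game averages back onto the player-level rows.
merged <- merge(small, per_game[, c("Team", "Team_Gamenumber", "cum_avg")])
print(per_game)
```

Here per_game has five rows (two games for ind, three for sas), and merged has the same six rows as small with cum_avg attached.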
Using dplyr, it's a bit messy:
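# .keep_all = TRUE keeps Team_PTS; current dplyr versions otherwise drop the unlisted columns
df %>% distinct(Team, Team_Gamenumber, .keep_all = TRUE) %>%
  group_by(Team) %>%
  mutate(cumavg_PTS = cummean(Team_PTS)) %>%
  select(Team, Team_Gamenumber, cumavg_PTS) %>%
  inner_join(df, .)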
Joining by: c("Team", "Team_Gamenumber")
Team Player Team_PTS Date Team_Gamenumber cumavg_PTS
1 ind Paul George 101 2015-05-14 1 101.0000
2 ind David West 101 2015-05-14 1 101.0000
3 ind Roy Hibbert 101 2015-05-14 1 101.0000
4 ind Paul George 98 2015-05-16 2 99.5000
5 sas Tim Duncan 105 2015-05-15 1 105.0000
6 sas Manuel Ginobili 105 2015-05-15 1 105.0000
7 sas Tony Parker 105 2015-05-15 1 105.0000
8 sas Boris Diaw 105 2015-05-15 1 105.0000
9 sas Danny Green 105 2015-05-15 1 105.0000
10 sas Kawhi Leonard 105 2015-05-15 1 105.0000
11 sas Matt Bonner 105 2015-05-15 1 105.0000
12 sas Patty Mills 105 2015-05-15 1 105.0000
13 ind George Hill 98 2015-05-16 2 99.5000
14 ind C.J.Miles 98 2015-05-16 2 99.5000
15 sas Tim Duncan 89 2015-05-29 2 97.0000
16 sas Manuel Ginobili 89 2015-05-29 2 97.0000
17 sas Tony Parker 89 2015-05-29 2 97.0000
18 sas Boris Diaw 128 2015-06-03 3 107.3333
Here's another way. I'd use data.table:
require(data.table)
setDT(df)[, cavg := { dups = !duplicated(Team_Gamenumber)
cumsum(Team_PTS * dups) / cumsum(dups)
}, by = Team]
Or just write a function:
foo <- function(points, game) {
dups = !duplicated(game)
cumsum(points * dups) / cumsum(dups)
}
setDT(df)[, cavg := foo(Team_PTS, Team_Gamenumber), by = Team]
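To see why the !duplicated() trick works, here is a small base R sketch on a hypothetical single-team vector (the names first_row, running_total and games_so_far are only for illustration). Rows that repeat a game contribute zero to the running total and do not advance the game counter, so they simply carry the current average forward:

```r
# Hypothetical single-team data: two players in game 1, one each in games 2 and 3.
points <- c(105, 105, 89, 128)
game   <- c(1, 1, 2, 3)

first_row     <- !duplicated(game)            # TRUE only on the first row of each game
running_total <- cumsum(points * first_row)   # repeated rows add 0
games_so_far  <- cumsum(first_row)            # distinct games seen so far
cavg          <- running_total / games_so_far

print(cavg)
```

The result is 105, 105, 97 and 322 / 3 (about 107.33), matching the desired per-game cumulative average.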
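There is still a difference between @bgoldst's and @jeremycg's solutions. @bgoldst computes the cumulative average on the data sorted by Team, Team_Gamenumber, whereas @jeremycg's computes it while preserving the original row order.
For example, on your df, swap game numbers 1 and 2 for team ind: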
setDT(df)[c(1:4,13:14), Team_Gamenumber := c(2,2,2,1,1,1)]
setDF(df)
Then try both versions.
We can get both answers while preserving the original order of the data as follows:
# @jeremycg's
setDT(df)[, cavg := foo(Team_PTS, Team_Gamenumber), by = Team]
# @bgoldst's
setDT(df)[order(Team, Team_Gamenumber), cavg := foo(Team_PTS, Team_Gamenumber), by = Team]
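As a rough illustration of the difference, the sketch below reproduces both behaviours in base R on the six ind rows only, after the game numbers have been swapped. The vectors pts and game and the helper variables are assumptions made for this example; foo is the function defined above, repeated so the snippet runs on its own:

```r
foo <- function(points, game) {
  first <- !duplicated(game)
  cumsum(points * first) / cumsum(first)
}

# The six ind rows in their original order, with game numbers 1 and 2 swapped.
pts  <- c(101, 101, 101, 98, 98, 98)
game <- c(2, 2, 2, 1, 1, 1)

# Original row order: the 101-point game is seen first.
orig_order <- foo(pts, game)

# Sorted by game number: compute on the sorted rows, then write back in place.
o <- order(game)
sorted_order <- numeric(length(pts))
sorted_order[o] <- foo(pts[o], game[o])

print(orig_order)
print(sorted_order)
```

In original order the first three rows get 101 and the last three 99.5. After sorting by game number, the rows of game 1 get 98 and the rows of game 2 get 99.5, so the two approaches only agree when the rows already appear in game order.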