Moving averages with multiple GroupBy

Here is a small part of my data:

Team <- rep(c("ind", "sas", "ind", "sas"),c(4,8,2,4))

Player <- c("Paul George", "David West", "Roy Hibbert",
            "Paul George", "Tim Duncan", "Manuel Ginobili",
            "Tony Parker", "Boris Diaw","Danny Green", 
            "Kawhi Leonard", "Matt Bonner", "Patty Mills",
            "George Hill", "C.J.Miles","Tim Duncan",
            "Manuel Ginobili", "Tony Parker", "Boris Diaw")

Team_PTS <- c(101,101,101,98,105,105,105,105,
              105,105,105,105,98,98,89,89,89,128)

Date <- as.Date(c("2015-05-14", "2015-05-14", "2015-05-14",
               "2015-05-16","2015-05-15", "2015-05-15", "2015-05-15",
               "2015-05-15","2015-05-15", "2015-05-15", "2015-05-15",
               "2015-05-15","2015-05-16","2015-05-16","2015-05-29",
               "2015-05-29","2015-05-29","2015-06-03"))

Team_Gamenumber <- rep(c(1,2,1,2,2,3),c(3,1,8,2,3,1))

df <- data.frame(Team,Player,Team_PTS,Date, Team_Gamenumber)

df

   Team          Player Team_PTS       Date Team_Gamenumber Desired_output
1   ind     Paul George      101 2015-05-14               1            101
2   ind      David West      101 2015-05-14               1            101
3   ind     Roy Hibbert      101 2015-05-14               1            101
4   ind     Paul George       98 2015-05-16               2           99.5
5   sas      Tim Duncan      105 2015-05-15               1            105
6   sas Manuel Ginobili      105 2015-05-15               1            105
7   sas     Tony Parker      105 2015-05-15               1            105
8   sas      Boris Diaw      105 2015-05-15               1            105
9   sas     Danny Green      105 2015-05-15               1            105
10  sas   Kawhi Leonard      105 2015-05-15               1            105
11  sas     Matt Bonner      105 2015-05-15               1            105
12  sas     Patty Mills      105 2015-05-15               1            105
13  ind     George Hill       98 2015-05-16               2           99.5
14  ind       C.J.Miles       98 2015-05-16               2           99.5
15  sas      Tim Duncan       89 2015-05-29               2             97
16  sas Manuel Ginobili       89 2015-05-29               2             97
17  sas     Tony Parker       89 2015-05-29               2             97
18  sas      Boris Diaw      128 2015-06-03               3         107.33

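The desired output variable is the moving (cumulative) average of team points, computed separately for each team (sas and ind in this example).

I tried: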

library(dplyr)
df %>% group_by(Team) %>%
       mutate(cumavg_PTS = cumsum(Team_PTS) / seq_along(Team_PTS))

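However, because the data are organized by player, this produces the wrong output. Note that Boris Diaw missed game 2 with sas but played in game 3.

I also think cumsum is not the right approach here, because the average would be affected by the number of players in each game.

107.33 is the average of the first 3 sas games: (105+89+128)/3.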
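To make the problem concrete, here is a small base R sketch using only the sas scores from the data above. The variable names (`sas_pts`, `naive`, `per_game`) are illustrative and not part of the original code. It shows that a row-wise cumulative mean weights each game by the number of player rows.

```r
# Team_PTS for the sas rows, in the order they appear in df:
# 8 player rows for game 1, 3 for game 2, 1 for game 3
sas_pts <- c(rep(105, 8), rep(89, 3), 128)

# Row-wise cumulative mean: every player row counts as one observation
naive <- cumsum(sas_pts) / seq_along(sas_pts)
tail(naive, 1)   # 1235 / 12, about 102.92

# Per-game mean: every game counts exactly once
per_game <- mean(c(105, 89, 128))
per_game         # 322 / 3, about 107.33
```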

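Your Team_PTS column is effectively redundant: it holds the points scored by the whole Team in game Team_Gamenumber, but the data.frame contains one row per player per game (for each game that player took part in). Every record for a given Team+Team_Gamenumber therefore has the same Team_PTS value.

So you can aggregate the original df on Team+Team_Gamenumber, taking the first element of the redundant Team_PTS vector for each group, since all values within a group are identical anyway. As part of this aggregate() call, I also handle the case where Team_PTS values are stored as strings, which older versions of data.frame() convert to factors. The simplest way I know to do this is to coerce the factor to actual strings and then to numeric. (In the data as shown above, Team_PTS is already numeric, so the coercion is a harmless no-op.)

The aggregated table can then be given a Desired_Output column by grouping by Team and applying the cumsum(x)/seq_along(x) formula. That result can then be merged with df to produce the desired result.

Also note that I manually reordered output to match your expected output, so that we can easily verify by eye that it matches.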

df <- data.frame(
  Team = rep(c('ind','sas','ind','sas'), c(4,8,2,4)),
  Player = c('Paul George','David West','Roy Hibbert','Paul George',
             'Tim Duncan','Manuel Ginobili','Tony Parker','Boris Diaw',
             'Danny Green','Kawhi Leonard','Matt Bonner','Patty Mills',
             'George Hill','C.J.Miles','Tim Duncan','Manuel Ginobili',
             'Tony Parker','Boris Diaw'),
  Team_PTS = c(101,101,101,98,105,105,105,105,105,105,105,105,98,98,89,89,89,128),
  Date = as.Date(c('2015-05-14','2015-05-14','2015-05-14','2015-05-16',
                   '2015-05-15','2015-05-15','2015-05-15','2015-05-15',
                   '2015-05-15','2015-05-15','2015-05-15','2015-05-15',
                   '2015-05-16','2015-05-16','2015-05-29','2015-05-29',
                   '2015-05-29','2015-06-03')),
  Team_Gamenumber = rep(c(1,2,1,2,2,3), c(3,1,8,2,3,1))
)
output <- merge(
  df,
  transform(
    # one row per Team + Team_Gamenumber, keeping the first (redundant) Team_PTS
    aggregate(cbind(Team_PTS = as.double(as.character(Team_PTS))) ~ Team + Team_Gamenumber,
              df, `[`, 1),
    # cumulative mean of the per-game scores within each team
    Desired_Output = ave(Team_PTS, Team, FUN = function(x) cumsum(x) / seq_along(x))
  )
)[, c(names(df), 'Desired_Output')]
# reorder the rows to match the expected output in the question
output[c(1:4,9,10,7,8,13,14,11,12,5,6,16:18,15),]
##    Team          Player Team_PTS       Date Team_Gamenumber Desired_Output
## 1   ind     Paul George      101 2015-05-14               1       101.0000
## 2   ind      David West      101 2015-05-14               1       101.0000
## 3   ind     Roy Hibbert      101 2015-05-14               1       101.0000
## 4   ind     Paul George       98 2015-05-16               2        99.5000
## 9   sas      Tim Duncan      105 2015-05-15               1       105.0000
## 10  sas Manuel Ginobili      105 2015-05-15               1       105.0000
## 7   sas     Tony Parker      105 2015-05-15               1       105.0000
## 8   sas      Boris Diaw      105 2015-05-15               1       105.0000
## 13  sas     Danny Green      105 2015-05-15               1       105.0000
## 14  sas   Kawhi Leonard      105 2015-05-15               1       105.0000
## 11  sas     Matt Bonner      105 2015-05-15               1       105.0000
## 12  sas     Patty Mills      105 2015-05-15               1       105.0000
## 5   ind     George Hill       98 2015-05-16               2        99.5000
## 6   ind       C.J.Miles       98 2015-05-16               2        99.5000
## 16  sas      Tim Duncan       89 2015-05-29               2        97.0000
## 17  sas Manuel Ginobili       89 2015-05-29               2        97.0000
## 18  sas     Tony Parker       89 2015-05-29               2        97.0000
## 15  sas      Boris Diaw      128 2015-06-03               3       107.3333
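The nested call above can be hard to follow, so here is a step-by-step base R sketch of the same idea on a reduced, hypothetical data frame (`df_small`, `games`, and `merged` are illustrative names, not from the original answer). It uses unique() in place of aggregate(), which assumes Team_PTS really is constant within each Team + Team_Gamenumber.

```r
# Hypothetical stand-in for df: one row per player per game
df_small <- data.frame(
  Team            = c("ind", "ind", "ind", "sas", "sas", "sas", "sas"),
  Team_Gamenumber = c(1, 1, 2, 1, 1, 2, 3),
  Team_PTS        = c(101, 101, 98, 105, 105, 89, 128)
)

# Step 1: collapse to one row per Team + Team_Gamenumber
games <- unique(df_small[, c("Team", "Team_Gamenumber", "Team_PTS")])
games <- games[order(games$Team, games$Team_Gamenumber), ]

# Step 2: cumulative mean of the game scores within each team
games$Desired_Output <- ave(games$Team_PTS, games$Team,
                            FUN = function(x) cumsum(x) / seq_along(x))

# Step 3: attach the per-game result to every player row
merged <- merge(df_small, games)
merged
```

After step 2, `games` has five rows (two for ind, three for sas), and step 3 copies each game's cumulative mean onto all player rows of that game.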

Using dplyr, although it is a bit messy:

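# .keep_all = TRUE keeps Team_PTS; current dplyr versions otherwise
# drop every column that is not named in distinct()
df %>% distinct(Team, Team_Gamenumber, .keep_all = TRUE) %>%
       group_by(Team) %>%
       mutate(cumavg_PTS = cummean(Team_PTS)) %>%
       select(Team, Team_Gamenumber, cumavg_PTS) %>%
       inner_join(df, .)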

Joining by: c("Team", "Team_Gamenumber")
   Team          Player Team_PTS       Date Team_Gamenumber cumavg_PTS
1   ind     Paul George      101 2015-05-14               1   101.0000
2   ind      David West      101 2015-05-14               1   101.0000
3   ind     Roy Hibbert      101 2015-05-14               1   101.0000
4   ind     Paul George       98 2015-05-16               2    99.5000
5   sas      Tim Duncan      105 2015-05-15               1   105.0000
6   sas Manuel Ginobili      105 2015-05-15               1   105.0000
7   sas     Tony Parker      105 2015-05-15               1   105.0000
8   sas      Boris Diaw      105 2015-05-15               1   105.0000
9   sas     Danny Green      105 2015-05-15               1   105.0000
10  sas   Kawhi Leonard      105 2015-05-15               1   105.0000
11  sas     Matt Bonner      105 2015-05-15               1   105.0000
12  sas     Patty Mills      105 2015-05-15               1   105.0000
13  ind     George Hill       98 2015-05-16               2    99.5000
14  ind       C.J.Miles       98 2015-05-16               2    99.5000
15  sas      Tim Duncan       89 2015-05-29               2    97.0000
16  sas Manuel Ginobili       89 2015-05-29               2    97.0000
17  sas     Tony Parker       89 2015-05-29               2    97.0000
18  sas      Boris Diaw      128 2015-06-03               3   107.3333

Here is another way. I would use data.table:

require(data.table)
setDT(df)[, cavg := { dups = !duplicated(Team_Gamenumber)
                      cumsum(Team_PTS * dups) / cumsum(dups)
                    }, by = Team]

Or simply write a function:

foo <- function(points, game) {
    dups = !duplicated(game)
    cumsum(points * dups) / cumsum(dups)
}
setDT(df)[, cavg := foo(Team_PTS, Team_Gamenumber), by = Team]
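To see why this trick works, here is a worked base R example on a short, made-up vector (the names `first_row`, `running_total`, and `games_so_far` are illustrative). Multiplying by the logical flag zeroes out every repeated row of a game, so each game contributes its score once to the numerator and adds one to the denominator.

```r
points <- c(105, 105, 89, 89, 128)   # team score repeated per player row
game   <- c(1, 1, 2, 2, 3)           # game number for each row

# TRUE only on the first row of each game
first_row <- !duplicated(game)
first_row        # TRUE FALSE TRUE FALSE TRUE

# Running sum of scores, counting each game once
running_total <- cumsum(points * first_row)
running_total    # 105 105 194 194 322

# Number of distinct games seen so far
games_so_far <- cumsum(first_row)
games_so_far     # 1 1 2 2 3

# Cumulative per-game average, repeated on every row of the game
running_total / games_so_far   # 105 105 97 97 107.3333
```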

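There is still a difference between @bgoldst's and @jeremycg's solutions. @bgoldst's computes the cumulative average on the data sorted by Team, Team_Gamenumber, whereas @jeremycg's computes it while retaining the original row order.

For example, starting from your df, swap game numbers 1 and 2 for the ind rows: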

setDT(df)[c(1:4,13:14), Team_Gamenumber := c(2,2,2,1,1,1)]
setDF(df)

Then try both versions.


We can get both answers while preserving the original order of the data, as follows:

# @jeremycg's
setDT(df)[, cavg := foo(Team_PTS, Team_Gamenumber), by = Team]
# @bgoldst's
setDT(df)[order(Team, Team_Gamenumber), cavg := foo(Team_PTS, Team_Gamenumber), by = Team]
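As a rough illustration of the difference, the base R sketch below applies both orderings to the six ind rows after the swap. `cum_game_mean` is a hypothetical helper with the same logic as foo() above, and the write-back through `o` mimics what the `order(...)` subset does in the data.table call; it is a sketch, not the answerers' code.

```r
# Hypothetical helper, same logic as foo() above
cum_game_mean <- function(points, game) {
  first_row <- !duplicated(game)
  cumsum(points * first_row) / cumsum(first_row)
}

# ind rows after the swap, in their original row order
points <- c(101, 101, 101, 98, 98, 98)
game   <- c(2, 2, 2, 1, 1, 1)

# Original row order: the 101-point game is encountered first
in_row_order <- cum_game_mean(points, game)
in_row_order     # 101.0 101.0 101.0 99.5 99.5 99.5

# Sorted by game number: compute on the sorted rows, then write the
# results back to the original row positions
o <- order(game)
in_game_order <- numeric(length(points))
in_game_order[o] <- cum_game_mean(points[o], game[o])
in_game_order    # 99.5 99.5 99.5 98.0 98.0 98.0
```

Which result is right depends on whether Team_Gamenumber or the row order reflects the true chronology of the games.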