Cumulative mean with conditionals
I'm new to R. Here is a small representative sample of my df:
PTS_TeamHome <- c(101,87,94,110,95)
PTS_TeamAway <- c(95,89,105,111,121)
TeamHome <- c("LAL", "HOU", "SAS", "MIA", "LAL")
TeamAway <- c("IND", "LAL", "LAL", "HOU", "NOP")
df <- data.frame(cbind(TeamHome, TeamAway,PTS_TeamHome,PTS_TeamAway))
df
TeamHome TeamAway PTS_TeamHome PTS_TeamAway
LAL IND 101 95
HOU LAL 87 89
SAS LAL 94 105
MIA HOU 110 111
LAL NOP 95 121
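Imagine these are the first five games of a season with 1230 games. For both the home team and the away team, I want to calculate the cumulative points per game (the running mean) at any given point in the season.
The output would look like this: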
TeamHome TeamAway PTS_TeamHome PTS_TeamAway HOMETEAM_AVGCUMPTS ROADTEAM_AVGCUMPTS
1 LAL IND 101 95 101 95
2 HOU LAL 87 89 87 95
3 SAS LAL 94 105 94 98.33
4 MIA HOU 110 111 110 99
5 LAL NOP 95 121 97.5 121
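Notice how the formula works for the home team in the fifth game. Since LAL is the home team, it looks up how many points LAL has scored so far, whether playing at home or away. In this case (101 + 89 + 105 + 95) / 4 = 97.5.
Here is what I tried, without success: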
lst <- list()
for(i in 1:nrow(df)) lst[[i]] <- ( cumsum(df[which(df$TEAM1[1:i]==df$TEAM1[i]),df$PTS_TeamAway,0])
+ cumsum(df[which(df$TEAM2[1:i]==df$TEAM1[i]),df$PTS_TeamHome,0]) )
/ #divided by number of games
df$HOMETEAM_AVGCUMPTS <- unlist(lst)
I wanted to compute the cumulative PTS and then divide it by the number of games, but none of my attempts worked.
lst <- list()
for(i in 1:nrow(df)) lst[[i]] <- mean(c(df$PTS_TeamHome[1:i][df$TeamHome[1:i] == df$TeamHome[i]],
df$PTS_TeamAway[1:i][df$TeamAway[1:i] == df$TeamHome[i]]))
df$HOMETEAM_AVGCUMPTS <- unlist(lst)
lst2 <- list()
for(i in 1:nrow(df)) lst2[[i]] <- mean(c(df$PTS_TeamAway[1:i][df$TeamAway[1:i] == df$TeamAway[i]],
df$PTS_TeamHome[1:i][df$TeamHome[1:i] == df$TeamAway[i]]))
df$ROADTEAM_AVGCUMPTS <- unlist(lst2)
df
# TeamHome TeamAway PTS_TeamHome PTS_TeamAway HOMETEAM_AVGCUMPTS ROADTEAM_AVGCUMPTS
# 1 LAL IND 101 95 101 95
# 2 HOU LAL 87 89 87 95
# 3 SAS LAL 94 105 94 98.33333
# 4 MIA HOU 110 111 110 99
# 5 LAL NOP 95 121 97.5 121
The approach is split into two loops. In each one we take the mean of two vectors, combined in the form mean(c(vec1, vec2)).
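The first vector is the set of points the home team scored when playing at home (team in col1, pts in col3). The second vector is the set of points the home team scored when playing away (team in col2, pts in col4). We use a for loop because it makes it easy to control how many rows are considered in the subset. With df$PTS_TeamHome[1:i], the set is limited to past games plus the current game. We then subset that vector with [df$TeamHome[1:i] == df$TeamHome[i]]. In plain language, the expression reads: "teams in the TeamHome column, up to the current game, that equal the home team currently playing". With these arguments, "future" games cannot contaminate the analysis.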
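To make the subsetting concrete, here is a minimal sketch of a single loop iteration, i = 5 (LAL at home). It assumes the points columns are numeric; the names home_pts and away_pts are only for illustration and do not appear in the loop itself.

```r
# Same sample data, built directly so the points columns stay numeric
df <- data.frame(
  TeamHome = c("LAL", "HOU", "SAS", "MIA", "LAL"),
  TeamAway = c("IND", "LAL", "LAL", "HOU", "NOP"),
  PTS_TeamHome = c(101, 87, 94, 110, 95),
  PTS_TeamAway = c(95, 89, 105, 111, 121),
  stringsAsFactors = FALSE
)

i <- 5  # fifth game: LAL is the home team

# Points LAL scored at home in games 1..i
home_pts <- df$PTS_TeamHome[1:i][df$TeamHome[1:i] == df$TeamHome[i]]
# Points LAL scored on the road in games 1..i
away_pts <- df$PTS_TeamAway[1:i][df$TeamAway[1:i] == df$TeamHome[i]]

home_pts                     # 101 95
away_pts                     # 89 105
mean(c(home_pts, away_pts))  # 97.5
```

The result matches the (101 + 89 + 105 + 95) / 4 = 97.5 worked out in the question.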
For the data, I set the stringsAsFactors argument to FALSE and converted the points columns to class numeric. See below.
Data
PTS_TeamHome <- c(101,87,94,110,95)
PTS_TeamAway <- c(95,89,105,111,121)
TeamHome <- c("LAL", "HOU", "SAS", "MIA", "LAL")
TeamAway <- c("IND", "LAL", "LAL", "HOU", "NOP")
df <- data.frame(cbind(TeamHome, TeamAway, PTS_TeamHome, PTS_TeamAway), stringsAsFactors = FALSE)
df[3:4] <- lapply(df[3:4], as.numeric)
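The numeric conversion is needed because cbind() on plain vectors builds a matrix, and a matrix can hold only one type. A small sketch of that coercion, using two of the columns (the names m and df_ok are illustrative):

```r
PTS_TeamHome <- c(101, 87, 94, 110, 95)
TeamHome <- c("LAL", "HOU", "SAS", "MIA", "LAL")

# cbind() on atomic vectors returns a matrix; mixing character and numeric
# input coerces everything to character
m <- cbind(TeamHome, PTS_TeamHome)
typeof(m)                       # "character"

# Passing the vectors straight to data.frame() keeps each column's own type
df_ok <- data.frame(TeamHome, PTS_TeamHome, stringsAsFactors = FALSE)
sapply(df_ok, class)
is.numeric(df_ok$PTS_TeamHome)  # TRUE
```

Dropping cbind() is therefore an alternative to converting the columns afterwards.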
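I think you should restructure your data into a tidier format, with two rows per game: one for the away team and one for the home team. Data in a tidy/long format is much easier to work with.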
library(dplyr)
library(tidyr)
df %>%
  mutate(game = row_number()) %>%
  gather(location, team, TeamHome, TeamAway) %>%
  gather(location2, points, PTS_TeamHome, PTS_TeamAway) %>%
  filter(
    (location == "TeamHome" & location2 == "PTS_TeamHome") |
      (location == "TeamAway" & location2 == "PTS_TeamAway")
  ) %>%
  select(-location2) %>%
  arrange(game) %>%
  group_by(team) %>%
  mutate(run_mean_points = cummean(points))
Data
# note that cbind() is removed.
df <- data.frame(TeamHome, TeamAway,PTS_TeamHome,PTS_TeamAway, stringsAsFactors = FALSE)
Source: local data frame [10 x 5]
Groups: team
game location team points run_mean_points
1 1 TeamHome LAL 101 101.00000
2 1 TeamAway IND 95 95.00000
3 2 TeamHome HOU 87 87.00000
4 2 TeamAway LAL 89 95.00000
5 3 TeamHome SAS 94 94.00000
6 3 TeamAway LAL 105 98.33333
7 4 TeamHome MIA 110 110.00000
8 4 TeamAway HOU 111 99.00000
9 5 TeamHome LAL 95 97.50000
10 5 TeamAway NOP 121 121.00000
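For readers who want to see the long-format idea without dplyr or tidyr, here is a rough base R sketch. It is an illustration under the same sample data, not the code from this answer; the object name long and the inline running-mean function are assumptions made for the example.

```r
df <- data.frame(
  TeamHome = c("LAL", "HOU", "SAS", "MIA", "LAL"),
  TeamAway = c("IND", "LAL", "LAL", "HOU", "NOP"),
  PTS_TeamHome = c(101, 87, 94, 110, 95),
  PTS_TeamAway = c(95, 89, 105, 111, 121),
  stringsAsFactors = FALSE
)

n <- nrow(df)

# Stack home rows on top of away rows: two rows per game
long <- data.frame(
  game     = rep(seq_len(n), times = 2),
  location = rep(c("TeamHome", "TeamAway"), each = n),
  team     = c(df$TeamHome, df$TeamAway),
  points   = c(df$PTS_TeamHome, df$PTS_TeamAway),
  stringsAsFactors = FALSE
)

# Sort chronologically; ties keep their original order, so the home row
# of each game stays ahead of the away row
long <- long[order(long$game), ]

# Running mean of points within each team, in game order
long$run_mean_points <- ave(long$points, long$team,
                            FUN = function(x) cumsum(x) / seq_along(x))
long
```

The run_mean_points column should agree with the table printed above, for example 97.5 for LAL in game 5.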
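Here is a short loop version that runs only once per unique team name (rather than twice per row). The idea is to preallocate a matrix of the required size and then run a short for loop over the unique team names, filling in the correct entries of the matrix as we go. We create both the matrix and a temporary copy of the data in transposed form so that values are picked up and filled in game by game, that is, by row of the original data rather than by column (R's default), because the sequence of games runs by row.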
## Transpose the data once
tempdf <- t(df)
## Create transposed matrix with future column names
mat <- matrix(NA, 2, nrow(df))
rownames(mat) <- c("HOMETEAM_AVGCUMPTS", "ROADTEAM_AVGCUMPTS")
## Create a vector of unique team names
indx <- as.character(unique(unlist(df[1:2])))
## Run the loop only over the unique team names
for (i in indx) {
indx2 <- tempdf[1:2, ] == i
temp <- tempdf[3:4, ][indx2]
mat[indx2] <- cumsum(temp)/seq_along(temp)
}
## Combine result with the original data
cbind(df, t(mat))
# TeamHome TeamAway PTS_TeamHome PTS_TeamAway HOMETEAM_AVGCUMPTS ROADTEAM_AVGCUMPTS
# 1 LAL IND 101 95 101.0 95.00000
# 2 HOU LAL 87 89 87.0 95.00000
# 3 SAS LAL 94 105 94.0 98.33333
# 4 MIA HOU 110 111 110.0 99.00000
# 5 LAL NOP 95 121 97.5 121.00000
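The reason for the transpose can be checked on the sample data. The sketch below assumes nothing beyond base R; teams, pts, and is_lal are illustrative names standing in for the corresponding rows of the transposed data.

```r
# Transposed layout: one column per game, home in row 1, away in row 2
teams <- rbind(TeamHome = c("LAL", "HOU", "SAS", "MIA", "LAL"),
               TeamAway = c("IND", "LAL", "LAL", "HOU", "NOP"))
pts <- rbind(PTS_TeamHome = c(101, 87, 94, 110, 95),
             PTS_TeamAway = c(95, 89, 105, 111, 121))

# Logical indexing walks a matrix column by column. Each column is one game,
# so the selected values come out in chronological order.
is_lal <- teams == "LAL"
pts[is_lal]                                   # 101 89 105 95
cumsum(pts[is_lal]) / seq_along(pts[is_lal])  # 101 95 98.33333 97.5

# With games in rows instead, the same indexing returns every home game
# first and only then the away games, which breaks the running mean
t(pts)[t(is_lal)]                             # 101 95 89 105
```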
Transpose. Here is one way that reproduces the loop in @DavidArenburg's answer:
sv <- t(df[3:4])
tv <- t(df[1:2])
df[c("homeavg","awayavg")] <- t(ave(sv,tv,FUN=cummean))
cummean comes from library(dplyr); if needed, you can swap it for a base R analogue. The same goes for the column names.
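One possible base R analogue is a one-line helper built from cumsum() and seq_along(). The name cummean_base is hypothetical, chosen here only to avoid clashing with dplyr's function.

```r
# Hypothetical helper: running mean without dplyr
cummean_base <- function(x) cumsum(x) / seq_along(x)

# LAL's points in game order
cummean_base(c(101, 89, 105, 95))  # 101 95 98.33333 97.5
```

It can be passed as FUN in the ave() call in place of cummean.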
Or interleave. All of the transposing above is hard to follow. Instead, you can interleave the vectors, using Arun's approach:
interleave <- function(a,b) c(a,b)[order(c(seq_along(a), seq_along(b)))]
unleave <- function(x) split(x,1:2)
sv2 <- interleave(df$PTS_TeamHome,df$PTS_TeamAway)
tv2 <- interleave(df$TeamHome,df$TeamAway)
df[c("homeavg","awayavg")] <- unleave(ave(sv2,tv2,FUN=cummean))
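To see what the two helpers do, here is a small sketch on the first three games only. The vectors home and away are illustrative inputs; interleave and unleave are repeated so the block runs on its own.

```r
interleave <- function(a, b) c(a, b)[order(c(seq_along(a), seq_along(b)))]
unleave <- function(x) split(x, 1:2)

home <- c(101, 87, 94)   # home points, games 1 to 3
away <- c(95, 89, 105)   # away points, games 1 to 3

# Alternate home and away so the vector runs in game order
z <- interleave(home, away)
z            # 101 95 87 89 94 105

# split() recycles 1:2 along the vector, undoing the interleave:
# element `1` holds the home values, element `2` the away values
unleave(z)
```

Because unleave() returns a two-element list in home, away order, it can be assigned directly to the two new columns.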