Mean for number of entries with certain ID in R?
So let's say the data set I'm working with in R looks like this:
player at_bat opponent_name game result
Torri_Hunter 1 Pittsburgh Pirates 1 home run
Torri_Hunter 2 Pittsburgh Pirates 1 triple
Torri_Hunter 3 Pittsburgh Pirates 1 strikeout
Torri_Hunter 4 Pittsburgh Pirates 1 strikeout
Torri_Hunter 1 Pittsburgh Pirates 2 groundout
Torri_Hunter 2 Pittsburgh Pirates 2 home run
Torri_Hunter 3 Pittsburgh Pirates 2 flyout
Torri_Hunter 1 Pittsburgh Pirates 2 home run
Torri_Hunter 2 Pittsburgh Pirates 3 triple
Torri_Hunter 3 Pittsburgh Pirates 3 strikeout
Torri_Hunter 4 Pittsburgh Pirates 3 strikeout
Torri_Hunter 1 Detroit Tigers 1 home run
Torri_Hunter 2 Detroit Tigers 1 home run
Torri_Hunter 3 Detroit Tigers 1 home run
Torri_Hunter 4 Detroit Tigers 1 strikeout
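(I realize I misspelled Torii's name, bear with me.)
Ultimately I'd like to calculate the percentage of home runs in each game of a series, so that the result looks like this: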
player opponent_name game_1s game_2s game_3s
Torri Hunter Pittsburgh Pirates 25% 50% 0%
Torri Hunter Detroit Tigers 75% -- --
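I can dplyr::filter the results, compute the stats for each game by ID, and then export to a .csv, where I can get the averages in Excel (which is what I've been doing), but there must be a quicker way to do this entirely in R. Any ideas?
You could do: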
library(dplyr)
df %>%
group_by(player, opponent_name, game) %>%
summarise(p = sum(result == "home run") / n())
Which gives:
#Source: local data frame [4 x 4]
#Groups: player, opponent_name
#
# player opponent_name game p
#1 Torri_Hunter Detroit Tigers 1 0.75
#2 Torri_Hunter Pittsburgh Pirates 1 0.25
#3 Torri_Hunter Pittsburgh Pirates 2 0.50
#4 Torri_Hunter Pittsburgh Pirates 3 0.00
To match your desired output, you could also do:
df %>%
group_by(player, opponent_name, game) %>%
summarise(p = mean(result == "home run")) %>%
tidyr::spread(game, p) %>%
arrange(desc(opponent_name)) %>%
setNames(c(names(.)[1:2], paste0("game_", names(.)[3:5], "s"))) %>%
mutate_each(funs(ifelse(is.na(.), "--", paste0(. * 100, "%"))), -(player:opponent_name))
Which gives:
#Source: local data frame [2 x 5]
#
# player opponent_name game_1s game_2s game_3s
#1 Torri_Hunter Pittsburgh Pirates 25% 50% 0%
#2 Torri_Hunter Detroit Tigers 75% -- --
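In more recent releases of these packages, spread() has been superseded by pivot_wider(), and mutate_each() and funs() have been deprecated. Here is a rough sketch of the same pipeline in the newer idiom, assuming dplyr 1.0.0 or later and tidyr 1.0.0 or later. The df built here is only my reconstruction of the sample data in the question (without the at_bat column), and res is just an illustrative name:

```r
library(dplyr)
library(tidyr)

# Reconstruction of the question's sample data (assumption: at_bat is not needed)
df <- data.frame(
  player = "Torri_Hunter",
  opponent_name = rep(c("Pittsburgh Pirates", "Detroit Tigers"), times = c(11, 4)),
  game = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 1, 1, 1, 1),
  result = c("home run", "triple", "strikeout", "strikeout",
             "groundout", "home run", "flyout", "home run",
             "triple", "strikeout", "strikeout",
             "home run", "home run", "home run", "strikeout"),
  stringsAsFactors = FALSE
)

res <- df %>%
  group_by(player, opponent_name, game) %>%
  # mean() of a logical vector is the share of TRUE values
  summarise(p = mean(result == "home run"), .groups = "drop") %>%
  # one column per game, named game_1s, game_2s, ...
  pivot_wider(names_from = game, values_from = p, names_glue = "game_{game}s") %>%
  arrange(desc(opponent_name)) %>%
  # turn proportions into percent strings, and missing games into "--"
  mutate(across(starts_with("game_"),
                ~ ifelse(is.na(.x), "--", paste0(.x * 100, "%"))))

res
```

The names_glue argument takes care of the renaming that setNames() does in the original pipeline.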
How about writing a couple of functions to help you? Let's assume your data frame is called df.
perc_res <- function(opponent, game = 1, player = "Torri_Hunter", result = "home run") {
  # at-bats of this player against this opponent in this game
  rows <- df$player == player & df$opponent_name == opponent & df$game == game
  # share of those at-bats with the given result (NaN if there are none)
  sum(rows & df$result == result) / sum(rows)
}
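Then you can build an output data frame that looks like:
out.df <- data.frame(Opponent = levels(factor(df$opponent_name)), Player = "Torri_Hunter", stringsAsFactors = FALSE)
out.df$game1s <- sapply(out.df$Opponent, perc_res, game = 1)
and so on for the other games.
If you want to handle more players later, you can use mapply.
PS: I haven't actually run the code yet, so there may still be some errors in it. But I think this should at least get you started!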
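As a hedged sketch of the mapply idea: the helper below, share_of, is a hypothetical variation on perc_res that also takes the player and the data frame as arguments, and mapply calls it once for every player/opponent/game combination that actually occurs in the data. The data frame is my reconstruction of the question's sample, and combos is an illustrative name.

```r
# Reconstruction of the question's sample data (assumption: at_bat is not needed)
df <- data.frame(
  player = "Torri_Hunter",
  opponent_name = rep(c("Pittsburgh Pirates", "Detroit Tigers"), times = c(11, 4)),
  game = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 1, 1, 1, 1),
  result = c("home run", "triple", "strikeout", "strikeout",
             "groundout", "home run", "flyout", "home run",
             "triple", "strikeout", "strikeout",
             "home run", "home run", "home run", "strikeout"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: share of at-bats with a given result
share_of <- function(player, opponent, game, data, result = "home run") {
  rows <- data$player == player & data$opponent_name == opponent & data$game == game
  mean(data$result[rows] == result)  # NaN if no at-bats match
}

# Every player/opponent/game combination present in the data
combos <- unique(df[, c("player", "opponent_name", "game")])

# mapply walks the three columns in parallel; data is passed once via MoreArgs
combos$p <- mapply(share_of, combos$player, combos$opponent_name, combos$game,
                   MoreArgs = list(data = df), USE.NAMES = FALSE)

combos
```

Because the combinations come from the data itself, adding rows for another player needs no change to the call.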
A data.table solution with casting would be:
require(data.table)
setDT(dat)
percentage <- dat[,mean(result == "home run"), by = c("player", "opponent_name", "game")]
Result:
> percentage
player opponent_name game V1
1: Torri_Hunter Pittsburgh Pirates 1 0.25
2: Torri_Hunter Pittsburgh Pirates 2 0.50
3: Torri_Hunter Pittsburgh Pirates 3 0.00
4: Torri_Hunter Detroit Tigers 1 0.75
To convert it into the output requested in the question:
require(reshape2)
dcast(percentage, player + opponent_name ~ game , value.var = "V1")
Result:
player opponent_name 1 2 3
1 Torri_Hunter Detroit Tigers 0.75 NA NA
2 Torri_Hunter Pittsburgh Pirates 0.25 0.5 0
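The cast table still holds proportions and NA rather than the "25%" and "--" strings shown in the question. One possible way to get there is sketched here with base R only. It uses tapply instead of data.table and reshape2, so it is a swapped-in technique rather than a continuation of the code above. dat is my reconstruction of the sample data, and wide and out are illustrative names.

```r
# Reconstruction of the question's sample data (assumption: at_bat is not needed)
dat <- data.frame(
  player = "Torri_Hunter",
  opponent_name = rep(c("Pittsburgh Pirates", "Detroit Tigers"), times = c(11, 4)),
  game = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 1, 1, 1, 1),
  result = c("home run", "triple", "strikeout", "strikeout",
             "groundout", "home run", "flyout", "home run",
             "triple", "strikeout", "strikeout",
             "home run", "home run", "home run", "strikeout"),
  stringsAsFactors = FALSE
)

# Opponent x game matrix of home-run shares; empty cells come back as NA
wide <- with(dat, tapply(result == "home run", list(opponent_name, game), mean))

# Percent strings for played games, "--" for games that did not happen
out <- ifelse(is.na(wide), "--", paste0(wide * 100, "%"))
colnames(out) <- paste0("game_", colnames(out), "s")

out
```

This only splits by opponent and game, so with several players you would need to subset by player first or add player as another grouping factor.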