Mean for number of entries with certain ID in R?

So let's say the dataset I'm working with in R looks like this:

player      at_bat  opponent_name     game  result
Torri_Hunter    1   Pittsburgh Pirates  1   home run
Torri_Hunter    2   Pittsburgh Pirates  1   triple
Torri_Hunter    3   Pittsburgh Pirates  1   strikeout
Torri_Hunter    4   Pittsburgh Pirates  1   strikeout
Torri_Hunter    1   Pittsburgh Pirates  2   groundout
Torri_Hunter    2   Pittsburgh Pirates  2   home run
Torri_Hunter    3   Pittsburgh Pirates  2   flyout
Torri_Hunter    1   Pittsburgh Pirates  2   home run
Torri_Hunter    2   Pittsburgh Pirates  3   triple
Torri_Hunter    3   Pittsburgh Pirates  3   strikeout
Torri_Hunter    4   Pittsburgh Pirates  3   strikeout
Torri_Hunter    1   Detroit Tigers      1   home run
Torri_Hunter    2   Detroit Tigers      1   home run
Torri_Hunter    3   Detroit Tigers      1   home run
Torri_Hunter    4   Detroit Tigers      1   strikeout

(I realize I misspelled Torii's name, so bear with me.)
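In case it helps anyone reproduce this, here is a sketch of the sample above as a data frame. The name df is an assumption on my part (it is what the answers below use), and the columns are typed the way they appear in the printout.

```r
# Sample data from the question, rebuilt by hand; the name `df` is assumed
df <- data.frame(
  player = "Torri_Hunter",
  at_bat = c(1, 2, 3, 4, 1, 2, 3, 1, 2, 3, 4, 1, 2, 3, 4),
  opponent_name = rep(c("Pittsburgh Pirates", "Detroit Tigers"), times = c(11, 4)),
  game = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 1, 1, 1, 1),
  result = c("home run", "triple", "strikeout", "strikeout",
             "groundout", "home run", "flyout", "home run",
             "triple", "strikeout", "strikeout",
             "home run", "home run", "home run", "strikeout"),
  stringsAsFactors = FALSE
)

str(df)
```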

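Ultimately I want to compute the percentage of at-bats that were home runs in each game of a series, with a result like this: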

                opponent_name       game_1s game_2s game_3s
Torri Hunter    Pittsburgh Pirates  25%     50%     0%
Torri Hunter    Detroit Tigers      75%     --      --

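I could use dplyr::filter on the results, compute the per-game stats by ID, then export to .csv and get the averages in Excel (which is what I have been doing so far), but there has to be a faster way to do this entirely in R. Any ideas?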

You could do this:

library(dplyr)
df %>% 
  group_by(player, opponent_name, game) %>% 
  summarise(p = sum(result == "home run") / n()) 

Which gives:

#Source: local data frame [4 x 4]
#Groups: player, opponent_name
#
#        player      opponent_name game    p
#1 Torri_Hunter     Detroit Tigers    1 0.75
#2 Torri_Hunter Pittsburgh Pirates    1 0.25
#3 Torri_Hunter Pittsburgh Pirates    2 0.50
#4 Torri_Hunter Pittsburgh Pirates    3 0.00
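As a side note on why this works: comparing result to "home run" yields a logical vector, and R treats TRUE as 1 and FALSE as 0, so sum(...) / n() is just the mean of that vector. A tiny base R check, using the four results from Pirates game 2 in the question (the variable names here are only for illustration):

```r
# Results for the Pirates series, game 2, copied from the question's data
game2 <- c("groundout", "home run", "flyout", "home run")

is_hr <- game2 == "home run"   # logical vector: FALSE TRUE FALSE TRUE

sum(is_hr) / length(is_hr)     # count of TRUE divided by number of at-bats
mean(is_hr)                    # same proportion, written more compactly
```

Both expressions return 0.5, which matches the 0.50 row in the output above.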

To match your desired output, you could also do:

df %>% 
  group_by(player, opponent_name, game) %>% 
  summarise(p = mean(result == "home run")) %>%
  tidyr::spread(game, p) %>%
  arrange(desc(opponent_name)) %>%
  setNames(c(names(.)[1:2], paste0("game_", names(.)[3:5], "s"))) %>%
  mutate_each(funs(ifelse(is.na(.), "--", paste0(. * 100, "%"))), -(player:opponent_name))

Which gives:

#Source: local data frame [2 x 5]
#
#        player      opponent_name game_1s game_2s game_3s
#1 Torri_Hunter Pittsburgh Pirates     25%     50%      0%
#2 Torri_Hunter     Detroit Tigers     75%      --      --
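Note that mutate_each() and funs() have since been deprecated in dplyr, and tidyr::spread() has been superseded by pivot_wider(). A rough equivalent for recent versions (I am assuming roughly dplyr 1.0+ and tidyr 1.1+ for across() and names_glue) might look like the sketch below. The data frame is rebuilt from the question so the snippet runs on its own; out is just an illustrative name.

```r
library(dplyr)
library(tidyr)

# Sample data from the question, rebuilt by hand
df <- data.frame(
  player = "Torri_Hunter",
  opponent_name = rep(c("Pittsburgh Pirates", "Detroit Tigers"), times = c(11, 4)),
  game = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 1, 1, 1, 1),
  result = c("home run", "triple", "strikeout", "strikeout",
             "groundout", "home run", "flyout", "home run",
             "triple", "strikeout", "strikeout",
             "home run", "home run", "home run", "strikeout"),
  stringsAsFactors = FALSE
)

out <- df %>%
  group_by(player, opponent_name, game) %>%
  # share of at-bats per game that were home runs
  summarise(p = mean(result == "home run"), .groups = "drop") %>%
  # one column per game, named game_1s, game_2s, ...
  pivot_wider(names_from = game, values_from = p,
              names_glue = "game_{game}s") %>%
  arrange(desc(opponent_name)) %>%
  # format as percentages; games that were never played become "--"
  mutate(across(starts_with("game_"),
                ~ ifelse(is.na(.x), "--", paste0(.x * 100, "%"))))

out
```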

How about writing a function or two to help? Assume your data frame is called df.

perc_res <- function(opponent, game = 1, player = "Torri_Hunter", result = "home run") {
  # rows for this player, opponent and game
  rows <- df$player == player & df$opponent_name == opponent & df$game == game
  # share of those rows whose result matches
  sum(rows & df$result == result) / sum(rows)
}

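Then you can build an output data frame like this:

out.df <- data.frame(Opponent = unique(as.character(df$opponent_name)),
                     Player = "Torri_Hunter", stringsAsFactors = FALSE)
# vapply returns a plain numeric vector, so game1s is an ordinary column rather than a list
out.df$game1s <- vapply(out.df$Opponent, perc_res, numeric(1), game = 1, USE.NAMES = FALSE)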

and so on for the other games. If you want more players later on, you can use mapply.

PS: I haven't actually run the code yet, so there may still be some errors in it. But I think this should at least get you started!
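To make the mapply idea a bit more concrete, here is a minimal sketch. It assumes a variant of the helper that takes the data frame as an argument; perc_res2 and combos are hypothetical names, not part of the code above. It only visits player/opponent/game combinations that actually occur, which avoids 0/0 for games that were never played.

```r
# Sample data from the question, rebuilt by hand
df <- data.frame(
  player = "Torri_Hunter",
  opponent_name = rep(c("Pittsburgh Pirates", "Detroit Tigers"), times = c(11, 4)),
  game = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 1, 1, 1, 1),
  result = c("home run", "triple", "strikeout", "strikeout",
             "groundout", "home run", "flyout", "home run",
             "triple", "strikeout", "strikeout",
             "home run", "home run", "home run", "strikeout"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: share of at-bats with the given result
perc_res2 <- function(data, player, opponent, game, result = "home run") {
  rows <- data$player == player & data$opponent_name == opponent & data$game == game
  sum(rows & data$result == result) / sum(rows)
}

# Every player/opponent/game combination present in the data
combos <- unique(df[, c("player", "opponent_name", "game")])

# mapply walks the three columns in parallel; `data` is passed unchanged each time
combos$p <- mapply(perc_res2,
                   player = combos$player,
                   opponent = combos$opponent_name,
                   game = combos$game,
                   MoreArgs = list(data = df),
                   USE.NAMES = FALSE)

combos
```

With more players in df, the same call should pick them up automatically, since combos is derived from the data.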

A data.table solution, with casting, would be

require(data.table)
setDT(dat)
percentage <- dat[,mean(result == "home run"), by = c("player", "opponent_name", "game")]

Result:

> percentage

         player      opponent_name game   V1
1: Torri_Hunter Pittsburgh Pirates    1 0.25
2: Torri_Hunter Pittsburgh Pirates    2 0.50
3: Torri_Hunter Pittsburgh Pirates    3 0.00
4: Torri_Hunter     Detroit Tigers    1 0.75

To convert this into the output the question asks for:

require(reshape2)
dcast(percentage, player + opponent_name ~ game , value.var = "V1")

Result:

        player      opponent_name    1   2  3
1 Torri_Hunter     Detroit Tigers 0.75  NA NA
2 Torri_Hunter Pittsburgh Pirates 0.25 0.5  0
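As far as I know, reasonably recent data.table versions ship their own dcast method, so reshape2 may not be needed at all. A sketch under that assumption, with the data rebuilt from the question and the summary column named p instead of the default V1 (that name is my choice):

```r
library(data.table)

# Sample data from the question, rebuilt by hand
dat <- data.table(
  player = "Torri_Hunter",
  opponent_name = rep(c("Pittsburgh Pirates", "Detroit Tigers"), times = c(11, 4)),
  game = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 1, 1, 1, 1),
  result = c("home run", "triple", "strikeout", "strikeout",
             "groundout", "home run", "flyout", "home run",
             "triple", "strikeout", "strikeout",
             "home run", "home run", "home run", "strikeout")
)

# Share of home runs per player, opponent and game
percentage <- dat[, .(p = mean(result == "home run")),
                  by = .(player, opponent_name, game)]

# Long to wide: one column per game, NA where a game was not played
wide <- dcast(percentage, player + opponent_name ~ game, value.var = "p")
wide
```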