随着团队数量的增加动态排名

Dynamic rankings with increasing number of teams

我的数据的小代表:

Date <- as.Date(rep(c("2015-05-14", "2015-05-15","2015-05-16"),c(4,2,1)))
TEAM1 <- c("GSW","SAS","MIL","ATL","GSW","SAC","LAL")
TEAM2 <- c("HOU","MIN","NOP","LAL","SAS","TOR","GSW")
PCW_TEAM1 <- c(0.88,0.72,0.34,0.46,0.87,0.28,0.24)
PCW_TEAM2 <- c(0.67,0.31,0.52,0.23,0.74,0.48,0.90)
df <- data.frame(cbind(Date,TEAM1,TEAM2,PCW_TEAM1,PCW_TEAM2), stringsAsFactors=F)
df

    Date TEAM1 TEAM2 PCW_TEAM1 PCW_TEAM2
 1 16569   GSW   HOU      0.88      0.67
 2 16569   SAS   MIN      0.72      0.31
 3 16569   MIL   NOP      0.34      0.52
 4 16569   ATL   LAL      0.46      0.23
 5 16570   GSW   SAS      0.87      0.74
 6 16570   SAC   TOR      0.28      0.48
 7 16571   LAL   GSW      0.24       0.9

想象一下,这是 NBA 赛季的前 7 场比赛。在第一个日期 (16569) 有四场比赛,因此排名将超出 8。但是,一旦我们添加下一个日期 (16570),就会有两场比赛,而且只有两支新球队,因为 GSW 和 SAS 已经参加了第一场比赛日期。

我想根据最后可用日期的胜率对独特的球队进行排名。输出将如下所示:

   Date TEAM1 TEAM2 PCW_TEAM1 PCW_TEAM2 RANK_TEAM1 RANK_TEAM2
1 16569   GSW   HOU      0.88      0.67          1          3
2 16569   SAS   MIN      0.72      0.31          2          7
3 16569   MIL   NOP      0.34      0.52          6          4
4 16569   ATL   LAL      0.46      0.23          5          8
5 16570   GSW   SAS      0.87      0.74          1          2
6 16570   SAC   TOR      0.28      0.48          9          5
7 16571   LAL   GSW      0.24       0.9         10          1

请注意,在第 5 行,GSW 的获胜百分比为 0.87,排名为 1。第一行的获胜百分比更高 (0.88),但也是 GSW。

在此示例中,有 7 场比赛和 10 支不同的球队。根据真实数据,有 30 个独特的团队。

unique(c(TEAM1,TEAM2))
[1] "GSW" "SAS" "MIL" "ATL" "SAC" "LAL" "HOU" "MIN" "NOP" "TOR"

我想创建一个向量来收集每个独特团队的最后可用获胜百分比,然后根据该信息对团队进行排名,但不知道该怎么做,也不知道这是否是最佳方法。

TEAMs <- c(TEAM1,TEAM2)
teamsall <- unique(TEAMs)
PCWs <- c(PCW_TEAM1,PCW_TEAM2)
Dates <- c(Date,Date)
u = order(sapply(1:length(teamsall),function(x) {u=match(TEAMs,teamsall)==x;PCWs[u][which.max(Dates[u])]}),decreasing=T)
df$RANK1 = match(TEAM1,teamsall[u])
df$RANK2 = match(TEAM2,teamsall[u])
df

我觉得这可能是其中一种方式

我认为以长格式处理这些数据会更容易。下面代码的总体思路是扩展数据,以便每个团队在每个日期都有一个条目(当他们没有参加比赛时填充 NA 值)。然后,将数据按球队进行分组,将胜率前移填充NA值(使用na.locf from zoo),以确定每个日期的排名。然后,转换回宽格式。

## Rearrange the data into a long format
long <- do.call(rbind, lapply(1:2, function(i)
    setNames(df[,c("Date", grep(i, names(df), value=T))], c("Date", "TEAM", "PCW"))))
long$index <- rep(1:nrow(df))  # used to transform back to wide

## Expand to include an entry for each team at each date
dat <- merge(expand.grid(Date=unique(long$Date), TEAM=unique(long$TEAM)), long, all=T)

## Fill in the NA values for each team, carrying forward previous win%
library(zoo)  # na.locf
dat <- cbind(do.call(rbind, lapply(split(dat, dat$TEAM), function(x)
    transform(x, PCW=na.locf(PCW, na.rm=F)))))

## Then, group by date and order (I would leave it in this form)
library(dplyr)
dat %>% group_by(Date) %>%
  mutate(RANK=match(TEAM, TEAM[order(PCW, decreasing = T)])) -> res

## Put it back into wide format
out <- do.call(rbind, lapply(split(res[,-which(names(res)=="index")], res$index), function(x)
    setNames(cbind(x[1,], x[2,-1]), c("Date", paste0(names(x)[-1], 1), paste0(names(x)[-1], 2)))))
#    Date TEAM1 PCW1 RANK1 TEAM2 PCW2 RANK2
# 1 16569   GSW 0.88     1   HOU 0.67     3
# 2 16569   SAS 0.72     2   MIN 0.31     7
# 3 16569   MIL 0.34     6   NOP 0.52     4
# 4 16569   ATL 0.46     5   LAL 0.23     8
# 5 16570   GSW 0.87     1   SAS 0.74     2
# 6 16570   SAC 0.28     9   TOR 0.48     5
# 7 16571   GSW  0.9     1   LAL 0.24    10

请注意,我没有采取预防措施来确保 TEAM1 和 TEAM2 保持正确的顺序,例如,在最后一行中,与您期望的输出相比,团队发生了切换。