How to select the 'x' most recent values in each group in R?

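I am trying to select/filter the most recent values in each group of an R data frame. For example, I want to select the 3 most recent values (i.e. the dates closest to today) from each name group in the following data frame: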

Player  Date        Result
Sam     03/15/2015  1
Sam     03/22/2015  0
Sam     04/04/2015  2
Sam     04/12/2015  1
Sam     04/18/2015  1
Sam     04/26/2015  0
Sam     08/08/2015  3
Steve   02/17/2015  0
Steve   02/21/2015  0
Steve   03/04/2015  4
Steve   03/11/2015  2
Steve   03/15/2015  1
Steve   03/22/2015  0
Steve   04/12/2015  0
Steve   04/18/2015  2
Steve   04/26/2015  1
Steve   04/29/2015  2
Steve   08/16/2015  4
Jasper  03/15/2015  3
Jasper  03/22/2015  3.5
Jasper  04/04/2015  4
Jasper  04/12/2015  4
Jasper  04/18/2015  5
Jasper  04/26/2015  0

I have already used as.Date() so that R understands the date format, but what code can I use now to select only the 3 (say) most recent values in each group?

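We can use dplyr. We convert 'Date' to the Date class with as.Date. After grouping by 'Player', we arrange the 'Date' column in descending order and use slice to get the 3 most recent rows of each group. If we don't want to change the class of 'Date', we can drop the mutate step and do the conversion inside arrange, i.e. arrange(desc(as.Date(Date, '%m/%d/%Y')))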

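library(dplyr)
df1 %>%
    # parse the month/day/year strings into Date objects
    mutate(Date = as.Date(Date, '%m/%d/%Y')) %>%
    group_by(Player) %>%
    # newest dates first
    arrange(desc(Date)) %>%
    # keep the first 3 rows of each Player group
    slice(1:3)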
#    Player       Date Result
#1 Jasper 2015-04-26      0
#2 Jasper 2015-04-18      5
#3 Jasper 2015-04-12      4
#4    Sam 2015-08-08      3
#5    Sam 2015-04-26      0
#6    Sam 2015-04-18      1
#7  Steve 2015-08-16      4
#8  Steve 2015-04-29      2
#9  Steve 2015-04-26      1
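
The variant without mutate, where the conversion happens inside arrange and 'Date' stays a character column, could look roughly like the sketch below. It rebuilds the question's data inline so it runs on its own; the name res_chr is only illustrative.

```r
library(dplyr)

# Same values as the question's data, rebuilt compactly
df1 <- data.frame(
  Player = rep(c("Sam", "Steve", "Jasper"), times = c(7, 11, 6)),
  Date = c("03/15/2015", "03/22/2015", "04/04/2015", "04/12/2015",
           "04/18/2015", "04/26/2015", "08/08/2015",
           "02/17/2015", "02/21/2015", "03/04/2015", "03/11/2015",
           "03/15/2015", "03/22/2015", "04/12/2015", "04/18/2015",
           "04/26/2015", "04/29/2015", "08/16/2015",
           "03/15/2015", "03/22/2015", "04/04/2015", "04/12/2015",
           "04/18/2015", "04/26/2015"),
  Result = c(1, 0, 2, 1, 1, 0, 3, 0, 0, 4, 2, 1, 0, 0, 2, 1, 2, 4,
             3, 3.5, 4, 4, 5, 0),
  stringsAsFactors = FALSE
)

res_chr <- df1 %>%
  group_by(Player) %>%
  # sort on the parsed date only; the Date column itself is not modified
  arrange(desc(as.Date(Date, "%m/%d/%Y"))) %>%
  # first 3 rows of each Player group, i.e. the 3 newest
  slice(1:3)

res_chr
```

Sorting the raw strings directly would not be a safe shortcut here, because "%m/%d/%Y" text does not sort chronologically once more than one year is involved.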

Or, after grouping by 'Player', we can use top_n, specifying 'n' and the 'wt' variable to order by.

df1 %>%
    mutate(Date = as.Date(Date, '%m/%d/%Y')) %>%
    group_by(Player) %>%
    top_n(n = 3, wt = Date)
#  Player       Date Result
#1    Sam 2015-04-18      1
#2    Sam 2015-04-26      0
#3    Sam 2015-08-08      3
#4  Steve 2015-04-26      1
#5  Steve 2015-04-29      2
#6  Steve 2015-08-16      4
#7 Jasper 2015-04-12      4
#8 Jasper 2015-04-18      5
#9 Jasper 2015-04-26      0
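
In dplyr 1.0.0 and later, top_n is marked as superseded in favour of slice_max. A minimal sketch of the same selection with slice_max, assuming such a dplyr version is installed (the data is rebuilt inline and res_max is just an illustrative name):

```r
library(dplyr)

# Same values as the question's data, rebuilt compactly
df1 <- data.frame(
  Player = rep(c("Sam", "Steve", "Jasper"), times = c(7, 11, 6)),
  Date = c("03/15/2015", "03/22/2015", "04/04/2015", "04/12/2015",
           "04/18/2015", "04/26/2015", "08/08/2015",
           "02/17/2015", "02/21/2015", "03/04/2015", "03/11/2015",
           "03/15/2015", "03/22/2015", "04/12/2015", "04/18/2015",
           "04/26/2015", "04/29/2015", "08/16/2015",
           "03/15/2015", "03/22/2015", "04/04/2015", "04/12/2015",
           "04/18/2015", "04/26/2015"),
  Result = c(1, 0, 2, 1, 1, 0, 3, 0, 0, 4, 2, 1, 0, 0, 2, 1, 2, 4,
             3, 3.5, 4, 4, 5, 0),
  stringsAsFactors = FALSE
)

res_max <- df1 %>%
  mutate(Date = as.Date(Date, "%m/%d/%Y")) %>%
  group_by(Player) %>%
  # rows with the 3 largest (latest) dates in each group
  slice_max(Date, n = 3)

res_max
```

Like top_n, slice_max keeps ties by default, so a group can return more than 3 rows if several rows share the third-latest date; passing with_ties = FALSE caps it at exactly n.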

Using data.table, we convert the 'data.frame' to a 'data.table' (setDT(df1)). We order by 'Date' after converting it to a Date class, group by 'Player', and use head to get the first 3 rows of each group.

library(data.table)
setDT(df1)[order(-as.IDate(Date, '%m/%d/%Y')), head(.SD, 3), by = Player]
#   Player       Date Result
#1:  Steve 08/16/2015      4
#2:  Steve 04/29/2015      2
#3:  Steve 04/26/2015      1
#4:    Sam 08/08/2015      3
#5:    Sam 04/26/2015      0
#6:    Sam 04/18/2015      1
#7: Jasper 04/26/2015      0
#8: Jasper 04/18/2015      5
#9: Jasper 04/12/2015      4

Data

df1 <- structure(list(Player = c("Sam", "Sam", "Sam", "Sam", "Sam", 
"Sam", "Sam", "Steve", "Steve", "Steve", "Steve", "Steve", "Steve", 
"Steve", "Steve", "Steve", "Steve", "Steve", "Jasper", "Jasper", 
"Jasper", "Jasper", "Jasper", "Jasper"), Date = c("03/15/2015", 
"03/22/2015", "04/04/2015", "04/12/2015", "04/18/2015", "04/26/2015", 
"08/08/2015", "02/17/2015", "02/21/2015", "03/04/2015", "03/11/2015", 
"03/15/2015", "03/22/2015", "04/12/2015", "04/18/2015", "04/26/2015", 
"04/29/2015", "08/16/2015", "03/15/2015", "03/22/2015", "04/04/2015", 
"04/12/2015", "04/18/2015", "04/26/2015"), Result = c(1, 0, 2, 
1, 1, 0, 3, 0, 0, 4, 2, 1, 0, 0, 2, 1, 2, 4, 3, 3.5, 4, 4, 5, 
0)), .Names = c("Player", "Date", "Result"),
class = "data.frame", row.names = c(NA,  -24L))