Is there a way to filter rows in R based on a categorical column and a date column?

I have a dataset in R that looks like this:

ClientID Category Date
Person1 CategoryA 2020-09-01
Person1 CategoryA 2020-09-30
Person2 CategoryA 2020-07-25
Person2 CategoryA 2020-08-31
Person1 CategoryB 2020-03-15
Person1 CategoryB 2020-09-14
Person2 CategoryB 2020-06-17
Person2 CategoryB 2020-10-10

What I'd like to do is filter it to return only the row with the most recent date in each category for each client, so that it looks like this:

ClientID Category Date
Person1 CategoryA 2020-09-30
Person1 CategoryB 2020-09-14
Person2 CategoryA 2020-08-31
Person2 CategoryB 2020-10-10

I tried

library(plyr)

dataSet %>% filter(Category == "CategoryA",Date == max(Date))

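I knew as soon as I typed it that this wouldn't work, but I don't know where to go from here. I've considered subsetting the data by category (there are only 4), but I'm still lost on how to filter by the max date for each client (at least at that point I could rbind() the results of each subset into one final data table). But, alas, I'm stuck.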

Thanks in advance for your help.

Try

library(dplyr)

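dataSet %>% 
   group_by(ClientID, Category) %>%    # column is named ClientID in the question's data
   mutate(max_date = max(Date)) %>%    # latest date within each client/category group
   filter(Date == max_date)            # keep only the rows that carry that date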

I think you could try

df %>%
  group_by(ClientID, Category) %>%
  filter(Date == max(Date))

which gives

  ClientID Category  Date
  <chr>    <chr>     <date>
1 Person1  CategoryA 2020-09-30
2 Person2  CategoryA 2020-08-31
3 Person1  CategoryB 2020-09-14
4 Person2  CategoryB 2020-10-10
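
One thing to keep in mind with filter(Date == max(Date)) is that it keeps every row tied for the latest date within a group. If exactly one row per client and category is wanted, a minimal sketch using slice_max() could look like the following. It assumes dplyr 1.0.0 or later, and the data frame is rebuilt by hand here only so the snippet runs on its own.

```r
library(dplyr)

# Example data from the question, rebuilt by hand so the snippet is self-contained
df <- data.frame(
  ClientID = c("Person1", "Person1", "Person2", "Person2",
               "Person1", "Person1", "Person2", "Person2"),
  Category = rep(c("CategoryA", "CategoryB"), each = 4),
  Date = as.Date(c("2020-09-01", "2020-09-30", "2020-07-25", "2020-08-31",
                   "2020-03-15", "2020-09-14", "2020-06-17", "2020-10-10"))
)

latest <- df %>%
  group_by(ClientID, Category) %>%
  # with_ties = FALSE returns a single row per group even if dates are tied
  slice_max(Date, n = 1, with_ties = FALSE) %>%
  ungroup()

print(latest)
```

With the sample data there are no ties, so both approaches should return the same four rows, possibly in a different order.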

A base R option using subset + ave

subset(
  df,
  Date == ave(Date, ClientID, Category, FUN = max)
)

gives

  ClientID  Category       Date
2  Person1 CategoryA 2020-09-30
4  Person2 CategoryA 2020-08-31
6  Person1 CategoryB 2020-09-14
8  Person2 CategoryB 2020-10-10

A data.table option

> setDT(df)[, lapply(.SD, max), by = .(ClientID, Category)]
   ClientID  Category       Date
1:  Person1 CategoryA 2020-09-30
2:  Person2 CategoryA 2020-08-31
3:  Person1 CategoryB 2020-09-14
4:  Person2 CategoryB 2020-10-10
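
Note that lapply(.SD, max) takes the maximum of every non-grouping column independently. That is fine here because Date is the only other column, but with more columns the values could come from different rows. A sketch of a row-preserving variant, using the row-index idiom with .I, might look like this. The Note column is a hypothetical extra column added only to show that whole rows are kept.

```r
library(data.table)

# Example data from the question, plus a hypothetical Note column
dt <- data.table(
  ClientID = c("Person1", "Person1", "Person2", "Person2",
               "Person1", "Person1", "Person2", "Person2"),
  Category = rep(c("CategoryA", "CategoryB"), each = 4),
  Date = as.Date(c("2020-09-01", "2020-09-30", "2020-07-25", "2020-08-31",
                   "2020-03-15", "2020-09-14", "2020-06-17", "2020-10-10")),
  Note = paste0("row", 1:8)
)

# .I holds the row numbers of dt; which.max() picks the latest row in each group
idx <- dt[, .I[which.max(Date)], by = .(ClientID, Category)]$V1

# Subset by those row numbers so every column comes from the same original row
latest <- dt[idx]
print(latest)
```

Because which.max() returns only the first maximum, this keeps one row per group even when dates are tied.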

Data

> dput(df)
structure(list(ClientID = c("Person1", "Person1", "Person2", 
"Person2", "Person1", "Person1", "Person2", "Person2"), Category = c("CategoryA",
"CategoryA", "CategoryA", "CategoryA", "CategoryB", "CategoryB",
"CategoryB", "CategoryB"), Date = structure(c(18506, 18535, 18468,
18505, 18336, 18519, 18430, 18545), class = "Date")), row.names = c(NA,
-8L), class = "data.frame")

Another option is to use dplyr::slice.

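I'm not saying it's better than the filter option; it's just another way. It can also be shortened by using which.max(Date) inside slice instead of arranging first.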

library(dplyr)

foodf <- structure(list(
  ClientID = c("Person1", "Person1", "Person2", "Person2",
               "Person1", "Person1", "Person2", "Person2"),
  Category = c("CategoryA", "CategoryA", "CategoryA", "CategoryA",
               "CategoryB", "CategoryB", "CategoryB", "CategoryB"),
  Date = structure(c(18506, 18535, 18468, 18505,
                     18336, 18519, 18430, 18545), class = "Date")),
  row.names = c(NA, -8L), class = "data.frame")
foodf %>% 
  arrange(ClientID, Category, Date) %>%
  group_by(ClientID, Category) %>%
  slice(n())
#> # A tibble: 4 x 3
#> # Groups:   ClientID, Category [4]
#>   ClientID Category  Date      
#>   <chr>    <chr>     <date>    
#> 1 Person1  CategoryA 2020-09-30
#> 2 Person1  CategoryB 2020-09-14
#> 3 Person2  CategoryA 2020-08-31
#> 4 Person2  CategoryB 2020-10-10
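
The shorter which.max(Date) variant mentioned above can be sketched as follows. The data frame is rebuilt by hand so the snippet runs on its own, and the ungroup() at the end is an optional addition rather than part of the original suggestion.

```r
library(dplyr)

# Example data from the question, rebuilt by hand so the snippet is self-contained
df <- data.frame(
  ClientID = c("Person1", "Person1", "Person2", "Person2",
               "Person1", "Person1", "Person2", "Person2"),
  Category = rep(c("CategoryA", "CategoryB"), each = 4),
  Date = as.Date(c("2020-09-01", "2020-09-30", "2020-07-25", "2020-08-31",
                   "2020-03-15", "2020-09-14", "2020-06-17", "2020-10-10"))
)

latest <- df %>%
  group_by(ClientID, Category) %>%
  # which.max() gives the position of the latest date within each group,
  # so no arrange() step is needed beforehand
  slice(which.max(Date)) %>%
  ungroup()

print(latest)
```

Like slice(n()) after arrange(), this returns a single row per group; if several rows share the latest date, only the first of them is kept.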