Is there a way to filter a row in R based on a categorical column and a date column
I have a dataset in R that looks like this:
ClientID | Category | Date |
---|---|---|
Person1 | CategoryA | 2020-09-01 |
Person1 | CategoryA | 2020-09-30 |
Person2 | CategoryA | 2020-07-25 |
Person2 | CategoryA | 2020-08-31 |
Person1 | CategoryB | 2020-03-15 |
Person1 | CategoryB | 2020-09-14 |
Person2 | CategoryB | 2020-06-17 |
Person2 | CategoryB | 2020-10-10 |
What I would like to do is filter it down to only the row with the most recent date in each category for each client, so that it looks like this:
ClientID | Category | Date |
---|---|---|
Person1 | CategoryA | 2020-09-30 |
Person1 | CategoryB | 2020-09-14 |
Person2 | CategoryA | 2020-08-31 |
Person2 | CategoryB | 2020-10-10 |
I tried:
library(plyr)
dataSet %>% filter(Category == "CategoryA",Date == max(Date))
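I knew as soon as I typed it that this wouldn't work, but I don't know where to go from here. I have considered subsetting the data by category (there are only 4), but I am still lost on how to filter by the maximum date for each client (at least at that point I could rbind() the results of each subset into one final data table). But, alas, I am stuck.
Thanks in advance for your help.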
Try:
library(dplyr)
dataSet %>%
  group_by(ClientID, Category) %>%   # the column is ClientID, not Client_ID
  mutate(max_date = max(Date)) %>%   # latest date within each client/category group
  filter(Date == max_date)
I think you can try:
df %>%
  group_by(ClientID, Category) %>%
  filter(Date == max(Date))
This gives:
ClientID Category Date
<chr> <chr> <date>
1 Person1 CategoryA 2020-09-30
2 Person2 CategoryA 2020-08-31
3 Person1 CategoryB 2020-09-14
4 Person2 CategoryB 2020-10-10
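As a self-contained sketch of the grouped filter above, the data from the question can be rebuilt inline and the pipeline run end to end. The name `latest` and the trailing `ungroup()` are additions for illustration, not part of the original answer. Note that `filter(Date == max(Date))` keeps every row tied for the latest date, so a group can return more than one row if its maximum date is repeated.

```r
library(dplyr)

# Same values as the table in the question
df <- data.frame(
  ClientID = c("Person1", "Person1", "Person2", "Person2",
               "Person1", "Person1", "Person2", "Person2"),
  Category = rep(c("CategoryA", "CategoryB"), each = 4),
  Date = as.Date(c("2020-09-01", "2020-09-30", "2020-07-25", "2020-08-31",
                   "2020-03-15", "2020-09-14", "2020-06-17", "2020-10-10"))
)

# max(Date) is evaluated once per ClientID/Category group
latest <- df %>%
  group_by(ClientID, Category) %>%
  filter(Date == max(Date)) %>%
  ungroup()  # drop the grouping so later steps are not grouped by accident

print(latest)
```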
A base R option using subset + ave:
subset(
df,
Date == ave(Date, ClientID, Category, FUN = max)
)
which gives:
ClientID Category Date
2 Person1 CategoryA 2020-09-30
4 Person2 CategoryA 2020-08-31
6 Person1 CategoryB 2020-09-14
8 Person2 CategoryB 2020-10-10
A data.table option:
> setDT(df)[, lapply(.SD, max), by = .(ClientID, Category)]
ClientID Category Date
1: Person1 CategoryA 2020-09-30
2: Person2 CategoryA 2020-08-31
3: Person1 CategoryB 2020-09-14
4: Person2 CategoryB 2020-10-10
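One caveat worth illustrating: `lapply(.SD, max)` takes the maximum of each non-grouping column independently. That matches the desired result here because Date is the only such column. If the table had further columns, a row-wise pick such as `.SD[which.max(Date)]` would be needed to keep whole rows. The sketch below assumes a hypothetical extra column `Amount` that is not in the question, and the names `dt`, `colwise` and `rowwise` are illustrative.

```r
library(data.table)

dt <- data.table(
  ClientID = c("Person1", "Person1", "Person2", "Person2"),
  Category = "CategoryA",
  Date = as.Date(c("2020-09-01", "2020-09-30", "2020-07-25", "2020-08-31")),
  Amount = c(500, 100, 300, 200)  # hypothetical extra column, not in the question
)

# Column-wise max: Date and Amount are maximised independently, so Person1
# is paired with Amount 500 even though that value sits on the 2020-09-01 row
colwise <- dt[, lapply(.SD, max), by = .(ClientID, Category)]

# Row-wise pick: keep the whole row holding the latest Date in each group
rowwise <- dt[, .SD[which.max(Date)], by = .(ClientID, Category)]

print(colwise)
print(rowwise)
```

Unlike a `Date == max(Date)` comparison, `which.max()` returns only the first maximum, so this variant yields exactly one row per group even when dates tie.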
Data:
> dput(df)
structure(list(ClientID = c("Person1", "Person1", "Person2",
"Person2", "Person1", "Person1", "Person2", "Person2"), Category = c("CategoryA",
"CategoryA", "CategoryA", "CategoryA", "CategoryB", "CategoryB",
"CategoryB", "CategoryB"), Date = structure(c(18506, 18535, 18468,
18505, 18336, 18519, 18430, 18545), class = "Date")), row.names = c(NA,
-8L), class = "data.frame")
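Another option is to use dplyr::slice. Thanks.
I'm not saying it is better than the filter option; it is just another way. It can also be shortened by using which.max(Date) inside slice instead of arranging first.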
library(dplyr)
foodf <- structure(list(ClientID = c("Person1", "Person1", "Person2",
"Person2", "Person1", "Person1", "Person2", "Person2"), Category = c("CategoryA",
"CategoryA", "CategoryA", "CategoryA", "CategoryB", "CategoryB",
"CategoryB", "CategoryB"), Date = structure(c(18506, 18535, 18468,
18505, 18336, 18519, 18430, 18545), class = "Date")), row.names = c(NA,
-8L), class = "data.frame")
foodf %>%
arrange(ClientID, Category, Date) %>%
group_by(ClientID, Category) %>%
slice(n())
#> # A tibble: 4 x 3
#> # Groups: ClientID, Category [4]
#> ClientID Category Date
#> <chr> <chr> <date>
#> 1 Person1 CategoryA 2020-09-30
#> 2 Person1 CategoryB 2020-09-14
#> 3 Person2 CategoryA 2020-08-31
#> 4 Person2 CategoryB 2020-10-10
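The shortened form mentioned in this answer might look like the sketch below, which skips `arrange()` and picks the row with `which.max(Date)` directly. The second variant uses `slice_max()`, which assumes dplyr 1.0.0 or later. The names `shorter` and `via_slice_max` are illustrative.

```r
library(dplyr)

foodf <- data.frame(
  ClientID = c("Person1", "Person1", "Person2", "Person2",
               "Person1", "Person1", "Person2", "Person2"),
  Category = rep(c("CategoryA", "CategoryB"), each = 4),
  Date = as.Date(c("2020-09-01", "2020-09-30", "2020-07-25", "2020-08-31",
                   "2020-03-15", "2020-09-14", "2020-06-17", "2020-10-10"))
)

# No arrange() needed: which.max() gives the position of the latest date in each group
shorter <- foodf %>%
  group_by(ClientID, Category) %>%
  slice(which.max(Date)) %>%
  ungroup()

# Equivalent with slice_max(); with_ties = FALSE keeps a single row per group
via_slice_max <- foodf %>%
  group_by(ClientID, Category) %>%
  slice_max(Date, n = 1, with_ties = FALSE) %>%
  ungroup()

print(shorter)
```

Both variants return one row per group even if the latest date is duplicated, which differs from `filter(Date == max(Date))`.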