R data.table remove rows conditionally among groups

I have this example dataset. The real data has millions of rows, so I would appreciate a data.table solution, but a tidyverse solution is fine too:

dat1 = data.frame(name = c("X1", "X1", "X1", "X2", "X2", "X2", "X2", "X2", "X2"), 
              year = c(2015,2016,2017,2015,2016,2016,2017,2017, 2018),
              choice = c("o","o","o","o","o","r","r","o","o")
)
dat1

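The logic I need to apply is:

If only choice "o" exists for a given name and year combination, keep the rows with "o".

If both choices "o" and "r" exist for a given name and year combination, keep the "r" rows and remove the "o" rows. I do not want to name the name/year combinations explicitly.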

Does this work:

library(dplyr)
dat1 %>% group_by(name ,year) %>% filter(all(choice == 'o' )|choice == 'r')
# A tibble: 7 x 3
# Groups:   name, year [7]
#   name   year choice
#   <chr> <dbl> <chr> 
# 1 X1     2015 o     
# 2 X1     2016 o     
# 3 X1     2017 o     
# 4 X2     2015 o     
# 5 X2     2016 r     
# 6 X2     2017 r     
# 7 X2     2018 o     

library(data.table)
setDT(dat1)
dat1[, .SD[all(choice == "o") | choice == "r",], by = .(name, year)]
#    name year choice
# 1:   X1 2015      o
# 2:   X1 2016      o
# 3:   X1 2017      o
# 4:   X2 2015      o
# 5:   X2 2016      r
# 6:   X2 2017      r
# 7:   X2 2018      o

(I wrote this before looking at KarthikS's answer, but the logic and result are the same.)
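
Since the real data has millions of rows, it may be worth avoiding `.SD` subsetting inside `by`, which tends to get slow when there are many groups. The following is only a sketch of one alternative: flag the groups that contain an "r" with `:=`, then filter once. It assumes `choice` only ever takes the values "o" and "r" (with other values it would no longer match the condition above), and `has_r` is a helper column name made up for this example.

```r
library(data.table)

dat1 <- data.table(
  name   = c("X1", "X1", "X1", "X2", "X2", "X2", "X2", "X2", "X2"),
  year   = c(2015, 2016, 2017, 2015, 2016, 2016, 2017, 2017, 2018),
  choice = c("o", "o", "o", "o", "o", "r", "r", "o", "o")
)

# has_r (hypothetical helper column) is TRUE for every row of a
# (name, year) group that contains at least one "r"
dat1[, has_r := any(choice == "r"), by = .(name, year)]

# Keep "r" rows, plus every row of groups that have no "r" at all;
# the subset is a new table, so dropping has_r here does not touch dat1
res <- dat1[choice == "r" | !has_r][, has_r := NULL]

# Remove the helper column from the original table as well
dat1[, has_r := NULL]

res
```

Whether this is actually faster on your data is something to benchmark rather than assume.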

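Another option is to convert the column to a factor with the levels specified in a custom order, drop the unused levels with droplevels, and then select the first remaining level. Because 'r' comes before 'o' in that order, the first remaining level is 'r' whenever the group contains an 'r', and 'o' otherwise.
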
library(dplyr)
dat1 %>%
     group_by(name, year) %>%
     filter(choice %in% levels(droplevels(factor(choice, 
           levels = c('r', 'o'))))[1])
# A tibble: 7 x 3
# Groups:   name, year [7]
#  name   year choice
#  <chr> <dbl> <chr> 
#1 X1     2015 o     
#2 X1     2016 o     
#3 X1     2017 o     
#4 X2     2015 o     
#5 X2     2016 r     
#6 X2     2017 r     
#7 X2     2018 o     
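
To see why the factor expression picks the right value, it can help to run it on plain vectors outside of any grouping. This is a small illustration only; `priority` and `first_present` are names introduced here and are not part of the answer above.

```r
# Priority order: "r" outranks "o"
priority <- c("r", "o")

# Hypothetical helper: returns the highest-priority value present in x,
# by dropping unused factor levels and taking the first one left
first_present <- function(x) {
  levels(droplevels(factor(x, levels = priority)))[1]
}

both   <- first_present(c("o", "r"))  # group containing both values
only_o <- first_present(c("o", "o"))  # group containing only "o"

both    # "r"
only_o  # "o"
```

The same idea extends to more than two values by listing them in `priority` in the order they should win.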

The equivalent option with data.table is

library(data.table)
setDT(dat1)[dat1[, .I[choice %in% 
       levels(droplevels(factor(choice, 
           levels = c('r', 'o'))))[1]], .(name, year)]$V1]
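
If you want to convince yourself that the `.SD` version and the `.I` version select the same rows, a quick check along these lines should do. The names `res_sd`, `idx`, `res_i` and `same_rows` are made up for this sketch, and it is run on the example data only.

```r
library(data.table)

dat1 <- data.table(
  name   = c("X1", "X1", "X1", "X2", "X2", "X2", "X2", "X2", "X2"),
  year   = c(2015, 2016, 2017, 2015, 2016, 2016, 2017, 2017, 2018),
  choice = c("o", "o", "o", "o", "o", "r", "r", "o", "o")
)

# .SD version: subset each (name, year) group directly
res_sd <- dat1[, .SD[all(choice == "o") | choice == "r"], by = .(name, year)]

# .I version: collect the row numbers to keep, then subset once
idx <- dat1[, .I[choice %in% levels(droplevels(factor(choice,
                 levels = c("r", "o"))))[1]], by = .(name, year)]$V1
res_i <- dat1[idx]

# fsetequal compares the two tables as sets of rows, ignoring row order
same_rows <- fsetequal(res_sd, res_i)
same_rows
```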