R data.table remove rows conditionally among groups
I have this example dataset. The real one has millions of rows, so I would appreciate a data.table solution, but a tidyverse solution is fine too:
dat1 = data.frame(name = c("X1", "X1", "X1", "X2", "X2", "X2", "X2", "X2", "X2"),
year = c(2015,2016,2017,2015,2016,2016,2017,2017, 2018),
choice = c("o","o","o","o","o","r","r","o","o")
)
dat1
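The logic I need to apply is:
If only choice "o" exists for a given name and year combination, keep the rows with "o".
If both "o" and "r" exist for a given name and year combination, keep the "r" rows and drop the "o" rows.
I would prefer not to spell out the name and year combinations by hand.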
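One way to read the rule above is as a per-group test, sketched here in base R. The helper name `keep_rows` is hypothetical (it is not from any package), and the sketch assumes `choice` only ever contains "o" or "r":

```r
# Hypothetical helper: given the choices inside ONE name/year group,
# return a logical vector marking which rows of that group to keep.
keep_rows <- function(choice) {
  if (any(choice == "r")) {
    choice == "r"   # an "r" exists: keep only the "r" rows
  } else {
    choice == "o"   # no "r" in the group: keep the "o" rows
  }
}

only_o <- keep_rows(c("o", "o"))
mixed  <- keep_rows(c("o", "r"))
```

For a group holding only "o", every row is kept; for a mixed group, only the "r" row survives.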
Does this work:
library(dplyr)
dat1 %>% group_by(name ,year) %>% filter(all(choice == 'o' )|choice == 'r')
# A tibble: 7 x 3
# Groups: name, year [7]
#   name   year choice
#   <chr> <dbl> <chr>
# 1 X1     2015 o
# 2 X1     2016 o
# 3 X1     2017 o
# 4 X2     2015 o
# 5 X2     2016 r
# 6 X2     2017 r
# 7 X2     2018 o
library(data.table)
setDT(dat1)
dat1[, .SD[all(choice == "o") | choice == "r",], by = .(name, year)]
# name year choice
# 1: X1 2015 o
# 2: X1 2016 o
# 3: X1 2017 o
# 4: X2 2015 o
# 5: X2 2016 r
# 6: X2 2017 r
# 7: X2 2018 o
(I wrote this before looking at KarthikS's answer, but the logic and the result are the same.)
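Since the real data has millions of rows, it may be worth trying a variant that collects row numbers with `.I` and subsets once, instead of building `.SD` for every group. This is only a sketch under that assumption, not a benchmarked recommendation; the names `dat`, `idx` and `res` are illustrative:

```r
library(data.table)

# Same sample data as in the question, built directly as a data.table.
dat <- data.table(
  name   = c("X1", "X1", "X1", "X2", "X2", "X2", "X2", "X2", "X2"),
  year   = c(2015, 2016, 2017, 2015, 2016, 2016, 2017, 2017, 2018),
  choice = c("o", "o", "o", "o", "o", "r", "r", "o", "o")
)

# .I holds the row numbers of the current group in the full table.
# Keep every row of an all-"o" group, otherwise only the "r" rows.
idx <- dat[, .I[all(choice == "o") | choice == "r"], by = .(name, year)]$V1

# Subset the original table once using the collected row numbers.
res <- dat[idx]
```

The condition is the same as in the `.SD` version above, so the kept rows should match; only the way they are extracted differs.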
Another option is to convert the column to a factor with the levels specified in a custom order, and then select the first of the levels that remain after droplevels
library(dplyr)
dat1 %>%
group_by(name, year) %>%
filter(choice %in% levels(droplevels(factor(choice,
levels = c('r', 'o'))))[1])
# A tibble: 7 x 3
# Groups: name, year [7]
# name year choice
# <chr> <dbl> <chr>
#1 X1 2015 o
#2 X1 2016 o
#3 X1 2017 o
#4 X2 2015 o
#5 X2 2016 r
#6 X2 2017 r
#7 X2 2018 o
The equivalent option in data.table is
library(data.table)
setDT(dat1)[dat1[, .I[choice %in%
levels(droplevels(factor(choice,
levels = c('r', 'o'))))[1]], .(name, year)]$V1]
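To see why the factor approach picks the right rows, it can help to run the inner expression on a single group's values. A minimal sketch; `first_level` is a hypothetical wrapper introduced here for illustration only:

```r
# Hypothetical wrapper around the expression used in the filter above.
# Levels are ordered so that "r" outranks "o"; droplevels() removes any
# level that does not occur in the group, so [1] is the preferred value
# among those actually present.
first_level <- function(choice) {
  levels(droplevels(factor(choice, levels = c("r", "o"))))[1]
}

mixed_group <- first_level(c("o", "r"))  # both present: "r" wins
o_only      <- first_level(c("o", "o"))  # "r" level dropped: "o" is first
```

Filtering with `choice %in% first_level(choice)` inside each group then keeps only the rows equal to that preferred value. The same ordering idea would extend to more than two choices by listing them in priority order in `levels`.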