How to remove rows with single unique ID in R?
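I have a dataset coded as shown below. For each combination of treatment pair, year, month, and level, I have assigned a unique ID. Ideally, a complete "set" has two rows that share the same unique ID. If it does not, I want to remove those rows.
So here, every "set" has two rows per unique ID except the one corresponding to ID 2. My original dataset has thousands of rows like this. How can I scan it to remove these singletons?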
tmt.pair <- c("A","A","A","B","B","B","B")
tmt <- c("1000 C","4000 C","1000 C","1000 C","4000 C","1000 C","4000 C")
year <- c("2021","2021","2021","2021","2021","2020","2020")
month <- c("A","A","A","J","J","O","O")
level <- c("Low","Low","Up","Low","Low","Low","Low")
site <- c(1,1,2,1,1,1,1)
val <- rnorm(7,5,1)
df <- data.frame(tmt.pair, year,month, level,tmt,val)
df$ID <- cumsum(!duplicated(df[1:4]))
  tmt.pair year month level    tmt      val ID
1        A 2021     A   Low 1000 C 4.789715  1
2        A 2021     A   Low 4000 C 6.451113  1
3        A 2021     A    Up 1000 C 4.281171  2
4        B 2021     J   Low 1000 C 5.176668  3
5        B 2021     J   Low 4000 C 6.384432  3
6        B 2020     O   Low 1000 C 4.833731  4
7        B 2020     O   Low 4000 C 3.274355  4
Using base R, you can do this:
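# Count how many rows each ID has
tab <- table(df$ID)
# Look the counts up by name (not by position) so non-consecutive IDs also work
df[as.vector(tab[as.character(df$ID)]) > 1, ]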
Output:
  tmt.pair year month level    tmt      val ID
1        A 2021     A   Low 1000 C 5.156294  1
2        A 2021     A   Low 4000 C 4.395990  1
4        B 2021     J   Low 1000 C 5.714170  3
5        B 2021     J   Low 4000 C 6.075886  3
6        B 2020     O   Low 1000 C 7.249756  4
7        B 2020     O   Low 4000 C 5.197891  4
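The same filter can also be written without the lookup table. The following is a minimal sketch, not part of the original answer, that uses base R's ave() to attach the per-ID row count to every row. The fixed val numbers and the names n_per_id and kept are assumptions made only to keep the example reproducible.

```r
# Example data from the question; val is fixed here instead of rnorm() (assumption)
tmt.pair <- c("A","A","A","B","B","B","B")
tmt      <- c("1000 C","4000 C","1000 C","1000 C","4000 C","1000 C","4000 C")
year     <- c("2021","2021","2021","2021","2021","2020","2020")
month    <- c("A","A","A","J","J","O","O")
level    <- c("Low","Low","Up","Low","Low","Low","Low")
val      <- c(4.8, 6.5, 4.3, 5.2, 6.4, 4.8, 3.3)
df <- data.frame(tmt.pair, year, month, level, tmt, val)
df$ID <- cumsum(!duplicated(df[1:4]))

# For every row, the number of rows that share its ID
n_per_id <- ave(seq_along(df$ID), df$ID, FUN = length)

# Keep only rows whose ID occurs more than once
kept <- df[n_per_id > 1, ]
kept
```

Because ave() returns one value per row, the comparison gives a plain logical vector that can index df directly, whatever the ID values happen to be.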
Another option using data.table:
library(data.table)
setDT(df)[,if(.N > 1) .SD, by=ID]
Output:
   ID tmt.pair year month level    tmt      val
1:  1        A 2021     A   Low 1000 C 4.424811
2:  1        A 2021     A   Low 4000 C 4.556058
3:  3        B 2021     J   Low 1000 C 4.396996
4:  3        B 2021     J   Low 4000 C 3.906065
5:  4        B 2020     O   Low 1000 C 5.714706
6:  4        B 2020     O   Low 4000 C 4.891188
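Two details of the data.table call may matter in practice: setDT() converts df in place, and grouping with by = ID moves ID to the first column. A hedged variant, assuming the data.table package is installed, copies the data with as.data.table() and filters by row index with the .I idiom so the original column order is kept. The names dt, keep_rows, and kept, and the fixed val numbers, are illustrative assumptions.

```r
library(data.table)

# Example data from the question; val is fixed here instead of rnorm() (assumption)
df <- data.frame(
  tmt.pair = c("A","A","A","B","B","B","B"),
  year     = c("2021","2021","2021","2021","2021","2020","2020"),
  month    = c("A","A","A","J","J","O","O"),
  level    = c("Low","Low","Up","Low","Low","Low","Low"),
  tmt      = c("1000 C","4000 C","1000 C","1000 C","4000 C","1000 C","4000 C"),
  val      = c(4.8, 6.5, 4.3, 5.2, 6.4, 4.8, 3.3)
)
df$ID <- cumsum(!duplicated(df[1:4]))

# as.data.table() makes a copy, so df itself stays a plain data.frame
dt <- as.data.table(df)

# .I holds the row numbers of each group; keep them only when the group has > 1 row
keep_rows <- dt[, .I[.N > 1], by = ID]$V1

# Subset by row number, which preserves the original column order
kept <- dt[keep_rows]
kept
```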
Or, using dplyr, we keep only the IDs that have more than one observation:
library(dplyr)
df %>%
  group_by(ID) %>%
  filter(n() > 1)
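All three answers keep any ID with more than one row. The question describes a complete set as exactly two rows, so if IDs with three or more rows should also be dropped (the question does not say), the condition could be tightened to == 2. The sketch below is one way to do that in base R; it also groups on the four identifying columns directly rather than on a precomputed ID. The names key, n_per_key, and complete, and the fixed val numbers, are assumptions for illustration.

```r
# Example data from the question; val is fixed here instead of rnorm() (assumption)
df <- data.frame(
  tmt.pair = c("A","A","A","B","B","B","B"),
  year     = c("2021","2021","2021","2021","2021","2020","2020"),
  month    = c("A","A","A","J","J","O","O"),
  level    = c("Low","Low","Up","Low","Low","Low","Low"),
  tmt      = c("1000 C","4000 C","1000 C","1000 C","4000 C","1000 C","4000 C"),
  val      = c(4.8, 6.5, 4.3, 5.2, 6.4, 4.8, 3.3)
)

# One grouping key per observed combination of the four identifying columns
key <- interaction(df[c("tmt.pair", "year", "month", "level")], drop = TRUE)

# For every row, the number of rows that share its key
n_per_key <- ave(seq_len(nrow(df)), key, FUN = length)

# Keep only complete sets of exactly two rows
complete <- df[n_per_key == 2, ]
complete
```

On the example data this gives the same six rows as the > 1 filter, since no combination occurs more than twice.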