Delete duplicates and keep one row per ID depending on a variable and the earliest date
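I need to subset the data frame d and keep one row per ID. The row that is kept must contain I50 in either AD or BD, and among those rows only the one with the earliest date should remain.
So in the end we should get a data frame with two rows (ID 1 and 2), each with I50 in AD or BD and the earliest possible date, i.e. the dates 2007-12-12 and 2009-12-12.
I have tried a lot but could not find a solution.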
ID <- c(1,1,1,1,1,2,2,2,2,2)
AD <- c("DJ400", "DJ300", "DI501", "DI509", "DR409",
        "DI509", "DJ200", "DA300", "DI500", "DR209")
Date <- as.Date(c("2010-12-12", "2011-12-12", "2007-12-12", "2008-12-12", "2009-12-12",
                  "2011-12-12", "2012-12-12", "2008-12-12", "2009-12-12", "2010-12-12"))
BD <- c("DI509", "DI500", "DI401", "DI409", "DR609",
        "DI309", "DJ200", "DA300", "DI500", "DI509")
d <- data.frame(ID, AD, Date, BD)
hf <- subset(d, AD %in% "I50" | BD %in% "I50")
Created on 2022-01-10 by the reprex package (v2.0.0)
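A small sketch of why the `subset()` attempt above returns no rows: `%in%` tests whole-string equality, whereas `grepl()` looks for a pattern inside each string. The vector `codes` is a made-up illustration, not part of the original data.

```r
# Hypothetical example codes, chosen only for illustration
codes <- c("DJ400", "DI501", "DI509")

# %in% compares whole strings, so no element equals "I50"
in_exact <- codes %in% "I50"
in_exact
#> [1] FALSE FALSE FALSE

# grepl() finds "I50" anywhere in each string
in_substr <- grepl("I50", codes, fixed = TRUE)
in_substr
#> [1] FALSE  TRUE  TRUE
```

Here `fixed = TRUE` treats "I50" as a literal string rather than a regular expression; for this pattern the result is the same either way.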
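After trying the first solution I ran into a problem, so I made some small changes; here is the new reprex.
I need exactly one row per ID. The problem is that several rows share the same date, which I had not included before.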
ID <- c(1,1,1,1,1,2,2,2,2,2)
AD <- c("DJ400", "DJ300", "DI501", "DI509", "DR409",
        "DI509", "DJ200", "DA300", "DI500", "DR209")
Date <- as.Date(c("2010-12-12", "2011-12-12", "2010-12-12", "2012-12-12", "2009-12-12",
                  "2011-12-12", "2012-12-12", "2012-12-12", "2009-12-12", "2010-12-12"))
BD <- c("DI509", "DI500", "DI401", "DI409", "DR609",
        "DI309", "DJ200", "DA300", "DI500", "DI509")
d <- data.frame(ID, AD, Date, BD)
library(dplyr)
d %>%
  group_by(ID) %>%
  filter(if_any(c(AD, BD), ~ grepl("I50", .))) %>%
  slice_min(Date) %>%
  ungroup()
#> # A tibble: 3 x 4
#>      ID AD    Date       BD
#>   <dbl> <chr> <date>     <chr>
#> 1     1 DJ400 2010-12-12 DI509
#> 2     1 DI501 2010-12-12 DI401
#> 3     2 DI500 2009-12-12 DI500
Created on 2022-01-11 by the reprex package (v2.0.1)
Base R
d2 <- subset(d, grepl("I50", AD) | grepl("I50", BD))
do.call(rbind, lapply(split(d2, d2$ID), function(z) z[which.min(z$Date),]))
#   ID    AD       Date    BD
# 1  1 DI501 2007-12-12 DI401
# 2  2 DI500 2009-12-12 DI500
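As a possible base R alternative that follows the "delete duplicates" framing of the title, one can sort by ID and Date and then drop repeated IDs with `duplicated()`. This is only a sketch; it assumes the updated data from the question, with the mistyped date read as 2012-12-12. Because `order()` leaves ties in their original order, the first of the tied rows is the one that survives.

```r
# Updated data from the question (date typo assumed to mean 2012-12-12)
ID <- c(1,1,1,1,1,2,2,2,2,2)
AD <- c("DJ400", "DJ300", "DI501", "DI509", "DR409",
        "DI509", "DJ200", "DA300", "DI500", "DR209")
Date <- as.Date(c("2010-12-12", "2011-12-12", "2010-12-12", "2012-12-12", "2009-12-12",
                  "2011-12-12", "2012-12-12", "2012-12-12", "2009-12-12", "2010-12-12"))
BD <- c("DI509", "DI500", "DI401", "DI409", "DR609",
        "DI309", "DJ200", "DA300", "DI500", "DI509")
d <- data.frame(ID, AD, Date, BD)

# Keep rows with I50 in AD or BD
d2 <- subset(d, grepl("I50", AD) | grepl("I50", BD))

# Sort so the earliest date comes first within each ID
d2 <- d2[order(d2$ID, d2$Date), ]

# Keep only the first row of each ID
res <- d2[!duplicated(d2$ID), ]
res
```

The `which.min()` approach above behaves similarly with ties, since it returns only the index of the first minimum.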
dplyr
library(dplyr)
d %>%
  group_by(ID) %>%
  filter(if_any(c(AD, BD), ~ grepl("I50", .))) %>%
  slice_min(Date) %>%
  ungroup()
# # A tibble: 2 x 4
#      ID AD    Date       BD
#   <dbl> <chr> <date>     <chr>
# 1     1 DI501 2007-12-12 DI401
# 2     2 DI500 2009-12-12 DI500
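With the updated data, `slice_min()` returns both tied rows for ID 1 because its `with_ties` argument defaults to `TRUE`. A minimal sketch, assuming it does not matter which of the tied rows is kept, sets `with_ties = FALSE` so that each ID yields exactly one row. The data below is the updated reprex from the question, with the mistyped date read as 2012-12-12.

```r
library(dplyr)

# Updated data from the question (date typo assumed to mean 2012-12-12)
ID <- c(1,1,1,1,1,2,2,2,2,2)
AD <- c("DJ400", "DJ300", "DI501", "DI509", "DR409",
        "DI509", "DJ200", "DA300", "DI500", "DR209")
Date <- as.Date(c("2010-12-12", "2011-12-12", "2010-12-12", "2012-12-12", "2009-12-12",
                  "2011-12-12", "2012-12-12", "2012-12-12", "2009-12-12", "2010-12-12"))
BD <- c("DI509", "DI500", "DI401", "DI409", "DR609",
        "DI309", "DJ200", "DA300", "DI500", "DI509")
d <- data.frame(ID, AD, Date, BD)

one_per_id <- d %>%
  group_by(ID) %>%
  filter(if_any(c(AD, BD), ~ grepl("I50", .))) %>%
  # with_ties = FALSE returns a single row even when dates are tied
  slice_min(Date, n = 1, with_ties = FALSE) %>%
  ungroup()

one_per_id
```

If a specific tied row should win, one option is to sort with `arrange()` on a tie-breaking column before `slice_min()`, or to use `slice_head(n = 1)` after arranging by Date.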