Delete duplicates and keep one line depending on variable and earliest date

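I need to subset the data frame d and keep one row per ID. The retained row must contain I50 in either AD or BD, and among those rows only the one with the earliest date should be kept.

So in the end we should get a data frame with two rows (IDs 1 and 2), each with I50 in either AD or BD and the earliest possible date, i.e. the dates 2007-12-12 and 2009-12-12.

I have tried a lot but could not find a solution.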

ID <- c(1,1,1,1,1,2,2,2,2,2)
AD <- c("DJ400", "DJ300", "DI501", "DI509", "DR409",
          "DI509", "DJ200", "DA300", "DI500", "DR209")
Date <- as.Date(c("2010-12-12", "2011-12-12", "2007-12-12", "2008-12-12", "2009-12-12",
         "2011-12-12", "2012-12-12", "2008-12-12", "2009-12-12", "2010-12-12"))
BD <- c("DI509", "DI500", "DI401", "DI409", "DR609",
          "DI309", "DJ200", "DA300", "DI500", "DI509")
d <- data.frame(ID, AD, Date, BD)

# Attempt so far: returns zero rows
hf <- subset(d, AD %in% "I50" | BD %in% "I50")

Created on 2022-01-10 by the reprex package (v2.0.0)
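
The subset() attempt in the reprex above selects nothing, because %in% compares whole strings and none of the codes is exactly "I50". A minimal sketch of the difference, using a made-up vector named codes that is not part of the question's data:

```r
# Hypothetical example vector, for illustration only
codes <- c("DI501", "DJ400", "DI509")

# %in% tests whole-string equality, so no element matches "I50"
exact <- codes %in% "I50"

# grepl() looks for "I50" anywhere inside each string
partial <- grepl("I50", codes, fixed = TRUE)

print(exact)
#> [1] FALSE FALSE FALSE
print(partial)
#> [1]  TRUE FALSE  TRUE
```

So a substring match such as grepl() is probably what the filter needs here.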

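After trying the first solution I ran into a problem, so I made a few small changes; here is the new reprex. I need only one row per ID. The issue is that several rows share the same date, which I had not included before.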

ID <- c(1,1,1,1,1,2,2,2,2,2)
AD <- c("DJ400", "DJ300", "DI501", "DI509", "DR409",
        "DI509", "DJ200", "DA300", "DI500", "DR209")
Date <- as.Date(c("2010-12-12", "2011-12-12", "2010-12-12", "2012-12-12", "2009-12-12",
                  "2011-12-12", "2012-12-12", "2012-12-12", "2009-12-12", "2010-12-12"))
BD <- c("DI509", "DI500", "DI401", "DI409", "DR609",
        "DI309", "DJ200", "DA300", "DI500", "DI509")
d <- data.frame(ID, AD, Date, BD)

library(dplyr)

d %>%
  group_by(ID) %>%
  filter(if_any(c(AD, BD), ~ grepl("I50", .))) %>%
  slice_min(Date) %>%
  ungroup()
#> # A tibble: 3 x 4
#>      ID AD    Date       BD   
#>   <dbl> <chr> <date>     <chr>
#> 1     1 DJ400 2010-12-12 DI509
#> 2     1 DI501 2010-12-12 DI401
#> 3     2 DI500 2009-12-12 DI500

Created on 2022-01-11 by the reprex package (v2.0.1)

Base R

d2 <- subset(d, grepl("I50", AD) | grepl("I50", BD))
do.call(rbind, lapply(split(d2, d2$ID), function(z) z[which.min(z$Date),]))
#   ID    AD       Date    BD
# 1  1 DI501 2007-12-12 DI401
# 2  2 DI500 2009-12-12 DI500
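
Since the question is phrased as deleting duplicates, another base R route is to sort and then drop repeated IDs with duplicated(). This is only a sketch of an alternative, not a replacement for the code above; the name first_rows is introduced here for illustration, and the data is the first reprex from the question.

```r
# Data from the first reprex in the question
ID <- c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2)
AD <- c("DJ400", "DJ300", "DI501", "DI509", "DR409",
        "DI509", "DJ200", "DA300", "DI500", "DR209")
Date <- as.Date(c("2010-12-12", "2011-12-12", "2007-12-12", "2008-12-12", "2009-12-12",
                  "2011-12-12", "2012-12-12", "2008-12-12", "2009-12-12", "2010-12-12"))
BD <- c("DI509", "DI500", "DI401", "DI409", "DR609",
        "DI309", "DJ200", "DA300", "DI500", "DI509")
d <- data.frame(ID, AD, Date, BD)

# Keep only rows with "I50" in either column
d2 <- d[grepl("I50", d$AD) | grepl("I50", d$BD), ]

# Sort by ID, then by Date, so the earliest date comes first within each ID
d2 <- d2[order(d2$ID, d2$Date), ]

# duplicated() flags every row of an ID after its first one; drop those
first_rows <- d2[!duplicated(d2$ID), ]
print(first_rows)
```

Like which.min(), this keeps exactly one row per ID even when several rows share the earliest date.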

dplyr

library(dplyr)
d %>%
  group_by(ID) %>%
  filter(if_any(c(AD, BD), ~ grepl("I50", .))) %>%
  slice_min(Date) %>%
  ungroup()
# # A tibble: 2 x 4
#      ID AD    Date       BD   
#   <dbl> <chr> <date>     <chr>
# 1     1 DI501 2007-12-12 DI401
# 2     2 DI500 2009-12-12 DI500
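
With the updated data, slice_min() returns both rows for ID 1 because it keeps ties by default. A sketch of one way to get a single row per ID is to pass with_ties = FALSE. The fourth date is assumed here to be 2012-12-12, and the name one_per_id is introduced for illustration.

```r
library(dplyr)

# Updated data from the question; the fourth date is assumed to be 2012-12-12
ID <- c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2)
AD <- c("DJ400", "DJ300", "DI501", "DI509", "DR409",
        "DI509", "DJ200", "DA300", "DI500", "DR209")
Date <- as.Date(c("2010-12-12", "2011-12-12", "2010-12-12", "2012-12-12", "2009-12-12",
                  "2011-12-12", "2012-12-12", "2012-12-12", "2009-12-12", "2010-12-12"))
BD <- c("DI509", "DI500", "DI401", "DI409", "DR609",
        "DI309", "DJ200", "DA300", "DI500", "DI509")
d <- data.frame(ID, AD, Date, BD)

one_per_id <- d %>%
  group_by(ID) %>%
  filter(if_any(c(AD, BD), ~ grepl("I50", .))) %>%
  # with_ties = FALSE returns exactly n = 1 row per group, even if dates are tied
  slice_min(Date, n = 1, with_ties = FALSE) %>%
  ungroup()

print(one_per_id)
```

If it matters which of the tied rows survives, an explicit tie-breaker (for example arranging by another column first) would be needed; the base R which.min() version already returns only one row per ID.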