Identifying duplicates with overlapping dates

I have a dataset like the one below. dates1 and dates2 are the start and end dates of a program, respectively.

id <- c(1, 2, 3, 3, 4, 4, 5)
dates1 <- as.Date(c("2020-01-01", "2020-03-01", "2020-01-01", "2020-02-01", "2020-01-15", "2020-03-01", "2020-03-01"))
dates2 <- as.Date(c("2020-06-15", "2020-07-17", "2020-04-05","2020-05-06", "2020-02-25","2020-05-31", "2020-03-17"))

dfx <- data.frame(id, dates1, dates2)

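I am trying to identify all duplicate IDs in the dataset that have overlapping dates. From the dataset above, I only want to extract the rows for id 3, because it is the only id that is both duplicated and has overlapping date ranges.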

I would like the output to look like this:

id_dup <- c(3,3)
dates1_dup <- as.Date(c("2020-01-01", "2020-02-01"))
dates2_dup <- as.Date(c("2020-04-05","2020-05-06"))

dfx_dup <- data.frame(id_dup, dates1_dup, dates2_dup)

Any help is appreciated. Thanks!

Base R

dfx[ave(dfx$id, dfx$id,
    FUN = function(id) {
      any(with(dfx[dfx$id == id[1],],
               mapply(function(d1, d2) any(d1 > dates1 & d1 < dates2), dates1, dates2)))
    }) > 0,]
#   id     dates1     dates2
# 3  3 2020-01-01 2020-04-05
# 4  3 2020-02-01 2020-05-06

dplyr

library(dplyr)
dfx %>%
  group_by(id) %>%
  filter(
    any(mapply(function(d1, d2) any(d1 > dates1 & d1 < dates2), dates1, dates2))
  ) %>%
  ungroup()
# # A tibble: 2 x 3
#      id dates1     dates2    
#   <dbl> <date>     <date>    
# 1     3 2020-01-01 2020-04-05
# 2     3 2020-02-01 2020-05-06

Another dplyr solution

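dfx %>% 
  arrange(id, dates1) %>%  # lead() below assumes each id is sorted by start date
  group_by(id) %>% 
  # keep an id if any period ends on or after the next period starts
  filter(n() > 1 & any(dates2 >= lead(dates1), na.rm = TRUE))

#      id dates1     dates2    
#   <dbl> <date>     <date>    
# 1     3 2020-01-01 2020-04-05
# 2     3 2020-02-01 2020-05-06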
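
A self-join in base R (a sketch, not tested beyond the sample data)

If you want the overlap rule spelled out explicitly, one option is to pair every row with every other row that shares its id and test the standard interval condition: two periods overlap when each one starts on or before the other ends. This is only a sketch. The row column and the pairs and hit objects are helper names I made up for illustration, and I am assuming that periods which merely touch on the same day should count as overlapping (switch <= to < if they should not).

```r
# Rebuild the example data from the question.
dfx <- data.frame(
  id     = c(1, 2, 3, 3, 4, 4, 5),
  dates1 = as.Date(c("2020-01-01", "2020-03-01", "2020-01-01", "2020-02-01",
                     "2020-01-15", "2020-03-01", "2020-03-01")),
  dates2 = as.Date(c("2020-06-15", "2020-07-17", "2020-04-05", "2020-05-06",
                     "2020-02-25", "2020-05-31", "2020-03-17"))
)

# Hypothetical helper column so rows can be told apart after the join.
dfx$row <- seq_len(nrow(dfx))

# Pair every row with every row that has the same id, then drop self-pairs.
pairs <- merge(dfx, dfx, by = "id", suffixes = c(".a", ".b"))
pairs <- pairs[pairs$row.a != pairs$row.b, ]

# Closed intervals overlap when each starts on or before the other ends
# (assumption: touching endpoints count as an overlap).
hit <- pairs$dates1.a <= pairs$dates2.b & pairs$dates1.b <= pairs$dates2.a

# Keep the original rows that take part in at least one overlapping pair.
dfx_dup <- dfx[dfx$row %in% pairs$row.a[hit], c("id", "dates1", "dates2")]
dfx_dup
#   id     dates1     dates2
# 3  3 2020-01-01 2020-04-05
# 4  3 2020-02-01 2020-05-06
```

Unlike the lead() version, this does not depend on row order, and it only returns the rows that actually overlap with something rather than every row of the id. The cost is that the join grows with the square of the number of rows per id, so it is probably best suited to data where ids repeat only a few times.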