Identifying duplicates with overlapping dates
I have a dataset like the one below. dates1 and dates2 are the start and end dates of a program, respectively.
id <- c(1, 2, 3, 3, 4, 4, 5)
dates1 <- as.Date(c("2020-01-01", "2020-03-01", "2020-01-01", "2020-02-01", "2020-01-15", "2020-03-01", "2020-03-01"))
dates2 <- as.Date(c("2020-06-15", "2020-07-17", "2020-04-05","2020-05-06", "2020-02-25","2020-05-31", "2020-03-17"))
dfx <- data.frame(id, dates1, dates2)
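I am trying to identify all duplicated IDs in the dataset that have overlapping dates. From the dataset above, I only want to extract the rows for id 3, because it is the only id that is both duplicated and has overlapping dates.
I would like the output to look like this: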
id_dup <- c(3,3)
dates1_dup <- as.Date(c("2020-01-01", "2020-02-01"))
dates2_dup <- as.Date(c("2020-04-05","2020-05-06"))
dfx_dup <- data.frame(id_dup, dates1_dup, dates2_dup)
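Any help is appreciated. Thank you!
Base R
# For each id, check whether any start date falls strictly inside
# another interval of the same id; keep the rows of ids where it does.
dfx[ave(dfx$id, dfx$id,
        FUN = function(id) {
          any(with(dfx[dfx$id == id[1], ],
                   mapply(function(d1, d2) any(d1 > dates1 & d1 < dates2), dates1, dates2)))
        }) > 0, ]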
#   id     dates1     dates2
# 3  3 2020-01-01 2020-04-05
# 4  3 2020-02-01 2020-05-06
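The strict comparison above (`d1 > dates1 & d1 < dates2`) does not flag two rows of the same id that share the same start date, even though such intervals overlap. The following is a minimal base R sketch of a pairwise test, assuming "overlap" means that each interval starts before the other one ends. The helper name `has_overlap` is hypothetical and not part of the original answer.

```r
id <- c(1, 2, 3, 3, 4, 4, 5)
dates1 <- as.Date(c("2020-01-01", "2020-03-01", "2020-01-01", "2020-02-01",
                    "2020-01-15", "2020-03-01", "2020-03-01"))
dates2 <- as.Date(c("2020-06-15", "2020-07-17", "2020-04-05", "2020-05-06",
                    "2020-02-25", "2020-05-31", "2020-03-17"))
dfx <- data.frame(id, dates1, dates2)

# Hypothetical helper (an assumption, not from the original answer):
# TRUE when at least two intervals in the group overlap.
has_overlap <- function(start, end) {
  s <- as.numeric(start)
  e <- as.numeric(end)
  if (length(s) < 2) return(FALSE)
  # before[i, j] is TRUE when interval i starts before interval j ends.
  before <- outer(s, e, "<")
  # Intervals i and j overlap when each one starts before the other ends.
  overlap <- before & t(before)
  diag(overlap) <- FALSE  # ignore each interval compared with itself
  any(overlap)
}

# Evaluate the helper once per id and spread the result over that id's rows.
keep <- as.logical(ave(seq_len(nrow(dfx)), dfx$id,
                       FUN = function(i) has_overlap(dfx$dates1[i], dfx$dates2[i])))
dfx_dup <- dfx[keep, ]
```

With the sample data this keeps only the two rows for id 3. Replacing `"<"` with `"<="` would also treat intervals that merely touch at an endpoint as overlapping; which definition is right depends on the data.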
dplyr
library(dplyr)
dfx %>%
  group_by(id) %>%
  filter(
    any(mapply(function(d1, d2) any(d1 > dates1 & d1 < dates2), dates1, dates2))
  ) %>%
  ungroup()
# # A tibble: 2 x 3
#      id dates1     dates2
#   <dbl> <date>     <date>
# 1     3 2020-01-01 2020-04-05
# 2     3 2020-02-01 2020-05-06
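Another dplyr solution
# Assumes the rows within each id are already sorted by dates1:
# lead() compares each end date with the start date of the next row.
dfx %>%
  group_by(id) %>%
  filter(n() > 1 & all(dates2 >= lead(dates1), na.rm = TRUE))
#      id dates1     dates2
#   <dbl> <date>     <date>
# 1     3 2020-01-01 2020-04-05
# 2     3 2020-02-01 2020-05-06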
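The `lead()` comparison above looks only at consecutive rows, so it depends on the rows of each id being ordered by start date, and with `all()` it requires every consecutive pair to overlap. As a hedged sketch of a variant that does not rely on the input order, the function below sorts by start date and compares each start with the latest end date seen so far. The name `any_overlap_sorted` is hypothetical, and the sketch assumes every interval has a start date earlier than its end date.

```r
# Hypothetical helper (an assumption, not from the original answer):
# TRUE when at least two intervals overlap, regardless of input order.
any_overlap_sorted <- function(start, end) {
  o <- order(start)
  s <- as.numeric(start[o])
  e <- as.numeric(end[o])
  n <- length(s)
  if (n < 2) return(FALSE)
  # After sorting, an interval overlaps an earlier one exactly when it
  # starts before the latest end date among all earlier intervals.
  any(s[-1] < cummax(e)[-n])
}

# Three intervals for one id: the third overlaps the first but not the
# second, so a check limited to consecutive rows would miss that pair.
start <- as.Date(c("2020-01-01", "2020-01-10", "2020-02-01"))
end   <- as.Date(c("2020-03-01", "2020-01-20", "2020-02-10"))
res_three <- any_overlap_sorted(start, end)

# The two intervals of id 4 from the question do not overlap.
res_id4 <- any_overlap_sorted(as.Date(c("2020-01-15", "2020-03-01")),
                              as.Date(c("2020-02-25", "2020-05-31")))
```

Here `res_three` is TRUE and `res_id4` is FALSE. The helper could be used per group in the same way as the other answers, for example inside `ave()` or a grouped `filter()`.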