标记重复行的重叠时间段

mark overlapping time periods for duplicate rows

我无法识别重叠的日期。我有一个非常大的数据集,其中包含大量重复项。这是一个示例

A tibble: 20 x 9 (df)
   pat_id praktijk_id Geslacht Geboortedatum Inschrijfdatum Uitschrijfdatum inschrijf_first inschrijf_last Dup_N
    <int>       <int> <chr>    <chr>         <date>         <date>          <date>          <date>         <int>
 1  17895           5 V        2013-09-01    2013-09-24     2022-01-01      2013-10-01      2019-04-01        19
 2 303260          29 V        2013-09-01    2019-08-26     2022-01-01      2019-10-01      2020-10-01        19
 3 175084          53 M        1995-01-01    2019-08-28     2022-01-01      2019-10-01      2020-01-01       501
 4  14050          51 M        1995-01-01    2020-04-14     2022-01-01      2020-07-01      2020-10-01       501
 5 418695          47 M        1986-11-01    2001-06-29     2022-01-01      2014-07-01      2020-07-01       557
 6 242117          22 M        1986-11-01    2020-09-22     2022-01-01      2020-10-01      NA               557
 7 310296          24 M        1967-10-01    2013-08-30     2022-01-01      2014-01-01      2020-10-01       661
 8  17275          61 M        1967-10-01    2012-01-11     2013-09-01      2013-01-01      2013-07-01       661
 9 121166           5 M        1991-07-01    1999-09-01     2015-03-31      2006-01-01      2015-01-01       975
10 113649          27 M        1991-07-01    2018-04-18     2022-01-01      2018-07-01      2020-10-01       975
11 149016          66 V        1985-01-01    2015-05-12     2016-02-29      2015-07-01      2016-01-01      1067
12  47918          40 V        1985-01-01    2016-09-02     2022-01-01      2016-10-01      2020-10-01      1067
13 423021          44 V        1992-07-01    2014-03-03     2014-05-27      2014-04-01      NA              1079
14  31504          44 V        1992-07-01    2014-04-11     2016-02-18      2014-07-01      2016-01-01      1079
15 194200           4 M        1969-10-01    1996-09-30     2022-01-01      2006-01-01      2016-04-01      1195
16 261328           1 M        1969-10-01    2015-10-21     2016-01-19      2012-07-01      2016-01-01      1195
17 131040          39 V        1993-07-01    2006-09-06     2015-01-30      2014-01-01      2015-01-01      1352
18 321094          40 V        1993-07-01    2020-09-15     2022-01-01      2020-10-01      NA              1352
19 494575          39 V        1996-01-01    2013-10-23     2015-12-10      2014-01-01      2015-10-01      1355
20  58560          38 V        1996-01-01    2015-12-09     2022-01-01      2016-01-01      2020-10-01      1355


df <- structure(list(pat_id = c(17895L, 303260L, 175084L, 14050L, 418695L, 
242117L, 310296L, 17275L, 121166L, 113649L, 149016L, 47918L, 
423021L, 31504L, 194200L, 261328L, 131040L, 321094L, 494575L, 
58560L), praktijk_id = c(5L, 29L, 53L, 51L, 47L, 22L, 24L, 61L, 
5L, 27L, 66L, 40L, 44L, 44L, 4L, 1L, 39L, 40L, 39L, 38L), Geslacht = c("V", 
"V", "M", "M", "M", "M", "M", "M", "M", "M", "V", "V", "V", "V", 
"M", "M", "V", "V", "V", "V"), Geboortedatum = c("2013-09-01", 
"2013-09-01", "1995-01-01", "1995-01-01", "1986-11-01", "1986-11-01", 
"1967-10-01", "1967-10-01", "1991-07-01", "1991-07-01", "1985-01-01", 
"1985-01-01", "1992-07-01", "1992-07-01", "1969-10-01", "1969-10-01", 
"1993-07-01", "1993-07-01", "1996-01-01", "1996-01-01"), Inschrijfdatum = structure(c(15972, 
18134, 18136, 18366, 11502, 18527, 15947, 15350, 10835, 17639, 
16567, 17046, 16132, 16171, 9769, 16729, 13397, 18520, 16001, 
16778), class = "Date"), Uitschrijfdatum = structure(c(18993, 
18993, 18993, 18993, 18993, 18993, 18993, 15949, 16525, 18993, 
16860, 18993, 16217, 16849, 18993, 16819, 16465, 18993, 16779, 
18993), class = "Date"), inschrijf_first = structure(c(15979, 
18170, 18170, 18444, 16252, 18536, 16071, 15706, 13149, 17713, 
16617, 17075, 16161, 16252, 13149, 15522, 16071, 18536, 16071, 
16801), class = "Date"), inschrijf_last = structure(c(17987, 
18536, 18262, 18536, 18444, NA, 18536, 15887, 16436, 18536, 16801, 
18536, NA, 16801, 16892, 16801, 16436, NA, 16709, 18536), class = "Date"), 
    Dup_N = c(19L, 19L, 501L, 501L, 557L, 557L, 661L, 661L, 975L, 
    975L, 1067L, 1067L, 1079L, 1079L, 1195L, 1195L, 1352L, 1352L, 
    1355L, 1355L)), row.names = c(NA, -20L), class = c("tbl_df", 
"tbl", "data.frame"))

我想为每个副本 Dup_N.

确定列 inschrijf_first 和 inschrijf_last 之间的重叠时间段
|---- DateRange Dup_N duplicate (1) -----|                        _ 
_                          |---Date Range Dup_N duplicate (2) ----|
_                                          |---Date Range Dup_N duplicate (3) ----|
                                                                                 

如果日期 'touch',例如重复的第一行 inschrijf_first == inschrijf_last,则不视为重叠。 (见下文)

|---- DateRange Dup_N duplicate (1) -----|                        _ 
_                                        |---Date Range Dup_N duplicate (2) ----|

预期的输出将是一个添加的列,标记重复的 Dup_N 行以进行重叠。

我尝试了以下代码:它标记了重叠的时间段。但是它也标记了 'touching' 日期(如上所述)。

 sqldf("select 
    a.*, 
    count(b.rowid) > 0 as overlap, 
    coalesce(group_concat(b.rowid), '') as overlaps
  from df a
  left join df b on a.Dup_N = b.Dup_N and 
                     not a.rowid = b.rowid and
                     ((a.inschrijf_first between b.inschrijf_first and b.inschrijf_last) or
                     (b.inschrijf_first between a.inschrijf_first and a.inschrijf_last))
  group by a.rowid
  order by a.rowid")

标记具有重叠时间段的重复项后,我想合并 Dup_N 个没有重叠的重复项。

亲切的问候!

使用 tidyverse,这会为您提供带有标记的重叠行的列(仅显示最后 3 列)。

一些备注:

  • 这仅适用于 2 个重复项(2 行),因为它没有定义当一个重叠而另一个不重叠(超过 3 行)时该怎么做。
  • 没有定义范围(只有一个日期)。现在它正在返回 NA.
  • 不清楚如何“合并”non-overlapping 行。
library(dplyr)
library(tidyr)

df %>% 
  group_by(Dup_N) %>% 
  mutate( overlap=(lead(inschrijf_first)>inschrijf_first & 
                   lead(inschrijf_first)<inschrijf_last)|
                  (lead(inschrijf_last)>inschrijf_first & 
                   lead(inschrijf_last)<inschrijf_last) ) %>% 
  fill(overlap, .direction="downup") %>% 
  ungroup() %>%
  print(width=90)
# A tibble: 20 × 10
...    inschrijf_first inschrijf_last Dup_N overlap
...    <date>          <date>         <int> <lgl>  
...  1 2013-10-01      2019-04-01        19 FALSE  
...  2 2019-10-01      2020-10-01        19 FALSE  
...  3 2019-10-01      2020-01-01       501 FALSE  
...  4 2020-07-01      2020-10-01       501 FALSE  
...  5 2014-07-01      2020-07-01       557 NA     
...  6 2020-10-01      NA               557 NA     
...  7 2014-01-01      2020-10-01       661 FALSE  
...  8 2013-01-01      2013-07-01       661 FALSE  
...  9 2006-01-01      2015-01-01       975 FALSE  
... 10 2018-07-01      2020-10-01       975 FALSE  
... 11 2015-07-01      2016-01-01      1067 FALSE  
... 12 2016-10-01      2020-10-01      1067 FALSE  
... 13 2014-04-01      NA              1079 NA     
... 14 2014-07-01      2016-01-01      1079 NA     
... 15 2006-01-01      2016-04-01      1195 TRUE   
... 16 2012-07-01      2016-01-01      1195 TRUE   
... 17 2014-01-01      2015-01-01      1352 NA     
... 18 2020-10-01      NA              1352 NA     
... 19 2014-01-01      2015-10-01      1355 FALSE  
... 20 2016-01-01      2020-10-01      1355 FALSE