Separate overlaps in start and end ranges into their own data frame rows

I have this data:

start_data <- data.frame(stringsAsFactors=FALSE,
                          Person = c(1, 1, 1, 1),
                          Event = c(1, 2, 3, 4),
                          Var1 = c(1, 2, 3, 5),
                          Var2 = c(7, 8, 9, 6),
                          Var3 = c(13, 14, 15, 7),
                          Start_Date = c("1/01/2020", "5/01/2020", "21/01/2020", "23/01/2020"),
                          End_Date = c("10/01/2020", "20/01/2020", "30/01/2020", "25/01/2020")
)
 
start_data
  Person Event Var1 Var2 Var3 Start_Date   End_Date
1      1     1    1    7   13  1/01/2020 10/01/2020
2      1     2    2    8   14  5/01/2020 20/01/2020
3      1     3    3    9   15 21/01/2020 30/01/2020
4      1     4    5    6    7 23/01/2020 25/01/2020

I would like to transform it into this:

end_data <- data.frame(stringsAsFactors=FALSE,
                        Person = c(1, 1, 1, 1, 1, 1),
                        Event = c("1", "1 AND 2", "2", "3", "3 AND 4", "3"),
                        Var1 = c("1", "1 AND 2", "2", "3", "3 AND 5", "3"),
                        Var2 = c("7", "7 AND 8", "8", "9", "9 AND 6", "9"),
                        Var3 = c(13, 14, 14, 15, 15, 15),
                        Start_Date = c("1/01/2020", "5/01/2020", "11/01/2020", "21/01/2020",
                                       "23/01/2020", "26/01/2020"),
                        End_Date = c("4/01/2020", "10/01/2020", "20/01/2020", "22/01/2020",
                                     "25/01/2020", "30/01/2020")
)
 
end_data
  Person   Event    Var1    Var2 Var3 Start_Date   End_Date
1      1       1       1       7   13  1/01/2020  4/01/2020
2      1 1 AND 2 1 AND 2 7 AND 8   14  5/01/2020 10/01/2020
3      1       2       2       8   14 11/01/2020 20/01/2020
4      1       3       3       9   15 21/01/2020 22/01/2020
5      1 3 AND 4 3 AND 5 9 AND 6   15 23/01/2020 25/01/2020
6      1       3       3       9   15 26/01/2020 30/01/2020

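The code should expand the data frame to isolate overlapping Start_Date and End_Date ranges. Where ranges overlap, it should create a new row containing the overlapping data. As a result, no Start_Date to End_Date ranges in the final table should overlap each other. In addition, the results should be summarised by concatenating the values of Event, Var1 and Var2, while Var3 should be aggregated by taking the maximum value within the overlapping range.

Ideally, I would like to apply this code to many "Persons", so a solution using group_by() or nested dplyr would be best.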

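Edit:

For the case of 3 overlapping periods raised in the question below, it would look like this:

Edit 2:

@ekoam's solution is very close. However, it does not handle the following example. Event 5 spans the whole period, so there should be no missing ranges. Yet the range with 'start' "2017-05-17" and 'end' "2017-06-11" is missing.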

> trial_start_data <- data.frame(stringsAsFactors=FALSE,
+                          Person = c(1, 1, 1, 1),
+                          Event = c(5,6,7,8),
+                          Start_Date = as.Date(c("24/04/2017","09/05/2017","12/06/2017","21/06/2017"), "%d/%m/%Y"),
+                          End_Date = as.Date(c("28/09/2017","16/05/2017","21/06/2017","25/06/2017"), "%d/%m/%Y")
+ )
> 
> trial_start_data
  Person Event Start_Date   End_Date
1      1     5 2017-04-24 2017-09-28
2      1     6 2017-05-09 2017-05-16
3      1     7 2017-06-12 2017-06-21
4      1     8 2017-06-21 2017-06-25

> disjoint_subsets(trial_start_data$Start_Date, trial_start_data$End_Date)
       start        end
1 2017-04-24 2017-05-08
2 2017-05-09 2017-05-16
3 2017-06-12 2017-06-20
4 2017-06-21 2017-06-21
5 2017-06-22 2017-06-25
6 2017-06-26 2017-09-28

The main problem here is finding all disjoint subsets of a set of ranges in an efficient way. Consider this function

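disjoint_subsets <- function(starts, ends) {
  # Work in integer day offsets from the earliest start
  t1 <- min(starts)
  starts <- as.integer(starts - t1)
  ends <- as.integer(ends - t1) + 2L
  # Expand every range into one value per day (times 10), plus one extra slot on each side
  nvec <- ends - starts + 1L
  x <- sequence(nvec, starts) * 10L  # the `from` argument of sequence() needs R >= 4.0.0
  # Turn the extra slots into markers: ...9 just before a start, ...1 just after an end
  ends <- cumsum(nvec); starts <- ends - nvec + 1L
  x[ends] <- x[ends] - 9L; x[starts] <- x[starts] + 9L
  x <- sort(unique(x))
  # Marker positions; directly adjacent markers have no covered day between them, so that stretch is skipped
  b <- which(x %% 10L > 0L)
  lb <- x[head(b[!(b + 1L) %in% b], -1L)]
  ub <- x[tail(b[!(b - 1L) %in% b], -1L)]
  # Map markers back to day numbers: a ...9 marker closes a range on the previous day, a ...1 marker on its own day
  lb <- (lb + 9L * (lb %% 10L < 2L) + 1L) %/% 10L
  ub <- (ub - 9L * (ub %% 10L > 8L) + 1L) %/% 10L
  data.frame(start = lb + t1 - 1L, end = ub + t1 - 1L)
}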
Usage
> with(trial_start_data, disjoint_subsets(Start_Date, End_Date))
       start        end
1 2017-04-24 2017-05-08
2 2017-05-09 2017-05-16
3 2017-05-17 2017-06-11
4 2017-06-12 2017-06-20
5 2017-06-21 2017-06-21
6 2017-06-22 2017-06-25
7 2017-06-26 2017-09-28

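The function is somewhat slow, though, and there is still room for improvement. For a data frame with 100 thousand rows, it finds all disjoint subsets in about one second. For a data frame with 1 million rows, it takes 10-15 seconds to run. See the benchmark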

starts_e6 <- sample(Sys.Date() + -1000:1000, size = 1e6, replace = TRUE)
ends_e6 <- starts_e6 + sample.int(1000, 1e6, replace = TRUE)
starts_e5 <- sample(Sys.Date() + -1000:1000, size = 1e5, replace = TRUE)
ends_e5 <- starts_e5 + sample.int(1000, 1e5, replace = TRUE)
microbenchmark::microbenchmark(
  disjoint_subsets(starts_e6, ends_e6),
  disjoint_subsets(starts_e5, ends_e5), 
  times = 2L
)

Unit: milliseconds
                                 expr      min       lq      mean    median        uq       max neval cld
 disjoint_subsets(starts_e6, ends_e6) 11299.59 11299.59 11366.623 11366.623 11433.652 11433.652     2   b
 disjoint_subsets(starts_e5, ends_e5)   873.66   873.66  1028.057  1028.057  1182.455  1182.455     2  a 

base::unique is the bottleneck here. However, if we could somehow represent an interval with fewer elements, we could save a lot of time.
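One way to represent each interval with fewer elements might be to keep only its two breakpoints (the start day and the day after the end) and sweep over them. The sketch below is an assumption about how that could look, not part of the function above; the names disjoint_subsets_bp and trial are hypothetical, it assumes whole-day Date inputs, and it has not been benchmarked here.

```r
# Hypothetical alternative (not the answer's function): sweep over breakpoints only.
disjoint_subsets_bp <- function(starts, ends) {
  s <- as.integer(starts)       # first day of each range, as days since 1970-01-01
  e <- as.integer(ends) + 1L    # first day after each range
  cuts <- sort(unique(c(s, e))) # a new piece can only begin at one of these days
  n <- length(cuts)
  # +1 for every range opening at a cut, -1 for every range that has closed by then
  delta <- tabulate(match(s, cuts), n) - tabulate(match(e, cuts), n)
  covered <- cumsum(delta)[-n] > 0L  # keep pieces lying inside at least one range
  data.frame(
    start = as.Date(cuts[-n][covered], origin = "1970-01-01"),
    end   = as.Date(cuts[-1L][covered] - 1L, origin = "1970-01-01")
  )
}

# Sample data from the question
trial <- data.frame(
  Start_Date = as.Date(c("24/04/2017", "09/05/2017", "12/06/2017", "21/06/2017"), "%d/%m/%Y"),
  End_Date   = as.Date(c("28/09/2017", "16/05/2017", "21/06/2017", "25/06/2017"), "%d/%m/%Y")
)
res <- disjoint_subsets_bp(trial$Start_Date, trial$End_Date)
print(res)
```

On the sample data this yields seven pieces, including the 2017-05-17 to 2017-06-11 range. The vector passed to unique() holds two values per interval instead of one per day, which is where any saving would come from.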

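The rest is a piece of cake. You can use data.table::foverlaps to perform a non-equi join between start_data and all the disjoint subsets. Then summarise the joined data.table to get your desired end_data. For example,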

library(data.table)

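# Convert to a data.table by reference and make sure both date columns are of class Date
setDT(trial_start_data)[, c("Start_Date", "End_Date") := lapply(.SD, as.Date, "%d/%m/%Y"), .SDcols = c("Start_Date", "End_Date")]
# Compute the disjoint subsets and turn the result into a data.table as well
dsubs = trial_start_data[, disjoint_subsets(Start_Date, End_Date)]; setDT(dsubs)

# foverlaps() needs the interval columns to be the keys
setkey(dsubs, start, end)
setkey(trial_start_data, Start_Date, End_Date)

# Match every disjoint subset to each event whose range fully contains it
foverlaps(dsubs, trial_start_data, type = "within")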

Output

    Person Event Start_Date   End_Date      start        end
 1:      1     5 2017-04-24 2017-09-28 2017-04-24 2017-05-08
 2:      1     5 2017-04-24 2017-09-28 2017-05-09 2017-05-16
 3:      1     6 2017-05-09 2017-05-16 2017-05-09 2017-05-16
 4:      1     5 2017-04-24 2017-09-28 2017-05-17 2017-06-11
 5:      1     5 2017-04-24 2017-09-28 2017-06-12 2017-06-20
 6:      1     7 2017-06-12 2017-06-21 2017-06-12 2017-06-20
 7:      1     5 2017-04-24 2017-09-28 2017-06-21 2017-06-21
 8:      1     7 2017-06-12 2017-06-21 2017-06-21 2017-06-21
 9:      1     8 2017-06-21 2017-06-25 2017-06-21 2017-06-21
10:      1     5 2017-04-24 2017-09-28 2017-06-22 2017-06-25
11:      1     8 2017-06-21 2017-06-25 2017-06-22 2017-06-25
12:      1     5 2017-04-24 2017-09-28 2017-06-26 2017-09-28

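A data frame with only 100 thousand rows should not be a problem for any data.table function. This is also the most efficient approach I have been able to come up with so far. I will omit the remaining steps, as the answer is already quite long. Besides, I think my answer to one of your previous posts covers them.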
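For readers who want to see the omitted steps, here is a minimal sketch of how the summarising could look on the question's start_data. It is an assumption, not the code from the earlier post: the disjoint ranges are copied by hand from the question's end_data (in practice they would come from disjoint_subsets()), only one Person is handled, and the concatenation rule (" AND " for Event, Var1 and Var2, maximum for Var3) is read off the desired output.

```r
library(data.table)

start_data <- data.table(
  Person = 1, Event = 1:4,
  Var1 = c(1, 2, 3, 5), Var2 = c(7, 8, 9, 6), Var3 = c(13, 14, 15, 7),
  Start_Date = as.Date(c("1/01/2020", "5/01/2020", "21/01/2020", "23/01/2020"), "%d/%m/%Y"),
  End_Date   = as.Date(c("10/01/2020", "20/01/2020", "30/01/2020", "25/01/2020"), "%d/%m/%Y")
)

# Assumption: disjoint ranges typed in from the question's end_data,
# standing in for a call to disjoint_subsets()
dsubs <- data.table(
  start = as.Date(c("1/01/2020", "5/01/2020", "11/01/2020",
                    "21/01/2020", "23/01/2020", "26/01/2020"), "%d/%m/%Y"),
  end   = as.Date(c("4/01/2020", "10/01/2020", "20/01/2020",
                    "22/01/2020", "25/01/2020", "30/01/2020"), "%d/%m/%Y")
)

setkey(dsubs, start, end)
setkey(start_data, Start_Date, End_Date)
joined <- foverlaps(dsubs, start_data, type = "within")

# One row per disjoint range: concatenate identifiers, take the maximum of Var3
end_data <- joined[order(Event), .(
  Event = paste(Event, collapse = " AND "),
  Var1  = paste(Var1, collapse = " AND "),
  Var2  = paste(Var2, collapse = " AND "),
  Var3  = max(Var3)
), keyby = .(Person, start, end)]
setnames(end_data, c("start", "end"), c("Start_Date", "End_Date"))
print(end_data)
```

With several persons, the disjoint ranges would have to be computed per person before joining, for example by grouping on Person.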