Separate overlaps in start and end range into own row of data frame
I have this data:
start_data <- data.frame(stringsAsFactors=FALSE,
Person = c(1, 1, 1, 1),
Event = c(1, 2, 3, 4),
Var1 = c(1, 2, 3, 5),
Var2 = c(7, 8, 9, 6),
Var3 = c(13, 14, 15, 7),
Start_Date = c("1/01/2020", "5/01/2020", "21/01/2020", "23/01/2020"),
End_Date = c("10/01/2020", "20/01/2020", "30/01/2020", "25/01/2020")
)
start_data
Person Event Var1 Var2 Var3 Start_Date End_Date
1 1 1 1 7 13 1/01/2020 10/01/2020
2 1 2 2 8 14 5/01/2020 20/01/2020
3 1 3 3 9 15 21/01/2020 30/01/2020
4 1 4 5 6 7 23/01/2020 25/01/2020
I would like to transform it into this:
end_data <- data.frame(stringsAsFactors=FALSE,
Person = c(1, 1, 1, 1, 1, 1),
Event = c("1", "1 AND 2", "2", "3", "3 AND 4", "3"),
Var1 = c("1", "1 AND 2", "2", "3", "3 AND 5", "3"),
Var2 = c("7", "7 AND 8", "8", "9", "9 AND 6", "9"),
Var3 = c(13, 14, 14, 15, 15, 15),
Start_Date = c("1/01/2020", "5/01/2020", "11/01/2020", "21/01/2020",
"23/01/2020", "26/01/2020"),
End_Date = c("4/01/2020", "10/01/2020", "20/01/2020", "22/01/2020",
"25/01/2020", "30/01/2020")
)
end_data
Person Event Var1 Var2 Var3 Start_Date End_Date
1 1 1 1 7 13 1/01/2020 4/01/2020
2 1 1 AND 2 1 AND 2 7 AND 8 14 5/01/2020 10/01/2020
3 1 2 2 8 14 11/01/2020 20/01/2020
4 1 3 3 9 15 21/01/2020 22/01/2020
5 1 3 AND 4 3 AND 5 9 AND 6 15 23/01/2020 25/01/2020
6 1 3 3 9 15 26/01/2020 30/01/2020
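The code should expand the data frame to isolate overlapping Start_Date and End_Date ranges. Where ranges overlap, it should create a new row holding the overlapping data. As a result, no Start_Date/End_Date ranges in the final table should overlap one another. In addition, the results should be summarised by concatenating the values of Event, Var1 and Var2, while Var3 should be aggregated by taking the maximum within the overlapping range.
Ideally, I would like to apply this code to many "persons", so a solution using group_by() or nested dplyr would be preferable.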
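To make the "no overlapping ranges" requirement concrete, here is a minimal sketch of a check that could be run on any candidate result. The helper name `has_overlap` is hypothetical (it is not part of the question or of any package), and it assumes inclusive Date ranges for a single person.

```r
# Hypothetical helper (an assumption, not from the question):
# TRUE if any two inclusive ranges share at least one day
has_overlap <- function(starts, ends) {
  if (length(starts) < 2L) return(FALSE)
  o <- order(starts, ends)
  s <- as.numeric(starts[o])
  e <- as.numeric(ends[o])
  # Latest end date seen among all earlier ranges
  prev_max_end <- cummax(e)[-length(e)]
  any(s[-1L] <= prev_max_end)
}

fmt <- "%d/%m/%Y"

# Ranges from start_data: events 1 and 2 overlap, as do events 3 and 4
in_overlap <- has_overlap(
  as.Date(c("1/01/2020", "5/01/2020", "21/01/2020", "23/01/2020"), format = fmt),
  as.Date(c("10/01/2020", "20/01/2020", "30/01/2020", "25/01/2020"), format = fmt)
)

# Ranges from end_data: every range starts after the previous one ends
out_overlap <- has_overlap(
  as.Date(c("1/01/2020", "5/01/2020", "11/01/2020", "21/01/2020",
            "23/01/2020", "26/01/2020"), format = fmt),
  as.Date(c("4/01/2020", "10/01/2020", "20/01/2020", "22/01/2020",
            "25/01/2020", "30/01/2020"), format = fmt)
)
```

With the data above, `in_overlap` is TRUE and `out_overlap` is FALSE. For many persons, the same check would have to be applied within each person separately.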
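Edit:
For the case of three overlapping periods in the question below, it would look like this:
Edit 2:
@ekoam's solution is very close. However, it does not handle the following example. Event 5 spans the entire period, so there should be no missing ranges. Yet the range with 'start' 2017-05-17 and 'end' 2017-06-11 is missing.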
> trial_start_data <- data.frame(stringsAsFactors=FALSE,
+ Person = c(1, 1, 1, 1),
+ Event = c(5,6,7,8),
+ Start_Date = as.Date(c("24/04/2017","09/05/2017","12/06/2017","21/06/2017"), "%d/%m/%Y"),
+ End_Date = as.Date(c("28/09/2017","16/05/2017","21/06/2017","25/06/2017"), "%d/%m/%Y")
+ )
>
> trial_start_data
Person Event Start_Date End_Date
1 1 5 2017-04-24 2017-09-28
2 1 6 2017-05-09 2017-05-16
3 1 7 2017-06-12 2017-06-21
4 1 8 2017-06-21 2017-06-25
> disjoint_subsets(trial_start_data$Start_Date, trial_start_data$End_Date)
start end
1 2017-04-24 2017-05-08
2 2017-05-09 2017-05-16
3 2017-06-12 2017-06-20
4 2017-06-21 2017-06-21
5 2017-06-22 2017-06-25
6 2017-06-26 2017-09-28
The main problem here is finding all the disjoint subsets of a set of ranges efficiently. Consider this function:
disjoint_subsets <- function(starts, ends) {
t1 <- min(starts)
starts <- as.integer(starts - t1)
ends <- as.integer(ends - t1) + 2L
nvec <- ends - starts + 1L
x <- sequence(nvec, starts) * 10L
ends <- cumsum(nvec); starts <- ends - nvec + 1L
x[ends] <- x[ends] - 9L; x[starts] <- x[starts] + 9L
x <- sort(unique(x))
b <- which(x %% 10L > 0L)
lb <- x[head(b[!(b + 1L) %in% b], -1L)]
ub <- x[tail(b[!(b - 1L) %in% b], -1L)]
lb <- (lb + 9L * (lb %% 10L < 2L) + 1L) %/% 10L
ub <- (ub - 9L * (ub %% 10L > 8L) - 1L) %/% 10L
data.frame(start = lb + t1 - 1L, end = ub + t1 - 1L)
}
Usage
> with(trial_start_data, disjoint_subsets(Start_Date, End_Date))
start end
1 2017-04-24 2017-05-08
2 2017-05-09 2017-05-16
3 2017-05-17 2017-06-11
4 2017-06-12 2017-06-20
5 2017-06-21 2017-06-21
6 2017-06-22 2017-06-25
7 2017-06-26 2017-09-28
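The function is a bit slow, though, and there is still room for improvement. For a data frame with 100k rows, it finds all the disjoint subsets in about one second. For a data frame with 1 million rows, it takes roughly 10-15 seconds to run. See the benchmark: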
starts_e6 <- sample(Sys.Date() + -1000:1000, size = 1e6, replace = TRUE)
ends_e6 <- starts_e6 + sample.int(1000, 1e6, replace = TRUE)
starts_e5 <- sample(Sys.Date() + -1000:1000, size = 1e5, replace = TRUE)
ends_e5 <- starts_e5 + sample.int(1000, 1e5, replace = TRUE)
microbenchmark::microbenchmark(
disjoint_subsets(starts_e6, ends_e6),
disjoint_subsets(starts_e5, ends_e5),
times = 2L
)
Unit: milliseconds
expr min lq mean median uq max neval cld
disjoint_subsets(starts_e6, ends_e6) 11299.59 11299.59 11366.623 11366.623 11433.652 11433.652 2 b
disjoint_subsets(starts_e5, ends_e5) 873.66 873.66 1028.057 1028.057 1182.455 1182.455 2 a
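base::unique is the bottleneck here. However, if we could somehow use fewer elements to represent an interval, we could save a lot of time.
The rest is straightforward. You can use data.table::foverlaps to perform a non-equi join of start_data and all the disjoint subsets. Then, summarise the joined data.table to get your desired end_data. For example,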
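One way to use fewer elements per interval is to work only with the endpoints rather than enumerating every day. The following is a rough sketch of that idea, not the function from this answer and not benchmarked against it. The name `disjoint_breaks` is made up for illustration, and it assumes inclusive Date ranges with no missing values.

```r
# Hypothetical alternative (an assumption, not the answer's disjoint_subsets):
# split inclusive Date ranges into disjoint pieces using breakpoints only
disjoint_breaks <- function(starts, ends) {
  s <- as.numeric(starts)
  e <- as.numeric(ends)
  # A new piece can only begin at a start date or on the day after an end date
  cuts <- sort(unique(c(s, e + 1)))
  lb <- head(cuts, -1L)
  ub <- tail(cuts, -1L) - 1
  # Number of ranges that have started by lb, and number that ended before lb
  n_started <- findInterval(lb, sort(s))
  n_ended <- findInterval(lb, sort(e + 1))
  # Keep only pieces covered by at least one input range
  covered <- n_started > n_ended
  data.frame(
    start = as.Date(lb[covered], origin = "1970-01-01"),
    end = as.Date(ub[covered], origin = "1970-01-01")
  )
}

# Same ranges as trial_start_data
res <- disjoint_breaks(
  as.Date(c("2017-04-24", "2017-05-09", "2017-06-12", "2017-06-21")),
  as.Date(c("2017-09-28", "2017-05-16", "2017-06-21", "2017-06-25"))
)
res
```

On the trial data this yields seven pieces, including the 2017-05-17 to 2017-06-11 piece. Because it sorts only two values per range, its cost depends on the number of ranges rather than on their lengths in days.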
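library(data.table)
# Convert to data.table by reference and make sure both date columns are of class Date
setDT(trial_start_data)[, c("Start_Date", "End_Date") := lapply(.SD, as.Date, "%d/%m/%Y"), .SDcols = c("Start_Date", "End_Date")]
# Compute the disjoint pieces and convert the result to a data.table as well
dsubs = trial_start_data[, disjoint_subsets(Start_Date, End_Date)]; setDT(dsubs)
# Key both tables on their interval columns for foverlaps()
setkey(dsubs, start, end)
setkey(trial_start_data, Start_Date, End_Date)
# Match each disjoint piece to every original range that fully contains it
foverlaps(dsubs, trial_start_data, type = "within")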
Output
Person Event Start_Date End_Date start end
1: 1 5 2017-04-24 2017-09-28 2017-04-24 2017-05-08
2: 1 5 2017-04-24 2017-09-28 2017-05-09 2017-05-16
3: 1 6 2017-05-09 2017-05-16 2017-05-09 2017-05-16
4: 1 5 2017-04-24 2017-09-28 2017-05-17 2017-06-11
5: 1 5 2017-04-24 2017-09-28 2017-06-12 2017-06-20
6: 1 7 2017-06-12 2017-06-21 2017-06-12 2017-06-20
7: 1 5 2017-04-24 2017-09-28 2017-06-21 2017-06-21
8: 1 7 2017-06-12 2017-06-21 2017-06-21 2017-06-21
9: 1 8 2017-06-21 2017-06-25 2017-06-21 2017-06-21
10: 1 5 2017-04-24 2017-09-28 2017-06-22 2017-06-25
11: 1 8 2017-06-21 2017-06-25 2017-06-22 2017-06-25
12: 1 5 2017-04-24 2017-09-28 2017-06-26 2017-09-28
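A data frame with only 100k rows should not be a problem for any data.table function. This is also the most efficient approach I can think of so far. I will omit the remaining steps, since the answer is already quite long. Besides, I think my answer to one of your previous posts covers them.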
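For readers who want a starting point for the omitted summarising step, it might look roughly like the sketch below. This is an assumption about what that step involves, not code from the answer. The small `joined` table is hand-copied from the first three rows of the foverlaps() output above, and the name `summarised` is made up for illustration.

```r
library(data.table)

# Hand-copied subset of the foverlaps() output (rows 1-3), for illustration only
joined <- data.table(
  Person = c(1, 1, 1),
  Event  = c(5, 5, 6),
  start  = as.Date(c("2017-04-24", "2017-05-09", "2017-05-09")),
  end    = as.Date(c("2017-05-08", "2017-05-16", "2017-05-16"))
)

# One row per person and disjoint piece; events sharing a piece are pasted together
summarised <- joined[
  , .(Event = paste(Event, collapse = " AND ")),
  by = .(Person, start, end)
]
summarised
```

For data shaped like start_data, the same `by` clause could presumably also hold `Var1 = paste(Var1, collapse = " AND ")` and `Var3 = max(Var3)`, in line with the aggregation the question asks for.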