如何展平/合并重叠的时间段
How to flatten / merge overlapping time periods
我有一个大的时间段数据集,由 'start' 和 'end' 列定义。有些时期重叠。
我想合并(展平/合并/折叠)所有重叠的时间段以获得一个 'start' 值和一个 'end' 值。
一些示例数据:
ID start end
1 A 2013-01-01 2013-01-05
2 A 2013-01-01 2013-01-05
3 A 2013-01-02 2013-01-03
4 A 2013-01-04 2013-01-06
5 A 2013-01-07 2013-01-09
6 A 2013-01-08 2013-01-11
7 A 2013-01-12 2013-01-15
想要的结果:
ID start end
1 A 2013-01-01 2013-01-06
2 A 2013-01-07 2013-01-11
3 A 2013-01-12 2013-01-15
我尝试过的:
require(dplyr)
data <- structure(list(ID = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L), class = "factor", .Label = "A"),
start = structure(c(1356998400, 1356998400, 1357084800, 1357257600,
1357516800, 1357603200, 1357948800), tzone = "UTC", class = c("POSIXct",
"POSIXt")), end = structure(c(1357344000, 1357344000, 1357171200,
1357430400, 1357689600, 1357862400, 1358208000), tzone = "UTC", class = c("POSIXct",
"POSIXt"))), .Names = c("ID", "start", "end"), row.names = c(NA,
-7L), class = "data.frame")
remove.overlaps <- function(data){
data2 <- data
for ( i in 1:length(unique(data$start))) {
x3 <- filter(data2, start>=data$start[i] & start<=data$end[i])
x4 <- x3[1,]
x4$end <- max(x3$end)
data2 <- filter(data2, start<data$start[i] | start>data$end[i])
data2 <- rbind(data2,x4)
}
data2 <- na.omit(data2)}
data <- remove.overlaps(data)
这是一个可能的解决方案。这里的基本思想是使用 cummax
函数将滞后 start
日期与最大结束日期 "until now" 进行比较,并创建一个索引将数据分成组
data %>%
arrange(ID, start) %>% # as suggested by @Jonno in case the data is unsorted
group_by(ID) %>%
mutate(indx = c(0, cumsum(as.numeric(lead(start)) >
cummax(as.numeric(end)))[-n()])) %>%
group_by(ID, indx) %>%
summarise(start = first(start), end = last(end))
# Source: local data frame [3 x 4]
# Groups: ID
#
# ID indx start end
# 1 A 0 2013-01-01 2013-01-06
# 2 A 1 2013-01-07 2013-01-11
# 3 A 2 2013-01-12 2013-01-15
@David Arenburg 的回答很好 - 但我 运行 遇到了一个问题,即较早的间隔在较晚的间隔之后结束 - 但是在 summarise
调用中使用 last
导致了错误的结束日期。我建议将 first(start)
和 last(end)
更改为 min(start)
和 max(end)
data %>%
group_by(ID) %>%
mutate(indx = c(0, cumsum(as.numeric(lead(start)) >
cummax(as.numeric(end)))[-n()])) %>%
group_by(ID, indx) %>%
summarise(start = min(start), end = max(end))
此外,正如@Jonno Bourne 提到的,在应用该方法之前按 start
和任何分组变量排序很重要。
为了完整性,the IRanges
package on Bioconductor 有一些简洁的函数可用于处理日期或日期时间范围。其中之一是合并重叠或相邻范围的 reduce()
函数。
但是,有一个缺点,因为 IRanges
适用于整数范围(因此得名),所以使用 IRanges
函数的便利性是以转换 Date
或POSIXct
对象来回。
此外,似乎 dplyr
与 IRanges
配合得不好(至少根据我对 dplyr
的有限经验判断)所以我使用 data.table
:
library(data.table)
options(datatable.print.class = TRUE)
library(IRanges)
library(lubridate)
setDT(data)[, {
ir <- reduce(IRanges(as.numeric(start), as.numeric(end)))
.(start = as_datetime(start(ir)), end = as_datetime(end(ir)))
}, by = ID]
ID start end
<fctr> <POSc> <POSc>
1: A 2013-01-01 2013-01-06
2: A 2013-01-07 2013-01-11
3: A 2013-01-12 2013-01-15
代码变体是
setDT(data)[, as.data.table(reduce(IRanges(as.numeric(start), as.numeric(end))))[
, lapply(.SD, as_datetime), .SDcols = -"width"],
by = ID]
在这两种变体中,使用了 lubridate
包中的 as_datetime()
,在将数字转换为 POSIXct
对象时,它可以指定原点。
看到 IRanges
方法与 方法的基准比较会很有趣。
看来我参加派对有点晚了,但我采用了@zach 的代码并使用下面的 data.table
重新编写了它。我没有进行全面测试,但这似乎 运行 比 tidy
版本快了大约 20%。 (我无法测试 IRange
方法,因为该包还不适用于 R 3.5.1)
另外,首先,接受的答案没有捕捉到一个日期范围完全在另一个日期范围内的边缘情况(例如,2018-07-07
到 2017-07-14
在 2018-05-01
到2018-12-01
)。 @zach 的回答确实抓住了这种边缘情况。
library(data.table)
start_col = c("2018-01-01","2018-03-01","2018-03-10","2018-03-20","2018-04-10","2018-05-01","2018-05-05","2018-05-10","2018-07-07")
end_col = c("2018-01-21","2018-03-21","2018-03-31","2018-04-09","2018-04-30","2018-05-21","2018-05-26","2018-05-30","2018-07-14")
# create fake data, double it, add ID
# change row 17, such that each ID grouping is a little different
# also adds an edge case in which one date range is totally within another
# (this is the edge case not currently captured by the accepted answer)
d <- data.table(start_col = as.Date(start_col), end_col = as.Date(end_col))
d2<- rbind(d,d)
d2[1:(.N/2), ID := 1]
d2[(.N/2 +1):.N, ID := 2]
d2[17,end_col := as.Date('2018-12-01')]
# set keys (also orders)
setkey(d2, ID, start_col, end_col)
# get rid of overlapping transactions and do the date math
squished <- d2[,.(START_DT = start_col,
END_DT = end_col,
indx = c(0, cumsum(as.numeric(lead(start_col)) > cummax(as.numeric(end_col)))[-.N])),
keyby=ID
][,.(start=min(START_DT),
end = max(END_DT)),
by=c("ID","indx")
]
我认为你可以用 dplyr 和 ivs 包很好地解决这个问题,它是为使用 interval vectors 而设计的,就像你拥有的一样这里。它受到 IRanges 的启发,但更适合在 tidyverse 中使用,并且是完全通用的,因此它可以自动处理日期间隔(无需转换为数字并返回)。
关键是将start/end个边界组合成一个区间向量列,然后使用iv_groups()
。这会合并区间向量中的所有重叠区间和 returns 合并重叠后剩余的区间。
看来你想按ID来做,所以我也按ID分组了。
library(ivs)
library(dplyr)
data <- tribble(
~ID, ~start, ~end,
"A", "2013-01-01", "2013-01-05",
"A", "2013-01-01", "2013-01-05",
"A", "2013-01-02", "2013-01-03",
"A", "2013-01-04", "2013-01-06",
"A", "2013-01-07", "2013-01-09",
"A", "2013-01-08", "2013-01-11",
"A", "2013-01-12", "2013-01-15"
) %>%
mutate(
start = as.Date(start),
end = as.Date(end)
)
data
#> # A tibble: 7 × 3
#> ID start end
#> <chr> <date> <date>
#> 1 A 2013-01-01 2013-01-05
#> 2 A 2013-01-01 2013-01-05
#> 3 A 2013-01-02 2013-01-03
#> 4 A 2013-01-04 2013-01-06
#> 5 A 2013-01-07 2013-01-09
#> 6 A 2013-01-08 2013-01-11
#> 7 A 2013-01-12 2013-01-15
# Combine `start` and `end` into a single interval vector column
data <- data %>%
mutate(interval = iv(start, end), .keep = "unused")
# Note that this is a half-open interval!
data
#> # A tibble: 7 × 2
#> ID interval
#> <chr> <iv<date>>
#> 1 A [2013-01-01, 2013-01-05)
#> 2 A [2013-01-01, 2013-01-05)
#> 3 A [2013-01-02, 2013-01-03)
#> 4 A [2013-01-04, 2013-01-06)
#> 5 A [2013-01-07, 2013-01-09)
#> 6 A [2013-01-08, 2013-01-11)
#> 7 A [2013-01-12, 2013-01-15)
# It seems like you'd want to group by ID, so lets do that.
# Then we use `iv_groups()` which merges all overlapping intervals and returns
# the intervals that remain after all the overlaps have been merged
data %>%
group_by(ID) %>%
summarise(interval = iv_groups(interval), .groups = "drop")
#> # A tibble: 3 × 2
#> ID interval
#> <chr> <iv<date>>
#> 1 A [2013-01-01, 2013-01-06)
#> 2 A [2013-01-07, 2013-01-11)
#> 3 A [2013-01-12, 2013-01-15)
由 reprex package (v2.0.1)
于 2022-04-05 创建
我有一个大的时间段数据集,由 'start' 和 'end' 列定义。有些时期重叠。
我想合并(展平/合并/折叠)所有重叠的时间段以获得一个 'start' 值和一个 'end' 值。
一些示例数据:
ID start end
1 A 2013-01-01 2013-01-05
2 A 2013-01-01 2013-01-05
3 A 2013-01-02 2013-01-03
4 A 2013-01-04 2013-01-06
5 A 2013-01-07 2013-01-09
6 A 2013-01-08 2013-01-11
7 A 2013-01-12 2013-01-15
想要的结果:
ID start end
1 A 2013-01-01 2013-01-06
2 A 2013-01-07 2013-01-11
3 A 2013-01-12 2013-01-15
我尝试过的:
require(dplyr)
data <- structure(list(ID = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L), class = "factor", .Label = "A"),
start = structure(c(1356998400, 1356998400, 1357084800, 1357257600,
1357516800, 1357603200, 1357948800), tzone = "UTC", class = c("POSIXct",
"POSIXt")), end = structure(c(1357344000, 1357344000, 1357171200,
1357430400, 1357689600, 1357862400, 1358208000), tzone = "UTC", class = c("POSIXct",
"POSIXt"))), .Names = c("ID", "start", "end"), row.names = c(NA,
-7L), class = "data.frame")
remove.overlaps <- function(data){
data2 <- data
for ( i in 1:length(unique(data$start))) {
x3 <- filter(data2, start>=data$start[i] & start<=data$end[i])
x4 <- x3[1,]
x4$end <- max(x3$end)
data2 <- filter(data2, start<data$start[i] | start>data$end[i])
data2 <- rbind(data2,x4)
}
data2 <- na.omit(data2)}
data <- remove.overlaps(data)
这是一个可能的解决方案。这里的基本思想是使用 cummax
函数将滞后 start
日期与最大结束日期 "until now" 进行比较,并创建一个索引将数据分成组
data %>%
arrange(ID, start) %>% # as suggested by @Jonno in case the data is unsorted
group_by(ID) %>%
mutate(indx = c(0, cumsum(as.numeric(lead(start)) >
cummax(as.numeric(end)))[-n()])) %>%
group_by(ID, indx) %>%
summarise(start = first(start), end = last(end))
# Source: local data frame [3 x 4]
# Groups: ID
#
# ID indx start end
# 1 A 0 2013-01-01 2013-01-06
# 2 A 1 2013-01-07 2013-01-11
# 3 A 2 2013-01-12 2013-01-15
@David Arenburg 的回答很好 - 但我 运行 遇到了一个问题,即较早的间隔在较晚的间隔之后结束 - 但是在 summarise
调用中使用 last
导致了错误的结束日期。我建议将 first(start)
和 last(end)
更改为 min(start)
和 max(end)
data %>%
group_by(ID) %>%
mutate(indx = c(0, cumsum(as.numeric(lead(start)) >
cummax(as.numeric(end)))[-n()])) %>%
group_by(ID, indx) %>%
summarise(start = min(start), end = max(end))
此外,正如@Jonno Bourne 提到的,在应用该方法之前按 start
和任何分组变量排序很重要。
为了完整性,the IRanges
package on Bioconductor 有一些简洁的函数可用于处理日期或日期时间范围。其中之一是合并重叠或相邻范围的 reduce()
函数。
但是,有一个缺点,因为 IRanges
适用于整数范围(因此得名),所以使用 IRanges
函数的便利性是以转换 Date
或POSIXct
对象来回。
此外,似乎 dplyr
与 IRanges
配合得不好(至少根据我对 dplyr
的有限经验判断)所以我使用 data.table
:
library(data.table)
options(datatable.print.class = TRUE)
library(IRanges)
library(lubridate)
setDT(data)[, {
ir <- reduce(IRanges(as.numeric(start), as.numeric(end)))
.(start = as_datetime(start(ir)), end = as_datetime(end(ir)))
}, by = ID]
ID start end <fctr> <POSc> <POSc> 1: A 2013-01-01 2013-01-06 2: A 2013-01-07 2013-01-11 3: A 2013-01-12 2013-01-15
代码变体是
setDT(data)[, as.data.table(reduce(IRanges(as.numeric(start), as.numeric(end))))[
, lapply(.SD, as_datetime), .SDcols = -"width"],
by = ID]
在这两种变体中,使用了 lubridate
包中的 as_datetime()
,在将数字转换为 POSIXct
对象时,它可以指定原点。
看到 IRanges
方法与
看来我参加派对有点晚了,但我采用了@zach 的代码并使用下面的 data.table
重新编写了它。我没有进行全面测试,但这似乎 运行 比 tidy
版本快了大约 20%。 (我无法测试 IRange
方法,因为该包还不适用于 R 3.5.1)
另外,首先,接受的答案没有捕捉到一个日期范围完全在另一个日期范围内的边缘情况(例如,2018-07-07
到 2017-07-14
在 2018-05-01
到2018-12-01
)。 @zach 的回答确实抓住了这种边缘情况。
library(data.table)
start_col = c("2018-01-01","2018-03-01","2018-03-10","2018-03-20","2018-04-10","2018-05-01","2018-05-05","2018-05-10","2018-07-07")
end_col = c("2018-01-21","2018-03-21","2018-03-31","2018-04-09","2018-04-30","2018-05-21","2018-05-26","2018-05-30","2018-07-14")
# create fake data, double it, add ID
# change row 17, such that each ID grouping is a little different
# also adds an edge case in which one date range is totally within another
# (this is the edge case not currently captured by the accepted answer)
d <- data.table(start_col = as.Date(start_col), end_col = as.Date(end_col))
d2<- rbind(d,d)
d2[1:(.N/2), ID := 1]
d2[(.N/2 +1):.N, ID := 2]
d2[17,end_col := as.Date('2018-12-01')]
# set keys (also orders)
setkey(d2, ID, start_col, end_col)
# get rid of overlapping transactions and do the date math
squished <- d2[,.(START_DT = start_col,
END_DT = end_col,
indx = c(0, cumsum(as.numeric(lead(start_col)) > cummax(as.numeric(end_col)))[-.N])),
keyby=ID
][,.(start=min(START_DT),
end = max(END_DT)),
by=c("ID","indx")
]
我认为你可以用 dplyr 和 ivs 包很好地解决这个问题,它是为使用 interval vectors 而设计的,就像你拥有的一样这里。它受到 IRanges 的启发,但更适合在 tidyverse 中使用,并且是完全通用的,因此它可以自动处理日期间隔(无需转换为数字并返回)。
关键是将start/end个边界组合成一个区间向量列,然后使用iv_groups()
。这会合并区间向量中的所有重叠区间和 returns 合并重叠后剩余的区间。
看来你想按ID来做,所以我也按ID分组了。
library(ivs)
library(dplyr)
data <- tribble(
~ID, ~start, ~end,
"A", "2013-01-01", "2013-01-05",
"A", "2013-01-01", "2013-01-05",
"A", "2013-01-02", "2013-01-03",
"A", "2013-01-04", "2013-01-06",
"A", "2013-01-07", "2013-01-09",
"A", "2013-01-08", "2013-01-11",
"A", "2013-01-12", "2013-01-15"
) %>%
mutate(
start = as.Date(start),
end = as.Date(end)
)
data
#> # A tibble: 7 × 3
#> ID start end
#> <chr> <date> <date>
#> 1 A 2013-01-01 2013-01-05
#> 2 A 2013-01-01 2013-01-05
#> 3 A 2013-01-02 2013-01-03
#> 4 A 2013-01-04 2013-01-06
#> 5 A 2013-01-07 2013-01-09
#> 6 A 2013-01-08 2013-01-11
#> 7 A 2013-01-12 2013-01-15
# Combine `start` and `end` into a single interval vector column
data <- data %>%
mutate(interval = iv(start, end), .keep = "unused")
# Note that this is a half-open interval!
data
#> # A tibble: 7 × 2
#> ID interval
#> <chr> <iv<date>>
#> 1 A [2013-01-01, 2013-01-05)
#> 2 A [2013-01-01, 2013-01-05)
#> 3 A [2013-01-02, 2013-01-03)
#> 4 A [2013-01-04, 2013-01-06)
#> 5 A [2013-01-07, 2013-01-09)
#> 6 A [2013-01-08, 2013-01-11)
#> 7 A [2013-01-12, 2013-01-15)
# It seems like you'd want to group by ID, so lets do that.
# Then we use `iv_groups()` which merges all overlapping intervals and returns
# the intervals that remain after all the overlaps have been merged
data %>%
group_by(ID) %>%
summarise(interval = iv_groups(interval), .groups = "drop")
#> # A tibble: 3 × 2
#> ID interval
#> <chr> <iv<date>>
#> 1 A [2013-01-01, 2013-01-06)
#> 2 A [2013-01-07, 2013-01-11)
#> 3 A [2013-01-12, 2013-01-15)
由 reprex package (v2.0.1)
于 2022-04-05 创建