Find all date ranges for overlapping start and end dates in R
I have a data frame that looks like this:
w<-read.table(header=TRUE,text="
start.date end.date
2006-06-26 2006-07-24
2006-07-19 2006-08-16
2007-06-09 2007-07-07
2007-06-24 2007-07-22
2007-07-03 2007-07-31
2007-08-04 2007-09-01
2007-08-07 2007-09-04
2007-09-05 2007-10-03
2007-09-14 2007-10-12
2007-10-19 2007-11-16
2007-11-17 2007-12-15
2008-06-18 2008-07-16
2008-06-28 2008-07-26
2008-07-11 2008-08-08
2008-07-23 2008-08-20")
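I am trying to get output in which overlapping start and end dates are merged into a single date range. So for the example set, I would like to get: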
w<-read.table(header=TRUE,text="
start.date end.date
2006-06-26 2006-08-16
2007-06-09 2007-07-31
2007-08-04 2007-09-04
2007-09-05 2007-10-12
2007-10-19 2007-11-16
2007-11-17 2007-12-15
2008-06-18 2008-08-20")
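This question is similar to another one, but I do not need any grouping for my problem, so the answers there are confusing.
Also, the code suggested in answers to my question does not work for some parts of my data frame, for example: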
x<-read.table(header=TRUE,text="start.date end.date
2006-01-19 2006-01-20
2006-01-25 2006-01-29
2006-02-24 2006-02-25
2006-03-15 2006-03-22
2006-04-29 2006-04-30
2006-05-24 2006-05-25
2006-06-26 2006-08-16
2006-07-05 2006-07-10
2006-07-12 2006-07-21
2006-08-13 2006-08-15
2006-08-18 2006-08-19
2006-08-28 2006-09-02")
I am confused as to why it does not work there.
Try this:
w[] <- lapply(w, function(x) as.Date(x, '%Y-%m-%d'))
w <- w[order(w$start.date),] # sort the data by start date if not already sorted
w$group <- 1:nrow(w) # overlapping intervals should end up in the same group
merge.indices <- lapply(2:nrow(w), function(x) {
  # earlier rows whose end date is on or after the start date of row x
  indices <- which(findInterval(w$end.date[1:(x-1)], w$start.date[x])==1)
  if (length(indices) > 0) indices <- c(indices, x)
  indices})
# assign the intervals to the right groups
for (i in 1:length(merge.indices)) {
  if (length(merge.indices[[i]]) > 0) {
    w$group[merge.indices[[i]]] <- min(w$group[merge.indices[[i]]])
  }
}
do.call(rbind, lapply(split(w, w$group), function(x) data.frame(start.date=min(x[,1]), end.date=max(x[,2]))))
Conceptually, it merges overlapping intervals into the same group.
Output:
start.date end.date
1 2006-01-19 2006-01-20
2 2006-01-25 2006-01-29
3 2006-02-24 2006-02-25
4 2006-03-15 2006-03-22
5 2006-04-29 2006-04-30
6 2006-05-24 2006-05-25
7 2006-06-26 2006-08-16
11 2006-08-18 2006-08-19
12 2006-08-28 2006-09-02
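The grouping idea can also be sketched without a loop. The following is a minimal base R sketch, not the answer's own code: it assumes every end date is on or after its start date, and the function name merge_overlaps is made up for illustration. After sorting by start date, a new group begins whenever a range starts after the latest end date seen so far.

```r
# Sample data frame x from the question (it contains nested ranges).
x <- read.table(header = TRUE, text = "start.date end.date
2006-01-19 2006-01-20
2006-01-25 2006-01-29
2006-02-24 2006-02-25
2006-03-15 2006-03-22
2006-04-29 2006-04-30
2006-05-24 2006-05-25
2006-06-26 2006-08-16
2006-07-05 2006-07-10
2006-07-12 2006-07-21
2006-08-13 2006-08-15
2006-08-18 2006-08-19
2006-08-28 2006-09-02")

# merge_overlaps() is a hypothetical helper name, not part of any package.
# Assumes end.date >= start.date in every row.
merge_overlaps <- function(df) {
  df[] <- lapply(df, as.Date)
  df <- df[order(df$start.date), ]
  n <- nrow(df)
  # Latest end date seen so far (cummax() needs plain numbers, not Dates).
  running_end <- cummax(as.numeric(df$end.date))
  # A new group starts when a range begins after all earlier ranges have ended.
  new_group <- c(TRUE, as.numeric(df$start.date[-1]) > running_end[-n])
  # The last row of each group is the row just before the next group starts.
  is_last <- c(new_group[-1], TRUE)
  data.frame(
    start.date = df$start.date[new_group],
    end.date   = as.Date(running_end[is_last], origin = "1970-01-01")
  )
}

merge_overlaps(x)
```

For x this gives the same nine ranges as the output listed above. Ranges that merely touch on consecutive days stay separate, because the comparison is strict.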
A solution:
w<-read.table(header=TRUE, stringsAsFactors=FALSE, text="
start.date end.date
2006-06-26 2006-07-24
2006-07-19 2006-08-16
2007-06-09 2007-07-07
2007-06-24 2007-07-22
2007-07-03 2007-07-31
2007-08-04 2007-09-01
2007-08-07 2007-09-04
2007-09-05 2007-10-03
2007-09-14 2007-10-12
2007-10-19 2007-11-16
2007-11-17 2007-12-15
2008-06-18 2008-07-16
2008-06-28 2008-07-26
2008-07-11 2008-08-08
2008-07-23 2008-08-20")
w <- data.frame(lapply(w, as.Date))
library(lubridate)
# 1 where row i overlaps row i+1, 0 otherwise, then run-length encoded
idx.rle <- rle(as.numeric(sapply(1:(nrow(w)-1), function(i)
  int_overlaps(interval(w[i,1], w[i,2]), interval(w[i+1,1], w[i+1,2])))))
i.starts <- nrow(w) - rev(cumsum(rev(idx.rle$lengths)))
i.ends <- 1 + cumsum(idx.rle$lengths)
do.call(rbind,
  lapply(1:length(idx.rle$lengths),
    function(i) {
      i.start <- i.starts[i]
      i.end <- i.ends[i]
      if (idx.rle$values[i] == 1) {
        # run of overlapping rows: collapse into one range
        d <- data.frame(start.date = w[i.start, 1],
                        end.date = max(w[i.start:i.end, 2]))
        names(d) <- names(w)
        d
      } else {
        # run of non-overlapping rows: keep the rows not already covered by a neighbouring run
        if (idx.rle$lengths[i] > 1 && i > 1 && i < length(idx.rle$lengths)) {
          data.frame(w[(i.start+1):(i.end-1), ])
        } else {
          if (idx.rle$lengths[i] >= 1 && i == 1) {
            data.frame(w[(i.start):(i.end-1), ])
          } else {
            if (idx.rle$lengths[i] >= 1 && i == length(idx.rle$lengths)) data.frame(w[(i.start+1):(i.end), ])
          }
        }
      }
    }))
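The IRanges package on Bioconductor includes the function reduce, which can be used to merge overlapping start and end dates into a single date range.
IRanges works on integer ranges, so you have to convert the data from class Date to integer and back. This can be wrapped in a function: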
collapse_date_ranges <- function(w, min.gapwidth = 1L) {
  library(data.table)
  library(magrittr)
  IRanges::IRanges(start = as.integer(as.Date(w$start.date)),
                   end = as.integer(as.Date(w$end.date))) %>%
    IRanges::reduce(min.gapwidth = min.gapwidth) %>%
    as.data.table() %>%
    .[, lapply(.SD, lubridate::as_date),
      .SDcols = c("start", "end")]
}
collapse_date_ranges(w, 0L)
# start end
#1: 2006-06-26 2006-08-16
#2: 2007-06-09 2007-07-31
#3: 2007-08-04 2007-09-04
#4: 2007-09-05 2007-10-12
#5: 2007-10-19 2007-11-16
#6: 2007-11-17 2007-12-15
#7: 2008-06-18 2008-08-20
collapse_date_ranges(x, 0L)
# start end
#1: 2006-01-19 2006-01-20
#2: 2006-01-25 2006-01-29
#3: 2006-02-24 2006-02-25
#4: 2006-03-15 2006-03-22
#5: 2006-04-29 2006-04-30
#6: 2006-05-24 2006-05-25
#7: 2006-06-26 2006-08-16
#8: 2006-08-18 2006-08-19
#9: 2006-08-28 2006-09-02
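Explanation
- To avoid name conflicts, I prefer to access individual functions from the IRanges package with the double colon operator :: instead of loading the whole package with library(IRanges).
- The start and end dates are converted to integer (as.Date is only there to ensure the correct class) and an IRanges object is created.
- reduce does all the hard work. The min.gapwidth parameter is needed here because reduce collapses adjacent ranges by default (see below).
- Finally, the result is converted back from integer to dates. (You could also use dplyr instead of data.table.)
- The solution works for both sample data sets w and x. x includes a special case in which some date ranges are completely embedded in another date range.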
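To illustrate why embedded ranges are a special case, here is a small base R sketch on four rows taken from x. It contrasts a check against the directly preceding row only (an assumption about how a simpler attempt might work, not a description of any particular answer) with a check against the latest end date among all earlier rows. The variable names are made up for this example.

```r
# Four rows from x: one long range with three ranges embedded in it.
nested <- data.frame(
  start.date = as.Date(c("2006-06-26", "2006-07-05", "2006-07-12", "2006-08-13")),
  end.date   = as.Date(c("2006-08-16", "2006-07-10", "2006-07-21", "2006-08-15"))
)
n <- nrow(nested)

# Naive check: does each range start on or before the end of the row directly above it?
overlaps_previous <- nested$start.date[-1] <= nested$end.date[-n]

# Robust check: compare with the latest end date among ALL earlier rows.
latest_end <- as.Date(cummax(as.numeric(nested$end.date)), origin = "1970-01-01")
overlaps_any_earlier <- nested$start.date[-1] <= latest_end[-n]

overlaps_previous     # TRUE FALSE FALSE
overlaps_any_earlier  # TRUE TRUE TRUE
```

The neighbour-only check misses the third and fourth rows, although both lie inside the first range. Tracking the running maximum of the end dates catches them.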
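Addendum: Collapsing adjacent date ranges
The sample result given by the OP indicates that adjacent date ranges should not be collapsed. For example, the range 2007-10-19 to 2007-11-16 is kept separate from the range 2007-11-17 to 2007-12-15, although the second range starts just one day after the first one ends.
In case adjacent date ranges should be collapsed, this can be achieved by using the default value of the min.gapwidth parameter: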
collapse_date_ranges(w)
# start end
#1: 2006-06-26 2006-08-16
#2: 2007-06-09 2007-07-31
#3: 2007-08-04 2007-10-12
#4: 2007-10-19 2007-12-15
#5: 2008-06-18 2008-08-20
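The difference between the two settings can be checked in base R by measuring the gap, in whole days, between each range and everything before it. This is only a sketch of the idea, not the IRanges implementation. The helper count_ranges and its min_gap argument are hypothetical names chosen to mirror min.gapwidth, on the assumption that ranges are kept apart when the gap between them is at least that many days.

```r
w <- read.table(header = TRUE, text = "
start.date end.date
2006-06-26 2006-07-24
2006-07-19 2006-08-16
2007-06-09 2007-07-07
2007-06-24 2007-07-22
2007-07-03 2007-07-31
2007-08-04 2007-09-01
2007-08-07 2007-09-04
2007-09-05 2007-10-03
2007-09-14 2007-10-12
2007-10-19 2007-11-16
2007-11-17 2007-12-15
2008-06-18 2008-07-16
2008-06-28 2008-07-26
2008-07-11 2008-08-08
2008-07-23 2008-08-20")

# Work on day numbers, sorted by start date.
ord    <- order(as.Date(w$start.date))
starts <- as.numeric(as.Date(w$start.date))[ord]
ends   <- as.numeric(as.Date(w$end.date))[ord]
n      <- length(starts)

# Days lying strictly between a range and the latest end seen before it.
# 0 means the ranges are adjacent, negative values mean they overlap.
gap_days <- starts[-1] - cummax(ends)[-n] - 1

# Hypothetical helper: ranges stay separate when the gap is at least min_gap days.
count_ranges <- function(min_gap) 1 + sum(gap_days >= min_gap)

count_ranges(0)  # 7: only overlapping ranges are merged
count_ranges(1)  # 5: adjacent ranges are merged as well
```

The two counts agree with the seven and five rows returned by collapse_date_ranges(w, 0L) and collapse_date_ranges(w) above.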
For anyone coming to this old question, here is a newer option that uses a package designed specifically for working with intervals:
library(tidyverse)
library(ivs)
w <- read.table(header = TRUE, text = "
start.date end.date
2006-06-26 2006-07-24
2006-07-19 2006-08-16
2007-06-09 2007-07-07
2007-06-24 2007-07-22
2007-07-03 2007-07-31
2007-08-04 2007-09-01
2007-08-07 2007-09-04
2007-09-05 2007-10-03
2007-09-14 2007-10-12
2007-10-19 2007-11-16
2007-11-17 2007-12-15
2008-06-18 2008-07-16
2008-06-28 2008-07-26
2008-07-11 2008-08-08
2008-07-23 2008-08-20")
w |>
  mutate(iv = iv(start.date, end.date)) |>
  summarise(iv = iv_groups(iv), .groups = "drop")
#> iv
#> 1 [2006-06-26, 2006-08-16)
#> 2 [2007-06-09, 2007-07-31)
#> 3 [2007-08-04, 2007-09-04)
#> 4 [2007-09-05, 2007-10-12)
#> 5 [2007-10-19, 2007-11-16)
#> 6 [2007-11-17, 2007-12-15)
#> 7 [2008-06-18, 2008-08-20)
Created on 2022-05-27 by the reprex package (v2.0.1)