Calculate the union of many intervals

Question

I want to get the union of many (more than 2) intervals:

library(lubridate)

# interval() is the current constructor; new_interval() is deprecated.
df <- data.frame(id = c(1, 2, 3),
                 interval = c(
                   interval(ymd("2001-01-01"), ymd("2002-01-01")),
                   interval(ymd("2001-01-01"), ymd("2004-01-01")),
                   interval(ymd("2001-02-01"), ymd("2002-01-01"))
                 ))
df
#   id                       interval
# 1  1 2001-01-01 UTC--2002-01-01 UTC
# 2  2 2001-01-01 UTC--2004-01-01 UTC
# 3  3 2001-02-01 UTC--2002-01-01 UTC

lubridate::union(lubridate::union(df$interval[1], df$interval[2]),
                 df$interval[3])
# [1] 2001-01-01 UTC--2004-01-01 UTC

This is the correct result.

But why doesn't lubridate::union work with Reduce?

Reduce(lubridate::union, df$interval )
# [1] 31536000 94608000 28857600

It seems the interval objects are converted to numeric (before union is applied).

Answer 1

I don't know what is going on with Reduce, but this is how I would do it:

library(dplyr)
library(stringr)

df  %>% 
  mutate(interval = str_trim(str_replace_all(interval, "(--|UTC)", " ")),
         int_start = word(interval), 
         int_end = word(interval, -1)) %>% 
  summarise(interval = str_c(min(int_start), 
                             max(int_end), 
                             sep = "--"))
# Result:
#                 interval
# 1 2001-01-01--2004-01-01

Answer 2

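The reason this doesn't work is not Reduce() itself. Rather, it is as.list(), which Reduce() applies to x when the supplied x argument is not a list to begin with. The relevant lines are lines 8 and 9 of Reduce(), shown below.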

head(Reduce, 9)
# ...                                                           
# 8      if (!is.vector(x) || is.object(x))                   
# 9          x <- as.list(x)

A quick check of the if() condition confirms this.

!is.vector(df$interval) || is.object(df$interval)
# [1] TRUE

So in your call to Reduce(), as.list() is applied to df$interval, which means df$interval becomes:

as.list(df$interval)
# [[1]]
# [1] 31536000
#
# [[2]]
# [1] 94608000
#
# [[3]]
# [1] 28857600

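This conversion happens before any important operation in Reduce() takes place (in fact, for our purposes, it is the most important operation). It also makes the Reduce() output make sense: it returns all three values because they are all unique.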
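The same loss of class information can be reproduced without lubridate. This is only a minimal sketch: `Span` is a hypothetical toy S4 class, defined here purely for illustration, that extends numeric and carries an extra slot, much as Interval does.

```r
library(methods)

# Hypothetical toy class: numeric data plus an extra "start" slot.
setClass("Span", contains = "numeric", slots = c(start = "numeric"))
s <- new("Span", c(10, 20, 30), start = c(1, 2, 3))

# The same condition that Reduce() checks before converting its input.
needs_conversion <- !is.vector(s) || is.object(s)

# With no as.list() method for "Span", the default method is used:
# it splits the numeric data and drops the class and the slot.
parts <- as.list(s)
class(parts[[1]])
is(parts[[1]], "Span")
```

Each element comes back as a bare number, so a function that relies on the class, as lubridate::union does, only ever sees plain numerics.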

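If you really need to use Reduce(), you can sidestep the list check by building your own list first with a for() loop (lapply() won't work either, because it also calls as.list() on its input). We can then pass that list to Reduce() and get the desired, correct output.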

x <- vector("list", length(df$interval))
for(i in seq_along(x)) x[[i]] <- df$interval[i]

Reduce(lubridate::union, x)
# [1] 2001-01-01 UTC--2004-01-01 UTC
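How Reduce() folds a two-argument function over a ready-made list can be sketched in base R. The helper `span_union` and the start/end list layout are hypothetical stand-ins for lubridate::union and Interval objects, not lubridate's implementation.

```r
# Hypothetical stand-in for lubridate::union: the span covering both inputs.
span_union <- function(a, b) {
  list(start = min(a$start, b$start), end = max(a$end, b$end))
}

# A plain list, so Reduce() does not call as.list() on it.
spans <- list(
  list(start = as.Date("2001-01-01"), end = as.Date("2002-01-01")),
  list(start = as.Date("2001-01-01"), end = as.Date("2004-01-01")),
  list(start = as.Date("2001-02-01"), end = as.Date("2002-01-01"))
)

# Evaluated as span_union(span_union(spans[[1]], spans[[2]]), spans[[3]]).
result <- Reduce(span_union, spans)
format(result$start)  # "2001-01-01"
format(result$end)    # "2004-01-01"
```

Because every element reaches the function intact, the fold behaves exactly like the nested two-argument calls written out by hand.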

A better option, though, is to write an as.list() method for the Interval class and put it at the top of your script. We can reuse the same code as above.

as.list.Interval <- function(x, ...) {
    out <- vector("list", length(x))
    for(i in seq_along(x)) out[[i]] <- x[i]
    out
}

Reduce(lubridate::union, df$interval)
# [1] 2001-01-01 UTC--2004-01-01 UTC

Also note that you can do this another way, by taking the earliest value in the start slot and the latest end from int_end().

interval(min(slot(df$interval, "start")), max(int_end(df$interval)))
# [1] 2001-01-01 UTC--2004-01-01 UTC

Answer 3

This was just resolved in the lubridate package: https://github.com/hadley/lubridate/issues/348

r

lubridate