Merge overlapping time periods with milliseconds in R

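I am trying to find a way to merge overlapping time intervals that can handle milliseconds.

Three possible options were posted here:

However, I don't need to group by ID, so I find the dplyr and data.table approaches confusing (and I'm not sure whether they can handle milliseconds, since I couldn't get them to work).

I have managed to get the IRanges solution working, but it converts the POSIXct objects to integers via as.numeric in order to compute the overlaps. So I assume that is why the milliseconds are missing from the output?

The missing milliseconds do not seem to be a display issue, because when I subtract the resulting start times from the end times I get whole-number results in seconds.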
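As a minimal sketch of the suspicion above (this is not code from the linked IRanges answer, and the "UTC" time zone is an assumption made only for illustration), converting a POSIXct value to an integer truncates the fractional seconds, and they cannot be recovered afterwards:

```r
x <- as.POSIXct("2019-07-15 21:32:43.565", tz = "UTC")

# The underlying double still carries the fractional seconds (about 0.565)
frac <- as.numeric(x) %% 1

# Converting to integer truncates them
secs <- as.integer(x)

# Converting back gives a whole-second time, so differences are whole seconds
y <- as.POSIXct(secs, origin = "1970-01-01", tz = "UTC")
as.numeric(y) %% 1
```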

Here is a sample of my data:

start <- c("2019-07-15 21:32:43.565",
           "2019-07-15 21:32:43.634",
           "2019-07-15 21:32:54.301",
           "2019-07-15 21:34:08.506",
           "2019-07-15 21:34:09.957")

end <- c("2019-07-15 21:32:48.445",
         "2019-07-15 21:32:49.045",
         "2019-07-15 21:32:54.801",
         "2019-07-15 21:34:10.111",
         "2019-07-15 21:34:10.236")

df <- data.frame(start, end)

The output I get from the IRanges solution:

                start                 end
1 2019-07-15 21:32:43 2019-07-15 21:32:49
2 2019-07-15 21:32:54 2019-07-15 21:32:54
3 2019-07-15 21:34:08 2019-07-15 21:34:10

And the desired result:

                    start                     end
1 2019-07-15 21:32:43.565 2019-07-15 21:32:49.045
2 2019-07-15 21:32:54.301 2019-07-15 21:32:54.801
3 2019-07-15 21:34:08.506 2019-07-15 21:34:10.236

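Thank you very much for any suggestions!

I find that it is very easy to preserve the milliseconds if you use the POSIXlt format. Although there are faster ways to compute the overlaps, looping over the data frame is fast enough for most purposes.

Here is a reproducible example.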

start <- c("2019-07-15 21:32:43.565",
           "2019-07-15 21:32:43.634",
           "2019-07-15 21:32:54.301",
           "2019-07-15 21:34:08.506",
           "2019-07-15 21:34:09.957")

end <- c("2019-07-15 21:32:48.445",
         "2019-07-15 21:32:49.045",
         "2019-07-15 21:32:54.801",
         "2019-07-15 21:34:10.111",
         "2019-07-15 21:34:10.236")

df <- data.frame(start = as.POSIXlt(start), end = as.POSIXlt(end))

i <- 1

while(i < nrow(df))
{
  # Rows whose interval overlaps row i (row i itself always matches)
  overlaps <- which(df$start < df$end[i] & df$end > df$start[i])
  if(length(overlaps) > 1)
  {
    # Extend row i to the latest end time, then drop the other overlapping rows
    df$end[i] <- max(df$end[overlaps])
    df <- df[-overlaps[-which(overlaps == i)], ]
    # Undo the increment below so that row i is checked again
    i <- i - 1
  }
  i <- i + 1
}

So now our data frame has no overlaps:

df
#>                 start                 end
#> 1 2019-07-15 21:32:43 2019-07-15 21:32:49
#> 3 2019-07-15 21:32:54 2019-07-15 21:32:54
#> 4 2019-07-15 21:34:08 2019-07-15 21:34:10

Although it looks as though we have lost the milliseconds, this is only a display issue, as we can show by doing this:

df$end - df$start
#> Time differences in secs
#> [1] 5.48 0.50 1.73

as.numeric(df$end - df$start)
#> [1] 5.48 0.50 1.73

Created on 2020-02-20 by the reprex package (v0.3.0)
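The loop-based answer mentions that faster ways to compute the overlaps exist. One possible vectorised sketch in base R is shown below. It is not the method used in that answer: it sorts by start time, tracks the running maximum of the end times, and starts a new group whenever an interval begins at or after everything seen so far. The names `prev_max_end`, `group` and `merged`, and the "UTC" time zone, are assumptions made for illustration.

```r
start <- as.POSIXct(c("2019-07-15 21:32:43.565", "2019-07-15 21:32:43.634",
                      "2019-07-15 21:32:54.301", "2019-07-15 21:34:08.506",
                      "2019-07-15 21:34:09.957"), tz = "UTC")
end   <- as.POSIXct(c("2019-07-15 21:32:48.445", "2019-07-15 21:32:49.045",
                      "2019-07-15 21:32:54.801", "2019-07-15 21:34:10.111",
                      "2019-07-15 21:34:10.236"), tz = "UTC")

# Sort by start time and work on the underlying seconds (doubles keep the fraction)
ord <- order(start)
s <- as.numeric(start[ord])
e <- as.numeric(end[ord])

# Latest end time among all earlier intervals (-Inf for the first one)
prev_max_end <- c(-Inf, head(cummax(e), -1))

# A new group begins whenever an interval starts at or after that point
group <- cumsum(s >= prev_max_end)

# Collapse each group to its earliest start and latest end
merged <- data.frame(
  start = as.POSIXct(as.vector(tapply(s, group, min)),
                     origin = "1970-01-01", tz = "UTC"),
  end   = as.POSIXct(as.vector(tapply(e, group, max)),
                     origin = "1970-01-01", tz = "UTC")
)

# Durations in seconds, rounded to hide floating-point noise
durations <- round(as.numeric(difftime(merged$end, merged$start, units = "secs")), 2)
durations
```

As in the loop, intervals that merely touch (one ends exactly when the next starts) are kept separate here; using `>` instead of `>=` would merge them.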

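I think the best approach is to use the clock package (for a true sub-second precision date-time type) along with the ivs package (for merging overlapping intervals).

Using POSIXct for sub-second date-times can be a bit challenging for various reasons, which I have talked about here

The key here is iv_groups(), which merges all overlapping intervals and returns the intervals that remain after all overlaps have been merged. It is also backed by a very fast C implementation.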

library(clock)
library(ivs)
library(dplyr)

df <- tibble(
  start = c(
    "2019-07-15 21:32:43.565", "2019-07-15 21:32:43.634",
    "2019-07-15 21:32:54.301", "2019-07-15 21:34:08.506",
    "2019-07-15 21:34:09.957"
  ),
  end = c(
    "2019-07-15 21:32:48.445", "2019-07-15 21:32:49.045",
    "2019-07-15 21:32:54.801", "2019-07-15 21:34:10.111",
    "2019-07-15 21:34:10.236"
  )
)

# Parse into "naive time" (i.e. with a yet-to-be-defined time zone)
# using a millisecond precision
df <- df %>%
  mutate(
    start = naive_time_parse(start, format = "%Y-%m-%d %H:%M:%S", precision = "millisecond"),
    end = naive_time_parse(end, format = "%Y-%m-%d %H:%M:%S", precision = "millisecond"),
  )

df
#> # A tibble: 5 × 2
#>   start                   end                    
#>   <tp<naive><milli>>      <tp<naive><milli>>     
#> 1 2019-07-15T21:32:43.565 2019-07-15T21:32:48.445
#> 2 2019-07-15T21:32:43.634 2019-07-15T21:32:49.045
#> 3 2019-07-15T21:32:54.301 2019-07-15T21:32:54.801
#> 4 2019-07-15T21:34:08.506 2019-07-15T21:34:10.111
#> 5 2019-07-15T21:34:09.957 2019-07-15T21:34:10.236

# Now combine these start/end boundaries into a single interval vector
df <- df %>%
  mutate(interval = iv(start, end), .keep = "unused")

df
#> # A tibble: 5 × 1
#>                                             interval
#>                               <iv<tp<naive><milli>>>
#> 1 [2019-07-15T21:32:43.565, 2019-07-15T21:32:48.445)
#> 2 [2019-07-15T21:32:43.634, 2019-07-15T21:32:49.045)
#> 3 [2019-07-15T21:32:54.301, 2019-07-15T21:32:54.801)
#> 4 [2019-07-15T21:34:08.506, 2019-07-15T21:34:10.111)
#> 5 [2019-07-15T21:34:09.957, 2019-07-15T21:34:10.236)

# And use `iv_groups()` to merge all overlapping intervals.
# It returns the remaining intervals after all overlaps have been removed.
df %>%
  summarise(interval = iv_groups(interval))
#> # A tibble: 3 × 1
#>                                             interval
#>                               <iv<tp<naive><milli>>>
#> 1 [2019-07-15T21:32:43.565, 2019-07-15T21:32:49.045)
#> 2 [2019-07-15T21:32:54.301, 2019-07-15T21:32:54.801)
#> 3 [2019-07-15T21:34:08.506, 2019-07-15T21:34:10.236)
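If separate start and end values are needed again, as in the desired output of the question, the merged interval vector can be split back apart. The following is a minimal sketch on two of the intervals, assuming the clock and ivs packages are installed. The variable names are illustrative, and the "UTC" zone passed to `as_date_time()` is an assumption, since naive times carry no zone of their own.

```r
library(clock)
library(ivs)

fmt <- "%Y-%m-%d %H:%M:%S"

start <- naive_time_parse(
  c("2019-07-15 21:32:43.565", "2019-07-15 21:32:43.634"),
  format = fmt, precision = "millisecond"
)
end <- naive_time_parse(
  c("2019-07-15 21:32:48.445", "2019-07-15 21:32:49.045"),
  format = fmt, precision = "millisecond"
)

# Merge the overlapping intervals
merged <- iv_groups(iv(start, end))

# Extract the boundaries of each merged interval
merged_start <- iv_start(merged)
merged_end <- iv_end(merged)

format(merged_start)
format(merged_end)

# Optionally convert back to POSIXct by assigning a time zone (assumed here)
as_date_time(merged_start, zone = "UTC")
```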

Created on 2022-04-05 by the reprex package (v2.0.1)