How to round datetime to nearest time of day, preferably vectorized?

Suppose I have a POSIXct vector like this:

timestamps = seq(as.POSIXct("2021-01-23"), as.POSIXct("2021-01-24"), length.out = 6)

I would like to round these times up to the next hour of day given in a vector:

hours_of_day = c(2, 14, 20)

That is, I want the following result:

           timestamps              result
1 2021-01-23 00:00:00 2021-01-23 02:00:00
2 2021-01-23 04:48:00 2021-01-23 14:00:00
3 2021-01-23 09:36:00 2021-01-23 14:00:00
4 2021-01-23 14:24:00 2021-01-23 20:00:00
5 2021-01-23 19:12:00 2021-01-23 20:00:00
6 2021-01-24 00:00:00 2021-01-24 02:00:00

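Is there a vectorized solution (or another fast one)? I have several million timestamps and need to apply several different hours_of_day vectors.

One way to simplify the problem would be to (1) find the next hours_of_day for each lubridate::hour(timestamps), and then (2) compute result = lubridate::floor_date(timestamps, "day") + next_hour_of_day * 3600. But how can step 1 be vectorized?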

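Convert to POSIXlt with as.POSIXlt, which lets you extract the hours, minutes and seconds and compute the decimal hour. In an lapply/sapply combination, first find which hours of day are greater than or equal to that decimal hour, then use which.max to pick the first of them. Finally, create the new date-time with ISOdate and add one day wherever the new date-time is earlier than the original time.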

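timestamps <- as.POSIXlt(timestamps)

# For each timestamp, pick the first hour of day at or after its decimal hour.
# which.max() on a logical vector returns the index of the first TRUE, or 1 if
# all are FALSE (the timestamp is later than every target hour).
h <- hours_of_day[sapply(lapply(with(timestamps, hour + min/60 + sec/3600), 
                                `<=`, hours_of_day), which.max)]
# Build the rounded date-time on the same calendar day.
r <- with(timestamps, ISOdate(1900 + year, mon + 1, mday, h,
                              tz=attr(timestamps, "tzone")[[1]]))
# Results that fall before the original time belong to the next day.
r[r < timestamps] <- r[r < timestamps] + 86400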

Result

r
# [1] "2021-01-23 06:00:00 CET" "2021-01-23 06:00:00 CET"
# [3] "2021-01-23 14:00:00 CET" "2021-01-23 20:00:00 CET"
# [5] "2021-01-23 20:00:00 CET" "2021-01-24 06:00:00 CET"
# [7] "2021-01-25 06:00:00 CET" "2021-01-27 20:00:00 CET"

data.frame(timestamps, r)
#            timestamps                   r
# 1 2021-01-23 00:00:00 2021-01-23 06:00:00
# 2 2021-01-23 04:48:00 2021-01-23 06:00:00
# 3 2021-01-23 09:36:00 2021-01-23 14:00:00
# 4 2021-01-23 14:24:00 2021-01-23 20:00:00
# 5 2021-01-23 19:12:00 2021-01-23 20:00:00
# 6 2021-01-24 00:00:00 2021-01-24 06:00:00
# 7 2021-01-24 23:59:00 2021-01-25 06:00:00
# 8 2021-01-27 20:00:00 2021-01-27 20:00:00

Note: I have added "2021-01-24 23:59:00 CET" to timestamps to demonstrate the change of date.
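As a possible alternative to the lapply/sapply step, the lookup of the next hour of day can be done in one vectorized call to base R's findInterval. The following is only a minimal sketch, not the method benchmarked in this answer: the function name round_up_to_hours is made up for illustration, hours_of_day is assumed to hold hours in [0, 24), and the example uses UTC so that adding seconds to midnight is not affected by daylight saving changes.

```r
# Hypothetical helper: round each timestamp up to the next hour of day.
round_up_to_hours <- function(timestamps, hours_of_day) {
  hours_of_day <- sort(hours_of_day)
  lt <- as.POSIXlt(timestamps)
  decimal_hour <- lt$hour + lt$min / 60 + lt$sec / 3600

  # With left.open = TRUE, findInterval() counts the targets strictly below
  # each decimal hour, so count + 1 indexes the first target >= that hour.
  idx <- findInterval(decimal_hour, hours_of_day, left.open = TRUE) + 1L

  # Past the last target: wrap to the first target of the following day.
  wrap <- !is.na(idx) & idx > length(hours_of_day)
  idx[wrap] <- 1L

  # Midnight of each timestamp's day plus the chosen hour (assumes no DST shift).
  midnight <- as.POSIXct(trunc(lt, "days"))
  midnight + (hours_of_day[idx] + 24 * wrap) * 3600
}

timestamps <- as.POSIXct(c("2021-01-23 00:00", "2021-01-23 04:48",
                           "2021-01-23 14:00", "2021-01-24 23:59"), tz = "UTC")
rounded <- round_up_to_hours(timestamps, c(6, 14, 20))
data.frame(timestamps, rounded)
```

Because findInterval works on the whole vector at once, no per-element R function call is needed; in a time zone with daylight saving, the last line would need a calendar-aware construction such as ISOdate instead of adding seconds.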


Benchmark

Tested on a vector of length 1.4e6.

# Unit: seconds
#         expr      min       lq     mean   median       uq      max neval cld
#      POSIX() 32.96197 33.06495 33.32104 33.16793 33.50057 33.83321     3  a 
#  lubridate() 47.36412 47.57762 47.75280 47.79113 47.94715 48.10316     3   b

Data:

timestamps <- structure(c(1611356400, 1611373680, 1611390960, 1611408240, 1611425520, 
1611442800, 1611529140, 1611774000), class = c("POSIXct", "POSIXt"
))
hours_of_day <- c(6, 14, 20)

I would extract the hour of day, bin it with cut, and assign the binned hour back to the original times:

hours_of_day = c(2, 14, 20)

library(lubridate)
library(magrittr)  ## just for the pipe
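# Use the decimal hour so that e.g. 14:24 falls in the (14, 20] bin, not (2, 14].
# Times after the last hour of day fall outside the bins and give NA.
new_hours = (hour(timestamps) + minute(timestamps) / 60 + second(timestamps) / 3600) %>% 
  cut(breaks = c(0, hours_of_day), labels = hours_of_day, include.lowest = TRUE) %>% 
  as.character() %>%
  as.integer()

result = floor_date(timestamps, "hour")
hour(result) = new_hours

result
# [1] "2021-01-23 02:00:00 EST" "2021-01-23 14:00:00 EST" "2021-01-23 14:00:00 EST"
# [4] "2021-01-23 20:00:00 EST" "2021-01-23 20:00:00 EST" "2021-01-24 02:00:00 EST"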

Based on @jay.sf's approach, I also wrote a function that rounds down (floor) instead, and added support for NA values while I was at it.

floor_date_to = function(timestamps, hours_of_day) {

  # Handle NA with a temporary filler so the code below doesn't break
  na_timestamps = is.na(timestamps)
  timestamps[na_timestamps] = as.POSIXct("9999-12-31")

  # Proceed as usual
  timestamps = as.POSIXlt(timestamps)
  hours_of_day = rev(hours_of_day)  # floor-specific: which.max returns the first index, so search from the latest hour
  decimal_hour = with(timestamps, hour + min/60 + sec/3600)
  nearest_hour = hours_of_day[sapply(lapply(decimal_hour, `<`, hours_of_day),
                                     function(x) which.max(-x))]  # floor-specific: first FALSE, i.e. latest hour <= time
  rounded = with(timestamps, ISOdate(1900 + year, mon + 1, mday, nearest_hour, tz = attr(timestamps, "tzone")[1]))
  rounded[rounded > timestamps] = rounded[rounded > timestamps] - 86400  # floor: use minus
  rounded[na_timestamps] = NA  # Restore NA before returning
  return(rounded)
}
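For several million timestamps, the same floor operation might also be written without sapply and without the NA filler. This is a rough sketch under a few assumptions rather than a drop-in replacement: the name floor_to_hours is hypothetical, hours_of_day is assumed to hold hours in [0, 24), and the example uses UTC so that subtracting whole days in seconds is safe. It relies on findInterval returning NA for NA input, so missing timestamps simply stay NA.

```r
# Hypothetical helper: round each timestamp down to the previous hour of day.
floor_to_hours <- function(timestamps, hours_of_day) {
  hours_of_day <- sort(hours_of_day)
  lt <- as.POSIXlt(timestamps)
  decimal_hour <- lt$hour + lt$min / 60 + lt$sec / 3600  # NA stays NA

  # Number of targets <= decimal hour; 0 means "before the first target today".
  idx <- findInterval(decimal_hour, hours_of_day)

  # Before the first target: use the last target of the previous day.
  wrap <- !is.na(idx) & idx == 0L
  idx[wrap] <- length(hours_of_day)

  # Midnight of each timestamp's day plus the chosen hour (assumes no DST shift).
  midnight <- as.POSIXct(trunc(lt, "days"))
  midnight + (hours_of_day[idx] - 24 * wrap) * 3600
}

timestamps <- as.POSIXct(c("2021-01-23 03:00", "2021-01-23 14:00",
                           "2021-01-23 23:59", NA), tz = "UTC")
floored <- floor_to_hours(timestamps, c(6, 14, 20))
data.frame(timestamps, floored)
```

A timestamp that lies exactly on one of the hours of day is left unchanged, matching the behaviour of the sapply version above.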