如何将日期时间舍入到一天中最近的时间,最好是矢量化?
How to round datetime to nearest time of day, preferably vectorized?
假设我有一个类似
的 POSIXct 向量
timestamps = seq(as.POSIXct("2021-01-23"), as.POSIXct("2021-01-24"), length.out = 6)
我想将这些时间四舍五入到向量中最接近的一天中的小时数:
hours_of_day = c(6, 14, 20)
即以下结果:
timestamps result
1 2021-01-23 00:00:00 2021-01-23 02:00:00
2 2021-01-23 04:48:00 2021-01-23 14:00:00
3 2021-01-23 09:36:00 2021-01-23 14:00:00
4 2021-01-23 14:24:00 2021-01-23 20:00:00
5 2021-01-23 19:12:00 2021-01-23 20:00:00
6 2021-01-24 00:00:00 2021-01-24 02:00:00
是否有矢量化解决方案(或其他快速解决方案)?我有几百万个时间戳,需要应用几个 hours_of_day
.
简化此问题的一种方法是 (1) 为每个 lubridate::hour(timestamps)
找到下一个 hours_of_day
,然后 (2) result = lubridate::floor_date(timestamps) + next_hour_of_day * 3600
。但是如何将第 1 步矢量化?
转换为as.POSIXlt
,它允许您提取小时和分钟,并计算小数小时。在 lapply
/sapply
组合中,首先查找小于一天矢量的小时数的位置,然后使用 which.max
选择最大小时数。现在使用 ISOdate
创建新的日期时间并添加一天 ifelse
date-time is smaller than original time.
timestamps <- as.POSIXlt(timestamps)
h <- hours_of_day[sapply(lapply(with(timestamps, hour + min/60 + sec/3600),
`<=`, hours_of_day), which.max)]
r <- with(timestamps, ISOdate(1900 + year, mon + 1, mday, h,
tz=attr(timestamps, "tzone")[[1]]))
r[r < timestamps] <- r[r < timestamps] + 86400
结果
r
# [1] "2021-01-23 06:00:00 CET" "2021-01-23 06:00:00 CET"
# [3] "2021-01-23 14:00:00 CET" "2021-01-23 20:00:00 CET"
# [5] "2021-01-23 20:00:00 CET" "2021-01-24 06:00:00 CET"
# [7] "2021-01-25 06:00:00 CET" "2021-01-27 20:00:00 CET"
data.frame(timestamps, r)
# timestamps r
# 1 2021-01-23 00:00:00 2021-01-23 06:00:00
# 2 2021-01-23 04:48:00 2021-01-23 06:00:00
# 3 2021-01-23 09:36:00 2021-01-23 14:00:00
# 4 2021-01-23 14:24:00 2021-01-23 20:00:00
# 5 2021-01-23 19:12:00 2021-01-23 20:00:00
# 6 2021-01-24 00:00:00 2021-01-24 06:00:00
# 7 2021-01-24 23:59:00 2021-01-25 06:00:00
# 8 2021-01-27 20:00:00 2021-01-27 20:00:00
注意:我已将 "2021-01-24 23:59:00 CET"
添加到 timestamps
以演示日期更改。
基准
在长度为 1.4e6 的向量上进行了测试。
# Unit: seconds
# expr min lq mean median uq max neval cld
# POSIX() 32.96197 33.06495 33.32104 33.16793 33.50057 33.83321 3 a
# lubridate() 47.36412 47.57762 47.75280 47.79113 47.94715 48.10316 3 b
数据:
timestamps <- structure(c(1611356400, 1611373680, 1611390960, 1611408240, 1611425520,
1611442800, 1611529140, 1611774000), class = c("POSIXct", "POSIXt"
))
hours_of_day <- c(6, 14, 20)
我将提取 hour
组件,使用 cut
将其装箱,并将装箱的小时数分配回原始时间:
hours_of_day = c(2, 14, 20)
library(lubridate)
library(magrittr) ## just for the pipe
new_hours = timestamps %>%
hour %>%
cut(breaks = c(0, hours_of_day), labels = hours_of_day, include.lowest = TRUE) %>%
as.character() %>%
as.integer()
result = floor_date(timestamps, "hour")
hour(result) = new_hours
result
# [1] "2021-01-23 02:00:00 EST" "2021-01-23 14:00:00 EST" "2021-01-23 14:00:00 EST"
# [4] "2021-01-23 14:00:00 EST" "2021-01-23 20:00:00 EST" "2021-01-24 02:00:00 EST"
基于@jay.sf 的方法,我还为 floor 创建了一个函数,同时添加了对 NA
值的支持。
floor_date_to = function(timestamps, hours_of_day) {
# Handle NA with a temporary filler so code below doesn't break
na_timestamps = is.na(timestamps)
timestamps[na_timestamps] = as.POSIXct("9999-12-31")
# Proceed as usual
timestamps = as.POSIXlt(timestamps)
hours_of_day = rev(hours_of_day) # floor-specific: because which.max returns the first index by default
nearest_hour = hours_of_day[sapply(lapply(with(timestamps, hour + min/60 + sec/3600), `<`, hours_of_day), function(x) which.max(-x))] # floor-specific: negative which.max()
rounded = with(timestamps, ISOdate(1900 + year, mon + 1, mday, nearest_hour, tz = attr(timestamps, "tzone")[1]))
rounded[rounded > timestamps] = rounded[rounded > timestamps] - 86400 # floor: use minus
return(rounded)
timestamps[na_timestamps] = NA # Overwrite with NA again
}
假设我有一个类似
的 POSIXct 向量timestamps = seq(as.POSIXct("2021-01-23"), as.POSIXct("2021-01-24"), length.out = 6)
我想将这些时间四舍五入到向量中最接近的一天中的小时数:
hours_of_day = c(6, 14, 20)
即以下结果:
timestamps result
1 2021-01-23 00:00:00 2021-01-23 02:00:00
2 2021-01-23 04:48:00 2021-01-23 14:00:00
3 2021-01-23 09:36:00 2021-01-23 14:00:00
4 2021-01-23 14:24:00 2021-01-23 20:00:00
5 2021-01-23 19:12:00 2021-01-23 20:00:00
6 2021-01-24 00:00:00 2021-01-24 02:00:00
是否有矢量化解决方案(或其他快速解决方案)?我有几百万个时间戳,需要应用几个 hours_of_day
.
简化此问题的一种方法是 (1) 为每个 lubridate::hour(timestamps)
找到下一个 hours_of_day
,然后 (2) result = lubridate::floor_date(timestamps) + next_hour_of_day * 3600
。但是如何将第 1 步矢量化?
转换为as.POSIXlt
,它允许您提取小时和分钟,并计算小数小时。在 lapply
/sapply
组合中,首先查找小于一天矢量的小时数的位置,然后使用 which.max
选择最大小时数。现在使用 ISOdate
创建新的日期时间并添加一天 ifelse
date-time is smaller than original time.
timestamps <- as.POSIXlt(timestamps)
h <- hours_of_day[sapply(lapply(with(timestamps, hour + min/60 + sec/3600),
`<=`, hours_of_day), which.max)]
r <- with(timestamps, ISOdate(1900 + year, mon + 1, mday, h,
tz=attr(timestamps, "tzone")[[1]]))
r[r < timestamps] <- r[r < timestamps] + 86400
结果
r
# [1] "2021-01-23 06:00:00 CET" "2021-01-23 06:00:00 CET"
# [3] "2021-01-23 14:00:00 CET" "2021-01-23 20:00:00 CET"
# [5] "2021-01-23 20:00:00 CET" "2021-01-24 06:00:00 CET"
# [7] "2021-01-25 06:00:00 CET" "2021-01-27 20:00:00 CET"
data.frame(timestamps, r)
# timestamps r
# 1 2021-01-23 00:00:00 2021-01-23 06:00:00
# 2 2021-01-23 04:48:00 2021-01-23 06:00:00
# 3 2021-01-23 09:36:00 2021-01-23 14:00:00
# 4 2021-01-23 14:24:00 2021-01-23 20:00:00
# 5 2021-01-23 19:12:00 2021-01-23 20:00:00
# 6 2021-01-24 00:00:00 2021-01-24 06:00:00
# 7 2021-01-24 23:59:00 2021-01-25 06:00:00
# 8 2021-01-27 20:00:00 2021-01-27 20:00:00
注意:我已将 "2021-01-24 23:59:00 CET"
添加到 timestamps
以演示日期更改。
基准
在长度为 1.4e6 的向量上进行了测试。
# Unit: seconds
# expr min lq mean median uq max neval cld
# POSIX() 32.96197 33.06495 33.32104 33.16793 33.50057 33.83321 3 a
# lubridate() 47.36412 47.57762 47.75280 47.79113 47.94715 48.10316 3 b
数据:
timestamps <- structure(c(1611356400, 1611373680, 1611390960, 1611408240, 1611425520,
1611442800, 1611529140, 1611774000), class = c("POSIXct", "POSIXt"
))
hours_of_day <- c(6, 14, 20)
我将提取 hour
组件,使用 cut
将其装箱,并将装箱的小时数分配回原始时间:
hours_of_day = c(2, 14, 20)
library(lubridate)
library(magrittr) ## just for the pipe
new_hours = timestamps %>%
hour %>%
cut(breaks = c(0, hours_of_day), labels = hours_of_day, include.lowest = TRUE) %>%
as.character() %>%
as.integer()
result = floor_date(timestamps, "hour")
hour(result) = new_hours
result
# [1] "2021-01-23 02:00:00 EST" "2021-01-23 14:00:00 EST" "2021-01-23 14:00:00 EST"
# [4] "2021-01-23 14:00:00 EST" "2021-01-23 20:00:00 EST" "2021-01-24 02:00:00 EST"
基于@jay.sf 的方法,我还为 floor 创建了一个函数,同时添加了对 NA
值的支持。
floor_date_to = function(timestamps, hours_of_day) {
# Handle NA with a temporary filler so code below doesn't break
na_timestamps = is.na(timestamps)
timestamps[na_timestamps] = as.POSIXct("9999-12-31")
# Proceed as usual
timestamps = as.POSIXlt(timestamps)
hours_of_day = rev(hours_of_day) # floor-specific: because which.max returns the first index by default
nearest_hour = hours_of_day[sapply(lapply(with(timestamps, hour + min/60 + sec/3600), `<`, hours_of_day), function(x) which.max(-x))] # floor-specific: negative which.max()
rounded = with(timestamps, ISOdate(1900 + year, mon + 1, mday, nearest_hour, tz = attr(timestamps, "tzone")[1]))
rounded[rounded > timestamps] = rounded[rounded > timestamps] - 86400 # floor: use minus
return(rounded)
timestamps[na_timestamps] = NA # Overwrite with NA again
}