Match the closest time stamp within a group in R

Question

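Suppose I have two datasets. Both share a common variable, location. Dataset A has timestamps with second precision, while dataset B has timestamps with millisecond precision. Is there an efficient way, in R or Python, to match the two datasets by time interval for each location (for example, to get the most recent weather for each row of dataset A)?

Any ideas or comments are much appreciated.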

Example of dataset A:

Location	Date	Time	# items
New York	2019-01-01	09:00:00	50
New York	2019-01-01	09:15:28	10
New York	2019-01-01	09:16:16	69
New York	2019-01-01	10:09:00	47
New York	2019-01-11	19:34:30	777
New York	2019-01-11	22:10:15	276
...
Miami	2019-01-01	09:00:01	100
Miami	2019-01-01	16:07:09	145
Miami	2019-01-01	20:05:01	56
...
Boston	2020-12-21	23:09:02	78

Example of dataset B:

Location	Date	Time	Weather
New York	2019-01-01	05:56:09.456	Rain
New York	2019-01-01	08:59:23.897	Sunny
New York	2019-01-01	09:14:35.897	Cloudy
...
Boston	2020-12-31	23:25:09.987	Snow

The ideal output would be:

Location	Date	Time	# items	Weather time	Weather
New York	2019-01-01	09:00:00	50	08:59:23.897	Sunny
New York	2019-01-01	09:15:28	10	09:14:35.897	Cloudy
New York	2019-01-01	09:16:16	69	09:14:35.897	Cloudy
...

Answer 1

If your data does not have a huge number of Location-Date matches, here is a brute-force approach that might work.

library(dplyr); library(lubridate)

# add timestamp to both
Data_A <- Data_A %>% mutate(timestamp = ymd_hms(paste(Date, Time)))
Data_B <- Data_B %>% mutate(timestamp = ymd_hms(paste(Date, Time)))

# join the two tables
Data_A %>%
  left_join(Data_B, by = c("Location", "Date")) %>%

  # calc time diffs and keep the closest Data_B row for each Data_A row
  # (a Data_A row is identified by Location and its own timestamp)
  mutate(time_diff = abs(timestamp.x - timestamp.y)) %>%
  group_by(Location, timestamp.x) %>%
  arrange(time_diff) %>%
  slice(1) %>%
  ungroup()
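
Since the question also mentions Python, the same brute-force idea can be sketched with the standard library only. This is a minimal sketch, not a translation of the dplyr pipeline: the names `parse_ts`, `closest_matches`, `data_a` and `data_b` are made up for illustration, and the rows are a subset of the sample data from the question.

```python
from datetime import datetime


def parse_ts(date, time):
    """Combine date and time strings; fractional seconds are optional."""
    fmt = "%Y-%m-%d %H:%M:%S.%f" if "." in time else "%Y-%m-%d %H:%M:%S"
    return datetime.strptime(f"{date} {time}", fmt)


# Subset of the sample rows from the question (hypothetical variable names).
data_a = [
    ("New York", "2019-01-01", "09:00:00", 50),
    ("New York", "2019-01-01", "09:15:28", 10),
    ("New York", "2019-01-01", "09:16:16", 69),
    ("Miami", "2019-01-01", "09:00:01", 100),
]
data_b = [
    ("New York", "2019-01-01", "05:56:09.456", "Rain"),
    ("New York", "2019-01-01", "08:59:23.897", "Sunny"),
    ("New York", "2019-01-01", "09:14:35.897", "Cloudy"),
]


def closest_matches(rows_a, rows_b):
    """For each row of A, pick the row of B with the same location and date
    whose timestamp has the smallest absolute difference."""
    matched = []
    for location, date, time, items in rows_a:
        ts_a = parse_ts(date, time)
        # Every B row sharing Location and Date is a candidate (the "join").
        candidates = [
            (abs(parse_ts(b_date, b_time) - ts_a), b_time, weather)
            for b_location, b_date, b_time, weather in rows_b
            if b_location == location and b_date == date
        ]
        if candidates:
            _, weather_time, weather = min(candidates, key=lambda c: c[0])
        else:
            weather_time, weather = None, None  # no weather for this location/date
        matched.append((location, date, time, items, weather_time, weather))
    return matched


matches = closest_matches(data_a, data_b)
for row in matches:
    print(row)
```

Note that "closest" uses the absolute time difference, so a weather reading taken shortly after a row of dataset A can win over an earlier one. If only the most recent earlier reading is wanted, the candidates would have to be restricted to timestamps at or before the row's timestamp.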

Answer 2

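If I understand correctly, dataset A is to be completed with the latest available weather data for each Location from dataset B.

This can be achieved with a rolling join and an update by reference: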

library(data.table)
setDT(A)[, dttm := lubridate::ymd_hms(paste(Date, Time))]
setDT(B)[, dttm := lubridate::ymd_hms(paste(Date, Time))]
A[, c("WeatherTime", "Weather") := 
    B[A, on = c("Location", "dttm"), roll = Inf, .(x.dttm, x.Weather)]][]

    Location       Date     Time # items                dttm         WeatherTime Weather
 1: New York 2019-01-01 09:00:00      50 2019-01-01 09:00:00 2019-01-01 08:59:23   Sunny
 2: New York 2019-01-01 09:15:28      10 2019-01-01 09:15:28 2019-01-01 09:14:35  Cloudy
 3: New York 2019-01-01 09:16:16      69 2019-01-01 09:16:16 2019-01-01 09:14:35  Cloudy
 4: New York 2019-01-01 10:09:00      47 2019-01-01 10:09:00 2019-01-01 09:14:35  Cloudy
 5: New York 2019-01-11 19:34:30     777 2019-01-11 19:34:30 2019-01-01 09:14:35  Cloudy
 6: New York 2019-01-11 22:10:15     276 2019-01-11 22:10:15 2019-01-01 09:14:35  Cloudy
 7:    Miami 2019-01-01 09:00:01     100 2019-01-01 09:00:01                <NA>    <NA>
 8:    Miami 2019-01-01 16:07:09     145 2019-01-01 16:07:09                <NA>    <NA>
 9:    Miami 2019-01-01 20:05:01      56 2019-01-01 20:05:01                <NA>    <NA>
10:   Boston 2020-12-21 23:09:02      78 2020-12-21 23:09:02                <NA>    <NA>

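Note that weather data for Miami is missing. The weather data for Boston provided in the sample dataset is ten days too late, so no earlier reading is available to roll forward.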

Data

A <- structure(list(Location = c("New York", "New York", "New York", 
"New York", "New York", "New York", "Miami", "Miami", "Miami", 
"Boston"), Date = structure(c(17897L, 17897L, 17897L, 17897L, 
17907L, 17907L, 17897L, 17897L, 17897L, 18617L), class = c("IDate", 
"Date")), Time = c("09:00:00", "09:15:28", "09:16:16", "10:09:00", 
"19:34:30", "22:10:15", "09:00:01", "16:07:09", "20:05:01", "23:09:02"
), `# items` = c(50L, 10L, 69L, 47L, 777L, 276L, 100L, 145L, 
56L, 78L)), row.names = c(NA, -10L), class = "data.frame")

B <- structure(list(Location = c("New York", "New York", "New York", 
"Boston"), Date = structure(c(17897L, 17897L, 17897L, 18627L), class = c("IDate", 
"Date")), Time = c("05:56:09.456", "08:59:23.897", "09:14:35.897", 
"23:25:09.987"), Weather = c("Rain", "Sunny", "Cloudy", "Snow"
)), row.names = c(NA, -4L), class = "data.frame")

Explanation

Date and Time are combined into a continuous POSIXct datetime for joining. This avoids gaps caused by date changes.
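
To illustrate why the continuous datetime matters, here is a small Python sketch using the standard library only. The rows are invented for the illustration (they are not part of the sample data), and `to_datetime` is a hypothetical helper: an observation taken just after midnight finds no earlier reading when the match is restricted to the same calendar date, but does find one when full datetimes are compared.

```python
from datetime import datetime


def to_datetime(date, time):
    """Combine separate date and time strings into one datetime value."""
    fmt = "%Y-%m-%d %H:%M:%S.%f" if "." in time else "%Y-%m-%d %H:%M:%S"
    return datetime.strptime(f"{date} {time}", fmt)


# Invented rows: one observation shortly after midnight, two weather readings.
observation = ("2019-01-02", "00:00:05")
readings = [
    ("2019-01-01", "23:59:50.250", "Rain"),
    ("2019-01-02", "06:00:00.000", "Sunny"),
]

obs_ts = to_datetime(*observation)

# Matching on the calendar date first: nothing earlier exists on that date.
same_day_earlier = [
    r for r in readings
    if r[0] == observation[0] and to_datetime(r[0], r[1]) <= obs_ts
]

# Matching on the continuous datetime: the reading from the previous
# evening is found, even though its Date column differs.
any_day_earlier = [r for r in readings if to_datetime(r[0], r[1]) <= obs_ts]
latest = max(any_day_earlier, key=lambda r: to_datetime(r[0], r[1]))

print(same_day_earlier)
print(latest)
```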

The rolling join

B[A, on = c("Location", "dttm"), roll = Inf, .(x.dttm, x.Weather)]

returns

                 x.dttm x.Weather
 1: 2019-01-01 08:59:23     Sunny
 2: 2019-01-01 09:14:35    Cloudy
 3: 2019-01-01 09:14:35    Cloudy
 4: 2019-01-01 09:14:35    Cloudy
 5: 2019-01-01 09:14:35    Cloudy
 6: 2019-01-01 09:14:35    Cloudy
 7:                <NA>      <NA>
 8:                <NA>      <NA>
 9:                <NA>      <NA>
10:                <NA>      <NA>

The update by reference (c("WeatherTime", "Weather") := ...) appends the two new columns to dataset A without copying the whole object. This may help to ease resource constraints.
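
For readers who want the Python side of the question, the idea behind the rolling join (carry the last earlier observation forward, per Location) can be sketched with a sorted list and a binary search. This is a standard-library sketch under stated assumptions, not data.table's implementation: `parse_ts`, `rolling_join`, `data_a` and `data_b` are hypothetical names, and the rows are the sample data from above.

```python
from bisect import bisect_right
from datetime import datetime


def parse_ts(date, time):
    """Combine date and time strings; fractional seconds are optional."""
    fmt = "%Y-%m-%d %H:%M:%S.%f" if "." in time else "%Y-%m-%d %H:%M:%S"
    return datetime.strptime(f"{date} {time}", fmt)


data_a = [
    ("New York", "2019-01-01", "09:00:00", 50),
    ("New York", "2019-01-01", "09:15:28", 10),
    ("New York", "2019-01-01", "09:16:16", 69),
    ("New York", "2019-01-01", "10:09:00", 47),
    ("New York", "2019-01-11", "19:34:30", 777),
    ("New York", "2019-01-11", "22:10:15", 276),
    ("Miami", "2019-01-01", "09:00:01", 100),
    ("Miami", "2019-01-01", "16:07:09", 145),
    ("Miami", "2019-01-01", "20:05:01", 56),
    ("Boston", "2020-12-21", "23:09:02", 78),
]
data_b = [
    ("New York", "2019-01-01", "05:56:09.456", "Rain"),
    ("New York", "2019-01-01", "08:59:23.897", "Sunny"),
    ("New York", "2019-01-01", "09:14:35.897", "Cloudy"),
    ("Boston", "2020-12-31", "23:25:09.987", "Snow"),
]


def rolling_join(rows_a, rows_b):
    """Attach to each row of A the latest B reading for the same location
    whose timestamp is at or before the A timestamp (None if there is none)."""
    # Group B by location and sort each group by timestamp once.
    grouped = {}
    for location, date, time, weather in rows_b:
        grouped.setdefault(location, []).append((parse_ts(date, time), weather))
    index = {}
    for location, readings in grouped.items():
        readings.sort(key=lambda r: r[0])
        index[location] = ([ts for ts, _ in readings], [w for _, w in readings])

    joined = []
    for location, date, time, items in rows_a:
        times, weathers = index.get(location, ([], []))
        # Number of readings with timestamp <= the A timestamp.
        pos = bisect_right(times, parse_ts(date, time))
        if pos:
            joined.append((location, date, time, items, times[pos - 1], weathers[pos - 1]))
        else:
            joined.append((location, date, time, items, None, None))
    return joined


result = rolling_join(data_a, data_b)
for row in result:
    print(row)
```

Sorting each location's readings once and using a binary search per row avoids comparing every row of A with every row of B, which is the main cost of the brute-force approach.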
