Match two tables based on a time difference criterion

I have a data table (lv_timest) with a timestamp every 3 hours for each date:

# A tibble: 6 × 5
     LV0_mean LV1_mean  LV2_mean Date_time           Date      
     <dbl>    <dbl>     <dbl>    <S3:POSIXct>        <date>    
1    0.778    -4.12     0.736    2016-12-28 00:00:00 2016-12-28
2    0.376    -0.234    0.388    2016-12-28 03:00:00 2016-12-28
3    0.409    1.46      0.241    2016-12-28 06:00:00 2016-12-28
4    0.760    2.07      0.460    2016-12-28 09:00:00 2016-12-28
5    0.759    2.91      0.735    2016-12-28 12:00:00 2016-12-28
6    0.857    3.00      0.803    2016-12-28 15:00:00 2016-12-28

From it, I would like to extract the rows whose timestamps are as close as possible to those in another table (event_timest):

# A tibble: 6 × 4
   Event_number Date_time           Date       Date_time_new
   <int>        <S3: POSIXct>       <date>     <S3: POSIXct>
1  75           2016-12-28 08:00:00 2016-12-28 2016-12-28 08:00:00
2  123          2016-12-30 14:02:00 2016-12-30 2016-12-30 14:00:00
3  264          2017-01-07 06:12:00 2017-01-07 2017-01-07 06:00:00
4  317          2017-01-09 10:59:00 2017-01-09 2017-01-09 11:00:00
5  318          2017-01-09 13:31:00 2017-01-09 2017-01-09 14:00:00
6  369          2017-01-11 07:24:00 2017-01-11 2017-01-11 07:00:00

For example, for row 1 of event_timest, I would extract row 4 from lv_timest:

Event_number Date_time.x         Date.x     Date_time_new       LV0_mean LV1_mean LV2_mean Date_time.y         Date.y
<int>        <S3: POSIXct>       <date>     <S3: POSIXct>       <dbl>    <dbl>    <dbl>    <S3: POSIXct>       <date>
75           2016-12-28 08:00:00 2016-12-28 2016-12-28 08:00:00 0.760    2.07     0.460    2016-12-28 09:00:00 2016-12-28

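In fact, the time difference should never be more than 1 hour. I thought of using the fuzzyjoin package for this, and wrote a function that computes the time difference between the timestamps of the two tables in hours. However, fuzzy_inner_join duplicates rows of the second table, matching each of them with several timestamps from the first table.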

require(lubridate)
require(fuzzyjoin)

diff_timest <- function(x, y){abs(x%--%y %/% hours(1)) <= 1} # time interval as hours ≤ 1 hour

match_timest <- fuzzy_inner_join(event_timest, lv_timest,
                                 by = c("Date" = "Date",
                                        "Date_time_new" = "Date_time"),
                                 match_fun = list(`==`, diff_timest))
head(match_timest)

# A tibble: 6 × 9
  Event_number Date_time.x         Date.x     Date_time_new       LV0_mean LV1_mean LV2_mean Date_time.y         Date.y    
         <int> <dttm>              <date>     <dttm>                 <dbl>    <dbl>    <dbl> <dttm>              <date>    
1           75 2016-12-28 08:00:00 2016-12-28 2016-12-28 08:00:00   0.760     2.07     0.460 2016-12-28 09:00:00 2016-12-28
2          123 2016-12-30 14:02:00 2016-12-30 2016-12-30 14:00:00   1.24      1.83     2.05  2016-12-30 15:00:00 2016-12-30
3          264 2017-01-07 06:12:00 2017-01-07 2017-01-07 06:00:00  -0.128    -5.43     2.72  2017-01-07 06:00:00 2017-01-07
4          317 2017-01-09 10:59:00 2017-01-09 2017-01-09 11:00:00  -0.0751    0.171    2.56  2017-01-09 09:00:00 2017-01-09
5          317 2017-01-09 10:59:00 2017-01-09 2017-01-09 11:00:00  -0.204    -0.797    2.28  2017-01-09 12:00:00 2017-01-09
6          318 2017-01-09 13:31:00 2017-01-09 2017-01-09 14:00:00  -0.204    -0.797    2.28  2017-01-09 12:00:00 2017-01-09

Is there another way to do this?

I would suggest using a standard join, followed by a grouped filter that keeps the closest instance of each timestamp:

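library(tidyverse)
library(lubridate)

match_timest <- event_timest %>%
  # pair each event with every measurement taken on the same date
  inner_join(lv_timest, by = "Date") %>%
  # absolute difference between event time and measurement time, in hours
  mutate(diff = abs(as.numeric(Date_time.x - Date_time.y, units = "hours"))) %>%
  group_by(Date_time.y) %>%
  # within each group keep only the closest pair, and only if it is within 1 hour
  filter(diff <= 1 & diff == min(diff)) %>%
  ungroup() %>%
  select(!diff)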

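Notes:

  • If several rows share exactly the same shortest difference to the index timestamp, this will still match multiple rows.
  • This will not match timestamps across different dates; for example, 23:59:59 on 1/1/22 will not be matched with 00:00:00 on 1/2/22. If you want that, you can use a full Cartesian join (full_join(lv_timest, by = character())) instead of the inner_join() above.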
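
To illustrate both notes, here is a minimal sketch on made-up data. The `events` and `lv` tibbles and their column names are hypothetical stand-ins, not the tables from the question. It assumes dplyr 1.1.0 or later, where cross_join() is available, groups by event rather than by measurement timestamp, and uses slice_min(with_ties = FALSE) so that each event keeps at most one row even when two measurements are equally close:

```r
library(dplyr)

# Hypothetical toy data, for illustration only
events <- tibble(
  Event_number = c(1L, 2L, 3L),
  event_time = as.POSIXct(c("2022-01-01 23:30:00",   # closest reading is on the next day
                            "2022-01-02 04:30:00",   # equally close to two readings (tie)
                            "2022-01-02 09:00:00"),  # no reading within 1 hour
                          tz = "UTC")
)
lv <- tibble(
  lv_time = as.POSIXct(c("2022-01-01 21:00:00", "2022-01-02 00:00:00",
                         "2022-01-02 04:00:00", "2022-01-02 05:00:00",
                         "2022-01-02 12:00:00"), tz = "UTC")
)

matched <- events %>%
  # all event/measurement pairs, regardless of date
  cross_join(lv) %>%
  mutate(diff_hours = abs(as.numeric(difftime(event_time, lv_time, units = "hours")))) %>%
  group_by(Event_number) %>%
  # keep exactly one closest measurement per event, even on ties
  slice_min(diff_hours, n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  # discard events whose closest measurement is more than 1 hour away
  filter(diff_hours <= 1)

matched
```

With this toy data, event 1 is paired with a reading on the following date, event 2 keeps a single row despite the tie, and event 3 is dropped because nothing lies within one hour.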

Joining is always a procedure of first getting all combinations of all rows and then filtering them. We can do this manually:

library(tidyverse)
library(lubridate)
#> 
#> Attaching package: 'lubridate'
#> The following objects are masked from 'package:base':
#> 
#>     date, intersect, setdiff, union

datetimes_a <- tibble(
  id = seq(3),
  group = "A",
  datetime = c("2016-12-28 00:00:00", "2016-12-28 03:00:00", "2016-12-28 23:59:59") %>% as.POSIXct()
)
datetimes_b <- tibble(
  id = seq(3),
  group = "B",
  datetime = c("2016-12-28 00:00:10", "2016-12-28 03:20:00", "2016-12-29 00:00:02") %>% as.POSIXct()
)

datetimes_a %>%
  # start with cross product of all possible pairs
  expand_grid(datetimes_b %>% rename_all(~ paste0(.x, "_b"))) %>%
  mutate(diff = abs(datetime - datetime_b)) %>%
  # get shortest time difference for each row of datetimes_a
  group_by(id) %>%
  arrange(diff) %>%
  slice(1) %>%
  # time diff must be less than 1hr
  filter(diff < hours(1))
#> # A tibble: 3 x 7
#> # Groups:   id [3]
#>      id group datetime             id_b group_b datetime_b          diff     
#>   <int> <chr> <dttm>              <int> <chr>   <dttm>              <drtn>   
#> 1     1 A     2016-12-28 00:00:00     1 B       2016-12-28 00:00:10   10 secs
#> 2     2 A     2016-12-28 03:00:00     2 B       2016-12-28 03:20:00 1200 secs
#> 3     3 A     2016-12-28 23:59:59     3 B       2016-12-29 00:00:02    3 secs

Created on 2022-02-08 by the reprex package (v2.0.1)

This also works if the closest time point falls on a different date, e.g. around midnight.
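
If building the joined table is undesirable, the same nearest-match idea can be sketched in base R alone. The helper name `nearest_within` and the data below are hypothetical, chosen for illustration. It still compares every pair of timestamps, but it only returns one index per event instead of materialising all combinations:

```r
# Hypothetical helper: for each element of `x`, return the index of the nearest
# element of `ref`, or NA if even the nearest one is more than `max_hours` away.
nearest_within <- function(x, ref, max_hours = 1) {
  vapply(seq_along(x), function(i) {
    d <- abs(as.numeric(difftime(ref, x[i], units = "hours")))
    j <- which.min(d)  # first minimum, so ties resolve to the earlier row
    if (d[j] <= max_hours) j else NA_integer_
  }, integer(1))
}

# Made-up measurement times every 3 hours, ending at midnight of the next day
lv_time <- as.POSIXct("2016-12-28 00:00:00", tz = "UTC") + 3 * 3600 * (0:8)
event_time <- as.POSIXct(c("2016-12-28 08:00:00",
                           "2016-12-28 23:30:00",
                           "2016-12-28 04:30:00"), tz = "UTC")

idx <- nearest_within(event_time, lv_time)
data.frame(event_time, matched_lv_time = lv_time[idx])
```

Here the 08:00 event maps to the fourth measurement (09:00), the 23:30 event maps to the reading at midnight of the next day, and the 04:30 event gets NA because both neighbouring readings are 1.5 hours away.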