How to join tibbles/dataframes with different numbers of rows by ID and a date/time interval?

The two datasets are illustrated below:

library(lubridate)
library(tidyverse)

#dataset 1

id <- c("A_1", "A_1", "A_1", "A_1", "A_1", "A_2", "A_2", "A_2", "A_2", 
        "A_2", "B_1", "B_1", "B_1", "B_1", "B_1", "B_2", "B_2", "B_2", "B_2", 
        "B_2")
date <- ymd_hms(c("2017-11-26 09:00:00", "2017-11-26 09:05:00", "2017-11-30 09:00:00", 
                  "2017-11-30 09:05:00", "2017-12-02 09:00:00", "2017-11-26 09:00:00", "2017-11-26 09:05:00", "2017-11-30 09:00:00", 
                  "2017-11-30 09:05:00", "2017-12-02 09:00:00", "2017-11-26 09:00:00", "2017-11-26 09:05:00", "2017-11-30 09:00:00", 
                  "2017-11-30 09:05:00", "2017-12-02 09:00:00", "2017-11-26 09:00:00", "2017-11-26 09:05:00", "2017-11-30 09:00:00", 
                  "2017-11-30 09:05:00", "2017-12-02 09:00:00"))    

df <- tibble(id, date)

# A tibble: 20 x 2
   id    date               
   <chr> <dttm>             
 1 A_1   2017-11-26 09:00:00
 2 A_1   2017-11-26 09:05:00
 3 A_1   2017-11-30 09:00:00
 4 A_1   2017-11-30 09:05:00
 5 A_1   2017-12-02 09:00:00
 6 A_2   2017-11-26 09:00:00
 7 A_2   2017-11-26 09:05:00
 8 A_2   2017-11-30 09:00:00
 9 A_2   2017-11-30 09:05:00
10 A_2   2017-12-02 09:00:00
11 B_1   2017-11-26 09:00:00
12 B_1   2017-11-26 09:05:00
13 B_1   2017-11-30 09:00:00
14 B_1   2017-11-30 09:05:00
15 B_1   2017-12-02 09:00:00
16 B_2   2017-11-26 09:00:00
17 B_2   2017-11-26 09:05:00
18 B_2   2017-11-30 09:00:00
19 B_2   2017-11-30 09:05:00
20 B_2   2017-12-02 09:00:00

#dataset 2

id <- c("A", "A", "B", "B")
date <- ymd_hms(c("2017-11-26 09:01:30", "2017-11-30 09:06:40", "2017-11-30 09:04:50", "2017-12-02 09:01:00"))
variable1 <- c("67", "30", "28", "90")
variable2 <- c("x","y","z", "w")
df2 <- tibble(id, date, variable1, variable2)

# A tibble: 4 x 4
  id    date                variable1 variable2
  <chr> <dttm>              <chr>     <chr>    
1 A     2017-11-26 09:01:30 67        x        
2 A     2017-11-30 09:06:40 30        y        
3 B     2017-11-30 09:04:50 28        z        
4 B     2017-12-02 09:01:00 90        w        

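I first need to group by "id" and then by "date and time". After that, I need to pull the columns of dataset 2 into dataset 1 at the nearest time, creating new columns in dataset 1 (condition: each dataset 2 row is joined to the preceding time in dataset 1, at most 5 minutes earlier).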

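However, each "id" in dataset 2 occurs 50 times in dataset 1, so a row in dataset 2 may find a corresponding time in dataset 1 up to 50 times for the same date. For each "id", I need this "extraction" to be performed once for every corresponding time, however many there are.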

The resulting dataset would look like this:

df_output
# A tibble: 20 x 5
   id    date                date2               variable1 variable2
   <chr> <dttm>              <chr>               <chr>     <chr>    
 1 A_1   2017-11-26 09:00:00 2017-11-26 09:01:30 67        x        
 2 A_1   2017-11-26 09:05:00 NA                  NA        NA       
 3 A_1   2017-11-30 09:00:00 NA                  NA        NA       
 4 A_1   2017-11-30 09:05:00 2017-11-30 09:06:40 30        y        
 5 A_1   2017-12-02 09:00:00 NA                  NA        NA       
 6 A_2   2017-11-26 09:00:00 2017-11-26 09:01:30 67        x        
 7 A_2   2017-11-26 09:05:00 NA                  NA        NA       
 8 A_2   2017-11-30 09:00:00 NA                  NA        NA       
 9 A_2   2017-11-30 09:05:00 2017-11-30 09:06:40 30        y        
10 A_2   2017-12-02 09:00:00 NA                  NA        NA       
11 B_1   2017-11-26 09:00:00 NA                  NA        NA       
12 B_1   2017-11-26 09:05:00 NA                  NA        NA       
13 B_1   2017-11-30 09:00:00 2017-11-30 09:04:50 28        z        
14 B_1   2017-11-30 09:05:00 NA                  NA        NA       
15 B_1   2017-12-02 09:00:00 2017-12-02 09:01:00 90        w        
16 B_2   2017-11-26 09:00:00 NA                  NA        NA       
17 B_2   2017-11-26 09:05:00 NA                  NA        NA       
18 B_2   2017-11-30 09:00:00 2017-11-30 09:04:50 28        z        
19 B_2   2017-11-30 09:05:00 NA                  NA        NA       
20 B_2   2017-12-02 09:00:00 2017-12-02 09:01:00 90        w 

Note: I also need to account for the fact that not every row has a corresponding entry in dataset 2; those rows must be filled with NA.

Thanks in advance.

We can use ceiling_date (together with floor_date) from lubridate to snap each date to its "5 min" interval, and then do a non-equi join with data.table.

library(lubridate)
library(dplyr)
library(data.table)
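# Bracket each df2 timestamp with the bounds of its 5-minute window
df2new <- df2 %>%
   mutate(date2 = ceiling_date(date, "5 min"), 
          date = floor_date(date, "5 min"))
# Strip the "_<digits>" suffix so "A_1" matches "A" (the backslash must be doubled in an R string),
# then update df by reference: rows with floor < date <= ceiling receive df2's columns
setDT(df)[, id2:= trimws(id, whitespace = "_\\d+")][
   setDT(df2new), c('date2', 'variable1', 'variable2') := .(date2,  
    variable1, variable2), on = .(id2 = id, date > date, date <= date2)]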
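The expected output in the question pairs each dataset 2 row with the dataset 1 timestamp at or just before it, and keeps the original dataset 2 time in date2. Below is a minimal sketch of that reading using only dplyr and lubridate. It assumes the dataset 1 timestamps sit exactly on a 5-minute grid, so flooring the dataset 2 time to 5 minutes gives a key that can be matched with an ordinary left join. The names df2_keyed and id2 are helpers introduced here for illustration, and the small tibbles are cut-down stand-ins for df and df2.

```r
library(dplyr)
library(lubridate)
library(tibble)

# Cut-down stand-ins for the question's df and df2 (same columns, fewer rows)
df <- tibble(
  id   = c("A_1", "A_1", "A_2", "B_1"),
  date = ymd_hms(c("2017-11-26 09:00:00", "2017-11-26 09:05:00",
                   "2017-11-26 09:00:00", "2017-11-30 09:00:00"))
)
df2 <- tibble(
  id        = c("A", "B"),
  date      = ymd_hms(c("2017-11-26 09:01:30", "2017-11-30 09:04:50")),
  variable1 = c("67", "28"),
  variable2 = c("x", "z")
)

# Keep the original time as date2 and use the start of its 5-minute bin as the join key
# (assumption: dataset 1 timestamps fall exactly on 5-minute boundaries)
df2_keyed <- df2 %>%
  transmute(id2 = id,
            date2 = date,
            date = floor_date(date, "5 minutes"),
            variable1, variable2)

df_output <- df %>%
  mutate(id2 = sub("_\\d+$", "", id)) %>%        # "A_1" -> "A"
  left_join(df2_keyed, by = c("id2", "date")) %>% # unmatched rows get NA
  select(-id2)

df_output
```

Because this is a left join, every row of dataset 1 is kept, and a dataset 2 row is reused for each sub-id ("A_1", "A_2", ...) that shares its prefix. If two dataset 2 rows fell into the same 5-minute bin for the same id, the join would duplicate the matching dataset 1 rows, so that case would need an extra rule for choosing one.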