How to join tibbles/dataframes with different numbers of rows by ID and a date/time interval?
I've illustrated the two datasets below:
library(lubridate)
library(tidyverse)
#dataset 1
id <- c("A_1", "A_1", "A_1", "A_1", "A_1", "A_2", "A_2", "A_2", "A_2",
"A_2", "B_1", "B_1", "B_1", "B_1", "B_1", "B_2", "B_2", "B_2", "B_2",
"B_2")
date <- ymd_hms(c("2017-11-26 09:00:00", "2017-11-26 09:05:00", "2017-11-30 09:00:00",
"2017-11-30 09:05:00", "2017-12-02 09:00:00", "2017-11-26 09:00:00", "2017-11-26 09:05:00", "2017-11-30 09:00:00",
"2017-11-30 09:05:00", "2017-12-02 09:00:00", "2017-11-26 09:00:00", "2017-11-26 09:05:00", "2017-11-30 09:00:00",
"2017-11-30 09:05:00", "2017-12-02 09:00:00", "2017-11-26 09:00:00", "2017-11-26 09:05:00", "2017-11-30 09:00:00",
"2017-11-30 09:05:00", "2017-12-02 09:00:00"))
df <- tibble(id, date)
# A tibble: 20 x 2
id date
<chr> <dttm>
1 A_1 2017-11-26 09:00:00
2 A_1 2017-11-26 09:05:00
3 A_1 2017-11-30 09:00:00
4 A_1 2017-11-30 09:05:00
5 A_1 2017-12-02 09:00:00
6 A_2 2017-11-26 09:00:00
7 A_2 2017-11-26 09:05:00
8 A_2 2017-11-30 09:00:00
9 A_2 2017-11-30 09:05:00
10 A_2 2017-12-02 09:00:00
11 B_1 2017-11-26 09:00:00
12 B_1 2017-11-26 09:05:00
13 B_1 2017-11-30 09:00:00
14 B_1 2017-11-30 09:05:00
15 B_1 2017-12-02 09:00:00
16 B_2 2017-11-26 09:00:00
17 B_2 2017-11-26 09:05:00
18 B_2 2017-11-30 09:00:00
19 B_2 2017-11-30 09:05:00
20 B_2 2017-12-02 09:00:00
#dataset 2
id <- c("A", "A", "B", "B")
date <- ymd_hms(c("2017-11-26 09:01:30", "2017-11-30 09:06:40", "2017-11-30 09:04:50", "2017-12-02 09:01:00"))
variable1 <- c("67", "30", "28", "90")
variable2 <- c("x","y","z", "w")
df2 <- tibble(id, date, variable1, variable2)
# A tibble: 4 x 4
id date variable1 variable2
<chr> <dttm> <chr> <chr>
1 A 2017-11-26 09:01:30 67 x
2 A 2017-11-30 09:06:40 30 y
3 B 2017-11-30 09:04:50 28 z
4 B 2017-12-02 09:01:00 90 w
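I first need to group by "id", then by "date and time", and then pull the columns of dataset 2 into dataset 1 as new columns, matching each dataset 2 row to the nearest time in dataset 1 (condition: each row is joined to the preceding time, at most 5 minutes earlier).
However, each "id" in dataset 2 appears 50 times in dataset 1, so a given row may find its corresponding time in dataset 1 up to 50 times for the same date. For each "id", I need this "extraction" to be performed as many times as the corresponding time occurs, however often that is.
The resulting dataset would look like this: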
df_output
# A tibble: 20 x 5
id date date2 variable1 variable2
<chr> <dttm> <chr> <chr> <chr>
1 A_1 2017-11-26 09:00:00 2017-11-26 09:01:30 67 x
2 A_1 2017-11-26 09:05:00 NA NA NA
3 A_1 2017-11-30 09:00:00 NA NA NA
4 A_1 2017-11-30 09:05:00 2017-11-30 09:06:40 30 y
5 A_1 2017-12-02 09:00:00 NA NA NA
6 A_2 2017-11-26 09:00:00 2017-11-26 09:01:30 67 x
7 A_2 2017-11-26 09:05:00 NA NA NA
8 A_2 2017-11-30 09:00:00 NA NA NA
9 A_2 2017-11-30 09:05:00 2017-11-30 09:06:40 30 y
10 A_2 2017-12-02 09:00:00 NA NA NA
11 B_1 2017-11-26 09:00:00 NA NA NA
12 B_1 2017-11-26 09:05:00 NA NA NA
13 B_1 2017-11-30 09:00:00 2017-11-30 09:04:50 28 z
14 B_1 2017-11-30 09:05:00 NA NA NA
15 B_1 2017-12-02 09:00:00 2017-12-02 09:01:00 90 w
16 B_2 2017-11-26 09:00:00 NA NA NA
17 B_2 2017-11-26 09:05:00 NA NA NA
18 B_2 2017-11-30 09:00:00 2017-11-30 09:04:50 28 z
19 B_2 2017-11-30 09:05:00 NA NA NA
20 B_2 2017-12-02 09:00:00 2017-12-02 09:01:00 90 w
Note: I also need to account for rows that have no counterpart in dataset 2; those must be filled with NA.
Thanks in advance.
We can use floor_date and ceiling_date from lubridate to bracket each date in a "5 minute" interval, and then do a non-equi join with data.table:
library(lubridate)
library(dplyr)
library(data.table)
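# Bracket each df2 timestamp with its 5-minute floor (date) and ceiling (date2)
df2new <- df2 %>%
  mutate(date2 = ceiling_date(date, "5 min"),
         date = floor_date(date, "5 min"))
# Strip the "_<n>" suffix into id2, then update df by reference via a non-equi join
setDT(df)[, id2 := trimws(id, whitespace = "_\\d+")][
  setDT(df2new), c('date2', 'variable1', 'variable2') := .(date2,
    variable1, variable2), on = .(id2 = id, date > date, date <= date2)]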
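If you would rather stay within dplyr, here is a minimal sketch of an alternative. It assumes (reading the expected output in the question) that a dataset 2 row belongs to the dataset 1 row whose time is the start of the same 5-minute bin, so it floors the dataset 2 time and does an ordinary left join. The names id2, df2_floor and the regex "_\\d+$" are my own choices for illustration, not part of the question.

```r
library(dplyr)
library(lubridate)

# Rebuild the example data from the question
times <- ymd_hms(c("2017-11-26 09:00:00", "2017-11-26 09:05:00",
                   "2017-11-30 09:00:00", "2017-11-30 09:05:00",
                   "2017-12-02 09:00:00"))
df <- tibble(id = rep(c("A_1", "A_2", "B_1", "B_2"), each = 5),
             date = rep(times, times = 4))
df2 <- tibble(id = c("A", "A", "B", "B"),
              date = ymd_hms(c("2017-11-26 09:01:30", "2017-11-30 09:06:40",
                               "2017-11-30 09:04:50", "2017-12-02 09:01:00")),
              variable1 = c("67", "30", "28", "90"),
              variable2 = c("x", "y", "z", "w"))

# Keep the original timestamp as date2 and floor the join key to 5 minutes
# (assumption: df$date always sits on a 5-minute boundary)
df2_floor <- df2 %>%
  transmute(id2 = id,
            date2 = date,
            date = floor_date(date, "5 minutes"),
            variable1, variable2)

# Strip the "_<n>" suffix so "A_1" and "A_2" both match "A";
# left_join keeps all 20 rows of df and fills unmatched rows with NA
df_output <- df %>%
  mutate(id2 = sub("_\\d+$", "", id)) %>%
  left_join(df2_floor, by = c("id2", "date")) %>%
  select(id, date, date2, variable1, variable2)
```

Because this is a plain equality join on the floored time, every repeated "id" in dataset 1 picks up the matching dataset 2 row, and date2 stays a date-time column rather than a character one.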