Merge two datasets based on multiple columns but make the time column flexible within a range of minutes
I have a dataset that looks like this:
id | date     | social_id | race | age | time       | Location
1  | 04/02/19 | 2000001   | W    | 29  | "04:10:05" | HA
2  | 04/06/20 | 2000002   | B    | 22  | "05:12:49" | CA
3  | 04/12/20 | 2000021   | B    | 26  | "09:13:32" | MA
4  | 08/14/20 | 2000026   | A    | 29  | "06:12:34" | VT
The second dataset looks like this:
id2 | date     | social_id | race | age | time       | sex
1   | 04/02/19 | 2000001   | W    | 29  | "04:30:05" | M
2   | 04/06/20 | 2000002   | B    | 22  | "05:49:49" | F
3   | 04/12/20 | 2000021   | B    | 26  | "10:13:32" | M
4   | 08/14/20 | 2000026   | A    | 29  | "06:19:54" | F
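Notice that all the shared columns are identical except for time. I want to join on the columns date, social_id, race, age and time, but the times in the two datasets do not match, so an exact join like this one fails to pair the rows: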
df3 <- df1 %>% left_join(df2,by=c("date","social_id","race","time"))
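Is there a way to join on multiple columns but treat time as a match whenever the two values are within 45 minutes of each other? The time is stored as a string, so I compare the values by writing: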
abs(difftime(as.POSIXct(strptime(df1$time,format="%H:%M:%S")), as.POSIXct(strptime(df2$time,format="%H:%M:%S")),units = "mins")) <= 45
This works on its own and correctly identifies whether two time strings are within 45 minutes of each other. How do I combine it with the merge?
structure(list(id = 1:4, date = c("4/2/2019", "4/6/2020", "4/12/2020",
"8/14/2020"), race = c("w", "b", "b", "a"), age = c(29L, 22L,
26L, 29L), time = structure(c(15005L, 18769L, 33212L, 22354L), class =
"ITime")), row.names = c(NA,
-4L), class = "data.frame")
structure(list(id2 = 1:4, date = c("4/2/2019", "4/6/2020", "4/12/2020",
"8/14/2020"), race = c("w", "b", "b", "a"), age = c(29L, 22L,
26L, 29L), time = structure(c(16205L, 20989L, 36812L, 22794L), class =
"ITime")), row.names = c(NA,
-4L), class = "data.frame")
We can use round_date from lubridate:
library(dplyr)
library(lubridate)
library(stringr)

# Build a datetime from date + time in both data frames, round it to
# 45-minute units, then join on the rounded value plus the other keys.
df1 %>%
  mutate(datetime = round_date(mdy_hms(str_c(date, time, sep = ' ')),
                               '45 mins')) %>%
  left_join(df2 %>%
              mutate(datetime = round_date(mdy_hms(str_c(date, time, sep = ' ')),
                                           '45 mins')),
            by = c('datetime', 'id' = 'id2', 'race', 'age'))
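One caveat with rounding: it groups times into fixed bins, so two times only a few minutes apart can still land in different bins, and the result is not exactly "within 45 minutes". As an alternative, here is a minimal sketch that joins on the exact key columns first and then keeps only the pairs whose times differ by at most 45 minutes. It assumes time is held as integer seconds since midnight (with the ITime columns from the dput output, as.integer(time) should give the same numbers), and the names max_gap_secs, gap_secs, and matched, and the suffixes _1 and _2, are made up for illustration. social_id is left out only because the dput data does not include it.

```r
library(dplyr)

# Sample data from the question's dput output. `time` is written as plain
# integer seconds since midnight (an assumption, so that the sketch does not
# depend on data.table's ITime class).
df1 <- data.frame(
  id   = 1:4,
  date = c("4/2/2019", "4/6/2020", "4/12/2020", "8/14/2020"),
  race = c("w", "b", "b", "a"),
  age  = c(29L, 22L, 26L, 29L),
  time = c(15005L, 18769L, 33212L, 22354L)
)
df2 <- data.frame(
  id2  = 1:4,
  date = c("4/2/2019", "4/6/2020", "4/12/2020", "8/14/2020"),
  race = c("w", "b", "b", "a"),
  age  = c(29L, 22L, 26L, 29L),
  time = c(16205L, 20989L, 36812L, 22794L)
)

# Hypothetical tolerance: 45 minutes expressed in seconds.
max_gap_secs <- 45 * 60

matched <- df1 %>%
  # Exact match on every key except time; both time columns are kept
  # and renamed time_1 / time_2 by the suffix argument.
  inner_join(df2, by = c("date", "race", "age"), suffix = c("_1", "_2")) %>%
  # Absolute gap between the two times, in seconds.
  mutate(gap_secs = abs(time_1 - time_2)) %>%
  # Keep only the pairs that fall within the tolerance.
  filter(gap_secs <= max_gap_secs)

print(matched)
```

On the sample data this keeps ids 1, 2 and 4 and drops id 3, whose two times are a full hour apart. Because the filter runs after the join, rows of df1 with no partner inside the window disappear; if they need to be kept, matched can be left-joined back onto df1 by id.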