How to merge by two columns while aggregating one of them
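I'm struggling to work out how to merge using two columns. I have a data frame with measurements of how much palette was used on certain dates, and another data frame with the distances traveled by cars. I need to merge the two: the join key is the car, and the value to attach is the sum of that car's distances up to the date on which the palette measurement was taken.
Here is a toy example: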
#palette measure dataframe
measure = data.frame(car = c("A", "A", "A", "B"), data1 = c("20-09-2020", "15-10-2020", "13-05-2021", "20-10-2021"), palette = c(5,4,3,5))
#> measure
# car data1 palette
#1 A 20-09-2020 5
#2 A 15-10-2020 4
#3 A 13-05-2021 3
#4 B 20-10-2021 5
#the distance dataframe
dist_ = data.frame(car = c("A", "C", "B", "A", "A", "A"), data2 = c("20-09-2020", "14-05-2020", "20-10-2021", "10-01-2021", "11-01-2021", "13-01-2021"), distance = c(10, 20, 10, 5, 3,8))
#> dist_
# car data2 distance
#1 A 20-09-2020 10
#2 C 14-05-2020 20
#3 B 20-10-2021 10
#4 A 10-01-2021 5
#5 A 11-01-2021 3
#6 A 13-01-2021 8
#for result I'd like something like
# car data1 palette distance
#1 A 20-09-2020 5 10
#2 A 15-10-2020 4 0
#3 A 13-05-2021 3 16
#4 B 20-10-2021 5 10
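Note that the distance is summed up to each date on which I have a palette measurement, counting only the trips since the previous measurement (which is why row 3 shows 16 rather than 26). So I can say that a car traveled 16 km and its palette is 3 cm.
I thought I could use something like merge(x = measure, y = dist_, by.x = c("car", "data1"), by.y = c("car", "data2"), all.x = TRUE), but I don't know how to compute the distance values up to the palette measurement date for a specific car.
Any hints on how I can do this?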
Something like this should work:
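library(tidyverse)
library(lubridate)

# Pair every measurement with every distance record of the same car
result <- left_join(measure, dist_, by = c("car")) %>%
  mutate(across(c("data1", "data2"), dmy)) %>%  # parse day-month-year strings into dates
  filter(data1 >= data2) %>%                    # keep trips on or before the measurement
  group_by(car, data2) %>%
  mutate(threshold = min(data1)) %>%            # first measurement on or after each trip
  ungroup() %>%
  filter(data1 == threshold) %>%                # assign each trip to that measurement only
  group_by(car, data1, palette) %>%
  summarise(distance = sum(distance))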
result
# A tibble: 3 x 4
# Groups: car, data1 [3]
car data1 palette distance
<chr> <date> <dbl> <dbl>
1 A 2020-09-20 5 10
2 A 2021-05-13 3 16
3 B 2021-10-20 5 10
If you want to keep the unmatched rows, you can join the result back onto measure like this:
result.final <- measure %>%
  mutate(data1 = dmy(data1)) %>%
  left_join(result, by = c("car", "data1", "palette"))
result.final
car data1 palette distance
1 A 2020-09-20 5 10
2 A 2020-10-15 4 NA
3 A 2021-05-13 3 16
4 B 2021-10-20 5 10
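For comparison, the same rule can also be sketched in base R without any joins. This is only a minimal sketch under a few assumptions: the names measure2, dist2 and distance_for_row are hypothetical (chosen so the objects above are left untouched), and each trip is assumed to count towards the first measurement of its car on or after the trip date. Because sum() of an empty selection is 0, the row with no matching trips gets 0, as in the desired output of the question, rather than NA.

```r
# Copies of the toy data from the question, with dates parsed as Date
# (measure2 and dist2 are hypothetical names used only in this sketch).
measure2 <- data.frame(
  car     = c("A", "A", "A", "B"),
  data1   = as.Date(c("20-09-2020", "15-10-2020", "13-05-2021", "20-10-2021"),
                    format = "%d-%m-%Y"),
  palette = c(5, 4, 3, 5),
  stringsAsFactors = FALSE
)
dist2 <- data.frame(
  car      = c("A", "C", "B", "A", "A", "A"),
  data2    = as.Date(c("20-09-2020", "14-05-2020", "20-10-2021",
                       "10-01-2021", "11-01-2021", "13-01-2021"),
                     format = "%d-%m-%Y"),
  distance = c(10, 20, 10, 5, 3, 8),
  stringsAsFactors = FALSE
)

# Hypothetical helper: distance driven by the car of measurement row i
# after its previous measurement and up to (and including) this one.
distance_for_row <- function(i) {
  car  <- measure2$car[i]
  date <- measure2$data1[i]
  # Earlier measurement dates of the same car
  earlier <- measure2$data1[measure2$car == car & measure2$data1 < date]
  # Trips of this car on or before the measurement date
  in_window <- dist2$car == car & dist2$data2 <= date
  if (length(earlier) > 0) {
    # Drop trips already covered by the previous measurement
    in_window <- in_window & dist2$data2 > max(earlier)
  }
  sum(dist2$distance[in_window])  # 0 when no trip falls in the window
}

measure2$distance <- vapply(seq_len(nrow(measure2)), distance_for_row, numeric(1))
measure2
```

This loops over the measurement rows, so it is fine for small tables; for large data the join-based pipeline above is likely the better fit.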