R: Special case of merging two datasets by date
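I am using R to analyze data from an animal shelter. I have one dataset containing animal intakes and another showing outcomes. I would like to merge these datasets so that each animal's intake information and the corresponding outcome information are on the same row.
Every outcome has a corresponding earlier intake. Some intakes have no outcome, because those animals are still in the system. An animal can cycle through the system multiple times (e.g., an animal is surrendered to the shelter, adopted, returned to the shelter, adopted again, etc.).
The data frames look something like this:
Intakes: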
Animal.ID Intake.Date Intake.Type
A1 2016-01-01 Surrender
A2 2017-01-01 Stray
A1 2018-01-01 Surrender
A3 2019-01-01 Stray
A4 2020-01-01 Seized
A5 2021-01-01 Surrender
Outcomes:
Animal.ID Outcome.Date Outcome.Type
A1 2016-06-30 Adoption
A2 2017-06-30 Euthanasia
A1 2018-06-30 Transfer
A3 2019-06-30 Adoption
A5 2021-06-30 Transfer
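In my example, the animal with Animal.ID "A1" cycled through the system twice. The animal with Animal.ID "A4" has no outcome record, because that animal is still in the shelter's care.
How can I merge (join) the datasets so that the resulting dataset looks like this?
Merged: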
Animal.ID Intake.Date Intake.Type Outcome.Date Outcome.Type
A1 2016-01-01 Surrender 2016-06-30 Adoption
A2 2017-01-01 Stray 2017-06-30 Euthanasia
A1 2018-01-01 Surrender 2018-06-30 Transfer
A3 2019-01-01 Stray 2019-06-30 Adoption
A4 2020-01-01 Seized <NA> <NA>
A5 2021-01-01 Surrender 2021-06-30 Transfer
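I imagine this type of problem has been solved before, but I must not have been using the right terms in my Google searches.
EDIT: The real data contain date/times (not just dates). An outcome may occur within minutes of an intake or months later.
Here is code to create these sample datasets: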
intakes <- data.frame(
Animal.ID = c("A1","A2","A1","A3","A4","A5"),
Intake.Date = as.Date(c("2016-01-01","2017-01-01","2018-01-01","2019-01-01","2020-01-01","2021-01-01")),
Intake.Type = c("Surrender","Stray","Surrender","Stray","Seized","Surrender")
)
outcomes <- data.frame(
Animal.ID = c("A1","A2","A1","A3","A5"),
Outcome.Date = as.Date(c("2016-06-30","2017-06-30","2018-06-30","2019-06-30","2021-06-30")),
Outcome.Type = c("Adoption","Euthanasia","Transfer","Adoption","Transfer")
)
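You need to create a new variable to match on, because the datasets as given do not contain enough information to determine a unique match for each row. It looks like you want to match on Animal.ID and then on the intake year, so I created a new variable, year, matched on both, and then dropped it from the final dataset. Of course, you can build this new variable however you need in order to cover more complicated cases (for example, an intake on December 31, 2020 with an outcome on January 1, 2021).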
library(dplyr)
library(lubridate)
intakes %>%
mutate(year = year(Intake.Date)) %>%
left_join(mutate(outcomes, year = year(Outcome.Date)), by = c("Animal.ID", "year")) %>%
select(-year)
Animal.ID Intake.Date Intake.Type Outcome.Date Outcome.Type
1 A1 2016-01-01 Surrender 2016-06-30 Adoption
2 A2 2017-01-01 Stray 2017-06-30 Euthanasia
3 A1 2018-01-01 Surrender 2018-06-30 Transfer
4 A3 2019-01-01 Stray 2019-06-30 Adoption
5 A4 2020-01-01 Seized <NA> <NA>
6 A5 2021-01-01 Surrender 2021-06-30 Transfer
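To make the year-boundary caveat concrete, here is a minimal sketch of how the year-based key behaves for a stay that crosses New Year. The animal "A6" and the `*_edge` data frames are hypothetical and not part of the question's data; the key is built with base R's `format()` so that only dplyr is needed.

```r
library(dplyr)

# Hypothetical rows (not from the question): a stay that crosses New Year
intakes_edge <- data.frame(
  Animal.ID   = "A6",
  Intake.Date = as.Date("2020-12-31"),
  Intake.Type = "Stray"
)
outcomes_edge <- data.frame(
  Animal.ID    = "A6",
  Outcome.Date = as.Date("2021-01-01"),
  Outcome.Type = "Adoption"
)

# Same idea as the year key above: match on Animal.ID plus calendar year
edge_joined <- intakes_edge %>%
  mutate(year = format(Intake.Date, "%Y")) %>%
  left_join(
    mutate(outcomes_edge, year = format(Outcome.Date, "%Y")),
    by = c("Animal.ID", "year")
  ) %>%
  select(-year)

# The 2021 outcome finds no 2020 intake, so the outcome columns are NA
edge_joined
```

Two stays by the same animal within one calendar year would be a problem in the other direction, since each intake would then match both outcomes.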
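The other answer is correct, but it does not sort by date. I assume that each intake is associated with exactly one outcome, except for the most recent intake of a given animal, which is associated with zero or one outcome.
Therefore, I suggest creating a new variable in each dataset that uniquely numbers each animal's occurrences (Animal.ID.occurrence) and using it in the join.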
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
intakes <- data.frame(
Animal.ID = c("A1","A2","A1","A3","A4","A5"),
Intake.Date = as.Date(c("2016-01-01","2017-01-01","2018-01-01","2019-01-01","2020-01-01","2021-01-01")),
Intake.Type = c("Surrender","Stray","Surrender","Stray","Seized","Surrender")
)
outcomes <- data.frame(
Animal.ID = c("A1","A2","A1","A3","A5"),
Outcome.Date = as.Date(c("2016-06-30","2017-06-30","2018-06-30","2019-06-30","2021-06-30")),
Outcome.Type = c("Adoption","Euthanasia","Transfer","Adoption","Transfer")
)
intakes_occurrence <- intakes %>% group_by(Animal.ID) %>%
arrange(Intake.Date) %>%
mutate(Animal.ID.occurrence = paste0(Animal.ID, ".", row_number())) %>%
ungroup()
outcomes_occurrence <- outcomes %>% group_by(Animal.ID) %>%
arrange(Outcome.Date) %>%
mutate(Animal.ID.occurrence = paste0(Animal.ID, ".", row_number())) %>%
ungroup() %>%
select(-Animal.ID)
intakes_occurrence %>%
full_join(outcomes_occurrence, by="Animal.ID.occurrence") %>%
select(-Animal.ID.occurrence)
#> # A tibble: 6 × 5
#> Animal.ID Intake.Date Intake.Type Outcome.Date Outcome.Type
#> <chr> <date> <chr> <date> <chr>
#> 1 A1 2016-01-01 Surrender 2016-06-30 Adoption
#> 2 A2 2017-01-01 Stray 2017-06-30 Euthanasia
#> 3 A1 2018-01-01 Surrender 2018-06-30 Transfer
#> 4 A3 2019-01-01 Stray 2019-06-30 Adoption
#> 5 A4 2020-01-01 Seized NA <NA>
#> 6 A5 2021-01-01 Surrender 2021-06-30 Transfer
Created on 2021-09-05 by the reprex package (v2.0.1)
Edited to sort by date.
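Since the question's edit mentions that the real data contain date/times, here is a rough sketch of a window-based alternative: each intake is paired with the first outcome that falls at or after it and before the same animal's next intake. The column names Intake.Time, Outcome.Time and Next.Intake, and the sample rows, are assumptions made for illustration and are not taken from the question.

```r
library(dplyr)

# Hypothetical date-time data: A1 cycles twice on the same day,
# and its first outcome happens minutes after intake
intakes_dt <- data.frame(
  Animal.ID   = c("A1", "A1", "A4"),
  Intake.Time = as.POSIXct(c("2016-01-01 09:00:00", "2016-01-01 15:00:00",
                             "2020-01-01 10:00:00"), tz = "UTC"),
  Intake.Type = c("Surrender", "Surrender", "Seized")
)
outcomes_dt <- data.frame(
  Animal.ID    = c("A1", "A1"),
  Outcome.Time = as.POSIXct(c("2016-01-01 09:20:00", "2016-03-01 12:00:00"),
                            tz = "UTC"),
  Outcome.Type = c("Adoption", "Transfer")
)

# For each intake, record when the same animal next entered the shelter
# (NA for the animal's most recent intake)
intakes_win <- intakes_dt %>%
  arrange(Animal.ID, Intake.Time) %>%
  group_by(Animal.ID) %>%
  mutate(Next.Intake = lead(Intake.Time)) %>%
  ungroup()

# Pair every intake with every outcome of the same animal, keep outcomes
# inside the intake's window, then keep the earliest one per intake
matched <- merge(intakes_win, outcomes_dt, by = "Animal.ID") %>%
  filter(Outcome.Time >= Intake.Time,
         is.na(Next.Intake) | Outcome.Time < Next.Intake) %>%
  arrange(Animal.ID, Intake.Time, Outcome.Time) %>%
  distinct(Animal.ID, Intake.Time, .keep_all = TRUE) %>%
  select(Animal.ID, Intake.Time, Outcome.Time, Outcome.Type)

# Attach the matched outcome (if any) to every intake
merged_dt <- intakes_dt %>%
  left_join(matched, by = c("Animal.ID", "Intake.Time"))

merged_dt
```

Compared with numbering occurrences, this does not depend on every earlier intake having a recorded outcome: if an outcome record were missing in the middle of an animal's history, occurrence numbers would shift later outcomes onto the wrong intake, whereas the window check would leave that intake unmatched. The cost is the intermediate all-pairs `merge()`, which may become large for animals with many records.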