R: Special case of merging two datasets by date

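I am using R to analyze data from an animal shelter. I have one dataset of animal intakes and another of outcomes. I want to merge them so that each animal's intake information and its corresponding outcome information are on the same row.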

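Every outcome has a corresponding earlier intake. Some intakes have no outcome because the animal is still in the system. An animal can cycle through the system multiple times (e.g., surrendered to the shelter, adopted, returned to the shelter, adopted again, and so on).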

The data frames look like this:

Intakes:

 Animal.ID Intake.Date Intake.Type
        A1  2016-01-01   Surrender
        A2  2017-01-01       Stray
        A1  2018-01-01   Surrender
        A3  2019-01-01       Stray
        A4  2020-01-01      Seized
        A5  2021-01-01   Surrender

Outcomes:

 Animal.ID Outcome.Date Outcome.Type
        A1   2016-06-30     Adoption
        A2   2017-06-30   Euthanasia
        A1   2018-06-30     Transfer
        A3   2019-06-30     Adoption
        A5   2021-06-30     Transfer

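In my example, the animal with Animal.ID "A1" cycled through the system twice. The animal with Animal.ID "A4" has no outcome record because it is still in the shelter's care.

How can I merge (join) the datasets so that the result looks like this?

Merged: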

 Animal.ID Intake.Date Intake.Type Outcome.Date Outcome.Type
        A1  2016-01-01   Surrender   2016-06-30     Adoption
        A2  2017-01-01       Stray   2017-06-30   Euthanasia
        A1  2018-01-01   Surrender   2018-06-30     Transfer
        A3  2019-01-01       Stray   2019-06-30     Adoption
        A4  2020-01-01      Seized         <NA>         <NA>
        A5  2021-01-01   Surrender   2021-06-30     Transfer

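I imagine this kind of problem has been solved before, but I must not be using the right terms in my Google searches.

Edit: the actual data contains date/times (not just dates). An outcome may occur within minutes of the intake or months later.

Here is the code to create these example datasets: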

intakes <- data.frame(
  Animal.ID = c("A1","A2","A1","A3","A4","A5"),
  Intake.Date = as.Date(c("2016-01-01","2017-01-01","2018-01-01","2019-01-01","2020-01-01","2021-01-01")),
  Intake.Type = c("Surrender","Stray","Surrender","Stray","Seized","Surrender")
)

outcomes <- data.frame(
  Animal.ID = c("A1","A2","A1","A3","A5"),
  Outcome.Date = as.Date(c("2016-06-30","2017-06-30","2018-06-30","2019-06-30","2021-06-30")),
  Outcome.Type = c("Adoption","Euthanasia","Transfer","Adoption","Transfer")
)

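You need to create a new variable to match on, because the datasets do not contain enough information to identify a unique match for each row. It looks like you want to match on Animal.ID and then on the intake year, so I create a new variable, year, join on both, and then drop it from the final dataset. You can of course build this new variable however you need in order to cover more complex cases (e.g., an intake on 2020-12-31 with an outcome on 2021-01-01).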

library(dplyr)
library(lubridate)

intakes %>%
  mutate(year = year(Intake.Date)) %>%
  left_join(mutate(outcomes, year = year(Outcome.Date)), by = c("Animal.ID", "year")) %>%
  select(-year)

  Animal.ID Intake.Date Intake.Type Outcome.Date Outcome.Type
1        A1  2016-01-01   Surrender   2016-06-30     Adoption
2        A2  2017-01-01       Stray   2017-06-30   Euthanasia
3        A1  2018-01-01   Surrender   2018-06-30     Transfer
4        A3  2019-01-01       Stray   2019-06-30     Adoption
5        A4  2020-01-01      Seized         <NA>         <NA>
6        A5  2021-01-01   Surrender   2021-06-30     Transfer
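
The year key breaks down for the year-boundary case mentioned above and for the date/times in the real data. One possible alternative, sketched here under assumptions, is an inequality join: for each intake, take the nearest outcome at or after it for the same animal. This assumes dplyr 1.1.0 or later (for join_by() and closest()). The data frames intakes_dt and outcomes_dt and the columns Intake.Time and Outcome.Time are hypothetical stand-ins for the real date/time data, not taken from the question.

```r
library(dplyr)  # join_by() and closest() assume dplyr >= 1.1.0

# Hypothetical date-time data: one intake/outcome pair straddles a year
# boundary, and two outcomes happen minutes after their intake.
intakes_dt <- data.frame(
  Animal.ID = c("A1", "A1", "A2", "A3"),
  Intake.Time = as.POSIXct(
    c("2020-12-31 09:00:00", "2021-03-01 10:00:00",
      "2021-05-01 08:00:00", "2021-06-01 12:00:00"),
    tz = "UTC"
  ),
  Intake.Type = c("Surrender", "Surrender", "Stray", "Seized")
)

outcomes_dt <- data.frame(
  Animal.ID = c("A1", "A1", "A2"),
  Outcome.Time = as.POSIXct(
    c("2021-01-01 15:00:00", "2021-03-01 10:20:00",
      "2021-05-01 08:05:00"),
    tz = "UTC"
  ),
  Outcome.Type = c("Adoption", "Transfer", "Euthanasia")
)

# For each intake, keep only the nearest outcome at or after it for the
# same animal; intakes with no such outcome get NA in the outcome columns.
merged_dt <- intakes_dt %>%
  left_join(
    outcomes_dt,
    by = join_by(Animal.ID, closest(Intake.Time <= Outcome.Time))
  )

merged_dt
```

This relies on the rule from the question that only an animal's most recent intake can lack an outcome. If an earlier intake were missing its outcome, this join would attach the next cycle's outcome to it, so it is worth checking the data for that first.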

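The other answer is correct, but it does not sort by date. I assume each intake is associated with exactly one outcome, except for a given animal's most recent intake, which is associated with zero or one outcome.

So I suggest creating a new variable in each dataset that numbers each animal's occurrences uniquely (Animal.ID.occurrence) and joining on it.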

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

intakes <- data.frame(
    Animal.ID = c("A1","A2","A1","A3","A4","A5"),
    Intake.Date = as.Date(c("2016-01-01","2017-01-01","2018-01-01","2019-01-01","2020-01-01","2021-01-01")),
    Intake.Type = c("Surrender","Stray","Surrender","Stray","Seized","Surrender")
)

outcomes <- data.frame(
    Animal.ID = c("A1","A2","A1","A3","A5"),
    Outcome.Date = as.Date(c("2016-06-30","2017-06-30","2018-06-30","2019-06-30","2021-06-30")),
    Outcome.Type = c("Adoption","Euthanasia","Transfer","Adoption","Transfer")
)

intakes_occurrence <- intakes %>% group_by(Animal.ID) %>%
    arrange(Intake.Date) %>% 
    mutate(Animal.ID.occurrence = paste0(Animal.ID, ".", row_number())) %>% 
    ungroup()
outcomes_occurrence <- outcomes %>% group_by(Animal.ID) %>%
    arrange(Outcome.Date) %>% 
    mutate(Animal.ID.occurrence = paste0(Animal.ID, ".", row_number())) %>% 
    ungroup() %>% 
    select(-Animal.ID) 

intakes_occurrence %>% 
    full_join(outcomes_occurrence, by="Animal.ID.occurrence") %>% 
    select(-Animal.ID.occurrence)
#> # A tibble: 6 × 5
#>   Animal.ID Intake.Date Intake.Type Outcome.Date Outcome.Type
#>   <chr>     <date>      <chr>       <date>       <chr>       
#> 1 A1        2016-01-01  Surrender   2016-06-30   Adoption    
#> 2 A2        2017-01-01  Stray       2017-06-30   Euthanasia  
#> 3 A1        2018-01-01  Surrender   2018-06-30   Transfer    
#> 4 A3        2019-01-01  Stray       2019-06-30   Adoption    
#> 5 A4        2020-01-01  Seized      NA           <NA>        
#> 6 A5        2021-01-01  Surrender   2021-06-30   Transfer

Created on 2021-09-05 by the reprex package (v2.0.1)

Edited to sort by date.
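
Because the occurrence join pairs rows purely by position within each animal, it never checks that an outcome actually follows its intake. A rough sanity check along these lines may help catch broken pairings in the real data. It is only a sketch: the names occurrence, suspect, and orphans are my own, and it uses a two-column key (Animal.ID plus a counter) instead of the pasted Animal.ID.occurrence string.

```r
library(dplyr)

# Example data from the question.
intakes <- data.frame(
  Animal.ID = c("A1", "A2", "A1", "A3", "A4", "A5"),
  Intake.Date = as.Date(c("2016-01-01", "2017-01-01", "2018-01-01",
                          "2019-01-01", "2020-01-01", "2021-01-01")),
  Intake.Type = c("Surrender", "Stray", "Surrender", "Stray", "Seized", "Surrender")
)
outcomes <- data.frame(
  Animal.ID = c("A1", "A2", "A1", "A3", "A5"),
  Outcome.Date = as.Date(c("2016-06-30", "2017-06-30", "2018-06-30",
                           "2019-06-30", "2021-06-30")),
  Outcome.Type = c("Adoption", "Euthanasia", "Transfer", "Adoption", "Transfer")
)

# Number each animal's intakes and outcomes in date order.
intakes_n <- intakes %>%
  arrange(Intake.Date) %>%
  group_by(Animal.ID) %>%
  mutate(occurrence = row_number()) %>%
  ungroup()
outcomes_n <- outcomes %>%
  arrange(Outcome.Date) %>%
  group_by(Animal.ID) %>%
  mutate(occurrence = row_number()) %>%
  ungroup()

merged <- intakes_n %>%
  left_join(outcomes_n, by = c("Animal.ID", "occurrence")) %>%
  select(-occurrence)

# Pairings where the matched outcome precedes its intake are suspect.
suspect <- merged %>%
  filter(!is.na(Outcome.Date), Outcome.Date < Intake.Date)

# Outcomes with no intake of the same occurrence number would be
# silently dropped by the left join above.
orphans <- anti_join(outcomes_n, intakes_n, by = c("Animal.ID", "occurrence"))

nrow(suspect)
nrow(orphans)
```

If either count is non-zero on the real data, the one-outcome-per-intake assumption does not hold for those animals and the positional pairing should not be trusted for them.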