通过 id 合并数据帧,同时交织年份并在年份之间传递值

Merging data frames by id while interweaving years and carry values forward between years

我有两个要合并的数据框。根据 id & year.

,它们都包含有关人员的信息

一个是“主”,一个是补充信息。但是,我无法以常规方式(即 merge()dplyr::left_join())合并它们,因为它们中的 year 值不一定与每个 id 匹配。所以我想按时间顺序从第二个 table 中知道的内容转移到主要 table.

中的每 year

在下面的例子中,我有两个关于军官的 table。 “主要”有 3 列 idyear 和另一个 col_1:

df_main_info <-
  tribble(~id, ~year, ~col_1, 
          1,   2008,  "foo",
          1,   2005,  "bar",
          1,   2010,  "blah", 
          1,   2020,  "bar",  
          2,   1999,  "foo", 
          2,   2020,  "foo",  
          3,   2002,  "bar",
          3,   2010,  "bar",
          4,   2003,  "foo",
          4,   2010,  "bar"
  )

我有一个额外的 table 和 idyear 列,用于每个军官获得军衔的时间以及军衔:

df_ranks_history <-
  tribble(~id, ~year, ~army_rank,
          1,   2005,  "second_lieutenant",
          1,   2010,  "first_lieutenant",
          1,   2018,  "major",
          1,   2021,  "colonel",
          2,   2002,  "major",
          2,   2018,  "colonel",
          3,   1995,  "second_lieutenant",
          3,   2000,  "captain",
          3,   2012,  "colonel"
  )

年份不严格匹配。但是,如果例如军官 id = 3 在 2000 年变成了 "captain",那么我们知道在 2002 年仍然是这样,那么我们可以在第 7 行的 df_main_info 中输入“上尉”。

因此,所需的输出应该是:


desired_output <-
  tribble(~id, ~year, ~col_1, ~army_rank,
          1,   2008,  "foo",   "second_lieutenant",
          1,   2005,  "bar",   "second_lieutenant",
          1,   2010,  "blah",  "first_lieutenant",
          1,   2020,  "bar",   "major",
          2,   1999,  "foo",   NA,
          2,   2020,  "foo",   "colonel",
          3,   2002,  "bar",   "captain",
          3,   2010,  "bar",   "captain",
          4,   2003,  "foo",   NA,
          4,   2010,  "bar",   NA
          )

如果这是相关的,排名按一定顺序排列:

us_army_officer_ranks <- c("second_lieutenant", 
                           "first_lieutenant", 
                           "captain", 
                           "major", 
                           "lieutenant_colonel", 
                           "colonel")
# colonel > lieutenant_colonel > major > captain > first_lieutenant > second_lieutenant
library(dplyr)
library(tidyr)

df_main_info %>% 
  full_join(df_ranks_history, by = c("id", "year")) %>%
  group_by(id) %>%
  arrange(id, year) %>%
  fill(army_rank, .direction = "down") %>%
  filter(!is.na(col_1))
# # A tibble: 10 × 4
# # Groups:   id [4]
#       id  year col_1 army_rank        
#    <dbl> <dbl> <chr> <chr>            
#  1     1  2005 bar   second_lieutenant
#  2     1  2008 foo   second_lieutenant
#  3     1  2010 blah  first_lieutenant 
#  4     1  2020 bar   major            
#  5     2  1999 foo   NA               
#  6     2  2020 foo   colonel          
#  7     3  2002 bar   captain          
#  8     3  2010 bar   captain          
#  9     4  2003 foo   NA               
# 10     4  2010 bar   NA    
library(data.table)

setDT(df_main_info)
setDT(df_ranks_history)

df_ranks_history[df_main_info, on = list(id, year), roll = +Inf]

    id year         army_rank col_1
 1:  1 2008 second_lieutenant   foo
 2:  1 2005 second_lieutenant   bar
 3:  1 2010  first_lieutenant  blah
 4:  1 2020             major   bar
 5:  2 1999              <NA>   foo
 6:  2 2020           colonel   foo
 7:  3 2002           captain   bar
 8:  3 2010           captain   bar
 9:  4 2003              <NA>   foo
10:  4 2010              <NA>   bar