Merging data frames by id while interweaving years and carrying values forward between years
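I have two data frames that I want to merge. Both contain information about people, keyed by id and year.
One is the "main" table and the other holds supplementary information. However, I can't merge them in the usual way (i.e., merge() or dplyr::left_join()) because the year values do not necessarily match for each id. So I want to carry what is known from the second table forward in time and attach it to each year row of the main table.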
In the example below, I have two tables about army officers. The "main" one has 3 columns: id, year, and another column, col_1:
library(tibble)  # provides tribble()

df_main_info <-
  tribble(~id, ~year, ~col_1,
          1, 2008, "foo",
          1, 2005, "bar",
          1, 2010, "blah",
          1, 2020, "bar",
          2, 1999, "foo",
          2, 2020, "foo",
          3, 2002, "bar",
          3, 2010, "bar",
          4, 2003, "foo",
          4, 2010, "bar"
  )
I have an additional table with id and year columns that records the year in which each officer received a rank, and which rank it was:
df_ranks_history <-
  tribble(~id, ~year, ~army_rank,
          1, 2005, "second_lieutenant",
          1, 2010, "first_lieutenant",
          1, 2018, "major",
          1, 2021, "colonel",
          2, 2002, "major",
          2, 2018, "colonel",
          3, 1995, "second_lieutenant",
          3, 2000, "captain",
          3, 2012, "colonel"
  )
The years do not match exactly. However, if, for example, officer id = 3 became "captain" in 2000, then we know that was still the case in 2002, so we can enter "captain" in row 7 of df_main_info.
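The rule described above (take the most recent rank awarded in or before the row's year, and NA if none exists yet) can be sketched in base R for a single officer. This is only an illustration for id = 3; the names rank_years, ranks, query_years and rank_at are hypothetical and do not appear elsewhere in the post.

```r
# Promotion history of officer id = 3, sorted by year (taken from df_ranks_history)
rank_years <- c(1995, 2000, 2012)
ranks      <- c("second_lieutenant", "captain", "colonel")

# Years to look up; 1990 is a made-up year that precedes the first promotion
query_years <- c(2002, 2010, 1990)

# findInterval() returns, for each query year, the index of the last
# promotion year that is <= the query year (0 if there is none)
idx <- findInterval(query_years, rank_years)

# Turn index 0 into NA so that "no rank known yet" becomes a missing value
rank_at <- ranks[replace(idx, idx == 0, NA)]
rank_at
```

For 2002 and 2010 the lookup lands on the 2000 promotion, which is the "captain" value expected in rows 7 and 8 of the main table.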
Therefore, the desired output should be:
desired_output <-
  tribble(~id, ~year, ~col_1, ~army_rank,
          1, 2008, "foo", "second_lieutenant",
          1, 2005, "bar", "second_lieutenant",
          1, 2010, "blah", "first_lieutenant",
          1, 2020, "bar", "major",
          2, 1999, "foo", NA,
          2, 2020, "foo", "colonel",
          3, 2002, "bar", "captain",
          3, 2010, "bar", "captain",
          4, 2003, "foo", NA,
          4, 2010, "bar", NA
  )
In case it is relevant, the ranks follow a fixed order:
us_army_officer_ranks <- c("second_lieutenant",
                           "first_lieutenant",
                           "captain",
                           "major",
                           "lieutenant_colonel",
                           "colonel")
# colonel > lieutenant_colonel > major > captain > first_lieutenant > second_lieutenant
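If the ordering ever needs to be used in code (for comparisons or sorting), one way to encode it is an ordered factor. A minimal base R sketch, where rank_factor and its three sample values are made up for illustration:

```r
us_army_officer_ranks <- c("second_lieutenant",
                           "first_lieutenant",
                           "captain",
                           "major",
                           "lieutenant_colonel",
                           "colonel")

# Hypothetical sample of ranks, stored as an ordered factor so that
# comparison operators follow the rank hierarchy rather than the alphabet
rank_factor <- factor(c("major", "captain", "colonel"),
                      levels = us_army_officer_ranks,
                      ordered = TRUE)

# Which of the sample ranks are strictly above captain?
above_captain <- rank_factor > "captain"

# Highest rank in the sample
highest <- as.character(max(rank_factor))
```

The carry-forward merge itself does not depend on this ordering; it only uses the years.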
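library(dplyr)
library(tidyr)
df_main_info %>%
  # combine both tables, keeping every id/year pair from either one
  full_join(df_ranks_history, by = c("id", "year")) %>%
  group_by(id) %>%
  # sort chronologically within each id
  arrange(id, year) %>%
  # carry the last known rank forward to later years of the same id
  fill(army_rank, .direction = "down") %>%
  # keep only rows that came from df_main_info (col_1 is never NA there)
  filter(!is.na(col_1))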
# # A tibble: 10 × 4
# # Groups: id [4]
# id year col_1 army_rank
# <dbl> <dbl> <chr> <chr>
# 1 1 2005 bar second_lieutenant
# 2 1 2008 foo second_lieutenant
# 3 1 2010 blah first_lieutenant
# 4 1 2020 bar major
# 5 2 1999 foo NA
# 6 2 2020 foo colonel
# 7 3 2002 bar captain
# 8 3 2010 bar captain
# 9 4 2003 foo NA
# 10 4 2010 bar NA
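A possible alternative, assuming dplyr 1.1.0 or later is installed (a version assumption that the answer above does not state): join_by() with closest() expresses the "latest rank in or before this year" match directly, and left_join() keeps the row order of df_main_info. The data are repeated so the sketch runs on its own; result is a hypothetical name.

```r
library(dplyr)   # join_by() and closest() require dplyr >= 1.1.0
library(tibble)

df_main_info <- tribble(
  ~id, ~year, ~col_1,
  1, 2008, "foo",
  1, 2005, "bar",
  1, 2010, "blah",
  1, 2020, "bar",
  2, 1999, "foo",
  2, 2020, "foo",
  3, 2002, "bar",
  3, 2010, "bar",
  4, 2003, "foo",
  4, 2010, "bar"
)

df_ranks_history <- tribble(
  ~id, ~year, ~army_rank,
  1, 2005, "second_lieutenant",
  1, 2010, "first_lieutenant",
  1, 2018, "major",
  1, 2021, "colonel",
  2, 2002, "major",
  2, 2018, "colonel",
  3, 1995, "second_lieutenant",
  3, 2000, "captain",
  3, 2012, "colonel"
)

result <- df_main_info %>%
  # match on id, then pick the closest promotion year that is <= the row's year
  left_join(df_ranks_history, by = join_by(id, closest(year >= year))) %>%
  # both year columns are kept as year.x / year.y; keep the main table's year
  select(id, year = year.x, col_1, army_rank)

result
```

Unlike the full_join() and fill() approach, this does not reorder the rows, so the result lines up row by row with desired_output.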
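library(data.table)
# convert both data frames to data.tables by reference
setDT(df_main_info)
setDT(df_ranks_history)
# rolling join: for each row of df_main_info, take the latest rank
# recorded in or before that year for the same id
df_ranks_history[df_main_info, on = list(id, year), roll = +Inf]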
id year army_rank col_1
1: 1 2008 second_lieutenant foo
2: 1 2005 second_lieutenant bar
3: 1 2010 first_lieutenant blah
4: 1 2020 major bar
5: 2 1999 <NA> foo
6: 2 2020 colonel foo
7: 3 2002 captain bar
8: 3 2010 captain bar
9: 4 2003 <NA> foo
10: 4 2010 <NA> bar
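The join above returns army_rank before col_1. If the column layout of the main table should stay as it is, one option is to pull only the rolled rank out of the join and add it by reference. A sketch under that assumption, with the data rebuilt via data.table() so it runs on its own; main, ranks and rank_at_year are hypothetical names.

```r
library(data.table)

main <- data.table(
  id    = c(1, 1, 1, 1, 2, 2, 3, 3, 4, 4),
  year  = c(2008, 2005, 2010, 2020, 1999, 2020, 2002, 2010, 2003, 2010),
  col_1 = c("foo", "bar", "blah", "bar", "foo", "foo", "bar", "bar", "foo", "bar")
)

ranks <- data.table(
  id        = c(1, 1, 1, 1, 2, 2, 3, 3, 3),
  year      = c(2005, 2010, 2018, 2021, 2002, 2018, 1995, 2000, 2012),
  army_rank = c("second_lieutenant", "first_lieutenant", "major", "colonel",
                "major", "colonel",
                "second_lieutenant", "captain", "colonel")
)

# Rolling join that returns just the matched rank, one value per row of main
# (roll = TRUE is equivalent to roll = +Inf; x.army_rank refers to ranks' column)
rank_at_year <- ranks[main, on = .(id, year), roll = TRUE, x.army_rank]

# Add the rank as a new last column of main, by reference
main[, army_rank := rank_at_year]
```

After this, main has the columns id, year, col_1, army_rank in the same row order as the original table.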