Joining two data frames by sequential column matching

Suppose I have two data frames like this:

test_1 <- read.table(text = "id winner_name
1 Jon
2 Bob
3 Lucas
4 Marcus
5 Toad
6 Donkey", header = TRUE)

test_2 <- read.table(text = "id_1 id_2 loser_name
9 1 Henry
2 2 George
3 3 Bagel
4 4 Cat
5 5 Giraffe
7 6 Monkey", header = TRUE)

What I want to do is first left_join test_1 to test_2, matching id = id_1, like this:

test_1 %>% left_join(test_2 %>% select(id_1, loser_name), by = c('id' = 'id_1'))

This leaves NA for some matches (Jon and Donkey):

  id winner_name loser_name
1  1         Jon       <NA>
2  2         Bob     George
3  3       Lucas      Bagel
4  4      Marcus        Cat
5  5        Toad    Giraffe
6  6      Donkey       <NA>

Then I want to fill in the NAs using id_2 as the matching column, so currently I do this:

test_1 %>% left_join(test_2 %>% select(id_1, loser_name), by = c('id' = 'id_1')) %>%
  left_join(test_2 %>% select(id_2, loser_name), by = c('id' = 'id_2'))

  id winner_name loser_name.x loser_name.y
1  1         Jon         <NA>        Henry
2  2         Bob       George       George
3  3       Lucas        Bagel        Bagel
4  4      Marcus          Cat          Cat
5  5        Toad      Giraffe      Giraffe
6  6      Donkey         <NA>       Monkey

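This seems to work, but it generates a bunch of duplicate columns with .x and .y suffixes. In my actual dataset I have to match on a large number of ids with this conditional approach, so it produces a lot of duplicate columns that I then have to deselect and rename by hand.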

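The problem is that the actual test_2 data.frame has hundreds of columns (loser_name, loser_country, loser_elo, loser_record, loser_win_rate, and so on), so I would have to manually specify the names and the columns to coalesce. Also, because I do this sequential matching with multiple ids, I end up with loser_name.x, loser_name.y, loser_name.z, and I don't know in advance how many suffixes each column will have.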

Is there a simpler way?

We can do a coalesce at the end

library(dplyr)
test_1 %>%
   left_join(test_2 %>% select(id_1, loser_name), by = c('id' = 'id_1')) %>%  
   left_join(test_2 %>% select(id_2, loser_name), by = c('id' = 'id_2')) %>%
   transmute(id, winner_name, loser_name = coalesce(loser_name.x, loser_name.y))

-output

   id winner_name loser_name
1  1         Jon      Henry
2  2         Bob     George
3  3       Lucas      Bagel
4  4      Marcus        Cat
5  5        Toad    Giraffe
6  6      Donkey     Monkey
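
To see what coalesce is doing in that last step, here is a minimal sketch on two plain vectors. The names from_id_1 and from_id_2 are made up for illustration; they stand in for loser_name.x and loser_name.y from the joins above.

```r
library(dplyr)

# Hypothetical vectors standing in for loser_name.x and loser_name.y
from_id_1 <- c(NA, "George", "Bagel", "Cat", "Giraffe", NA)
from_id_2 <- c("Henry", "George", "Bagel", "Cat", "Giraffe", "Monkey")

# At each position, take the first non-missing value, reading the arguments left to right
filled <- coalesce(from_id_1, from_id_2)
print(filled)
```

Because the arguments are read left to right, the id_1 match wins whenever it exists and the id_2 match is only used as a fallback.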

If there are many 'id_' columns in 'test_2', here is an option

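library(dplyr)
library(purrr)
library(stringr)
# names of all the id columns in test_2 (id_1, id_2, ...)
nm1 <- grep("id_", names(test_2), value = TRUE)
# join test_1 to test_2 once per id column, then combine the results
out <-  map(nm1, ~ test_2 %>% 
          select( -starts_with('id'), all_of(.x)) %>%
           left_join(test_1, ., by = setNames(.x, 'id'))) %>% 
           reduce(left_join, by = c("id", "winner_name")) 
# group the suffixed loser_* columns by base name and coalesce each group
out %>% 
  select(starts_with('loser')) %>% 
  split.default(str_remove(names(.), "\\..*")) %>%   # the dot must be escaped as "\\."
  map_dfc(~ do.call(coalesce, .x)) %>%               # invoke() is deprecated in purrr 1.0.0
  bind_cols(test_1, .)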
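
The grouping step is the dense part, so here is a small sketch of it in isolation. The data frame joined and its loser_elo columns are invented for illustration and are not taken from the question's data; the idea is only to show how stripping the suffix and splitting by the result lines up the columns that belong together.

```r
library(dplyr)

# Hypothetical join result with suffixed duplicates of two columns
joined <- data.frame(
  loser_name.x = c(NA, "George"),
  loser_name.y = c("Henry", "George"),
  loser_elo.x  = c(NA, 1500),
  loser_elo.y  = c(1400, 1500)
)

# Drop everything from the first dot onward to recover the base column name
base_names <- sub("\\..*", "", names(joined))

# split.default() on a data frame splits its columns, here by base name
groups <- split.default(joined, base_names)

# Coalesce the columns within each group, left to right
collapsed <- as.data.frame(
  lapply(groups, function(g) do.call(coalesce, unname(as.list(g))))
)
print(collapsed)
```

Since the split is keyed on the stripped name, this does not need to know in advance how many suffixed copies of each column exist.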

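You can try joining test_1 on a molten (= long) version of test_2. This only works if the order of the molten test_2 is consistent with the search order of your ids. With this you can handle id_1, id_2, ..., id_100.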

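library(data.table)
# convert both data frames to data.tables (by reference)
setDT(test_1)
setDT(test_2)

# melt() stacks every id column into a single `value` column;
# the update join then copies loser_name into test_1 where id matches value
test_1[melt(test_2, id.vars = "loser_name"), 
       loser_name := i.loser_name, 
       on = .(id = value)]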

#    id winner_name loser_name
# 1:  1         Jon      Henry
# 2:  2         Bob     George
# 3:  3       Lucas      Bagel
# 4:  4      Marcus        Cat
# 5:  5        Toad    Giraffe
# 6:  6      Donkey     Monkey
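
For anyone who prefers to stay in the tidyverse, the same long-format idea can be sketched with tidyr. This is an illustration under assumptions, not part of the answer above: it assumes the id columns should be searched in the order they appear in test_2, and the helper columns id_col and priority are names made up here. The data frames are re-created so the block runs on its own.

```r
library(dplyr)
library(tidyr)

# Re-create the example data so this block is self-contained
test_1 <- read.table(text = "id winner_name
1 Jon
2 Bob
3 Lucas
4 Marcus
5 Toad
6 Donkey", header = TRUE)

test_2 <- read.table(text = "id_1 id_2 loser_name
9 1 Henry
2 2 George
3 3 Bagel
4 4 Cat
5 5 Giraffe
7 6 Monkey", header = TRUE)

# One row per (candidate id, loser); all loser_* columns are carried along untouched
long <- test_2 %>%
  pivot_longer(cols = starts_with("id_"), names_to = "id_col", values_to = "id") %>%
  # assumed search order: position of the id column in test_2 (id_1 first, then id_2, ...)
  mutate(priority = match(id_col, names(test_2))) %>%
  arrange(priority) %>%
  # keep only the highest-priority row for each id
  distinct(id, .keep_all = TRUE) %>%
  select(-id_col, -priority)

# A single join, so no .x / .y suffixes are ever created
result <- left_join(test_1, long, by = "id")
print(result)
```

Keeping the priority as an explicit column makes the search order visible and easy to change, at the cost of one extra reshape.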