Joining two data frames by sequential column matching
Suppose I have two data frames like this:
test_1 <- read.table(text = "id winner_name
1 Jon
2 Bob
3 Lucas
4 Marcus
5 Toad
6 Donkey", header = TRUE)
test_2 <- read.table(text = "id_1 id_2 loser_name
9 1 Henry
2 2 George
3 3 Bagel
4 4 Cat
5 5 Giraffe
7 6 Monkey", header = TRUE)
What I want to do is first left_join test_1 on id = id_1, like this:
test_1 %>% left_join(test_2 %>% select(id_1, loser_name), by = c('id' = 'id_1'))
This leaves NA for some matches (Jon and Donkey):
id winner_name loser_name
1 1 Jon <NA>
2 2 Bob George
3 3 Lucas Bagel
4 4 Marcus Cat
5 5 Toad Giraffe
6 6 Donkey <NA>
Then I want to use id_2 as the matching column to fill in the NAs, so I currently do this:
test_1 %>% left_join(test_2 %>% select(id_1, loser_name), by = c('id' = 'id_1')) %>%
left_join(test_2 %>% select(id_2, loser_name), by = c('id' = 'id_2'))
id winner_name loser_name.x loser_name.y
1 1 Jon <NA> Henry
2 2 Bob George George
3 3 Lucas Bagel Bagel
4 4 Marcus Cat Cat
5 5 Toad Giraffe Giraffe
6 6 Donkey <NA> Monkey
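This seems to work, but it produces a bunch of duplicate columns with .x and .y suffixes. In my actual dataset I have to match through a large number of ids with this conditional approach, so it generates many duplicate columns that I then have to deselect and rename by hand.
The problem is that the actual test_2 data.frame has hundreds of columns (loser_name, loser_country, loser_elo, loser_record, loser_win_rate, and so on), so I would have to manually specify the names of the columns to coalesce. Also, because I do this sequential matching over multiple ids, I end up with loser_name.x, loser_name.y, loser_name.z, and I don't know in advance how many suffixes each column will have.
Is there an easier way?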
We can do a coalesce at the end:
library(dplyr)
test_1 %>%
left_join(test_2 %>% select(id_1, loser_name), by = c('id' = 'id_1')) %>%
left_join(test_2 %>% select(id_2, loser_name), by = c('id' = 'id_2')) %>%
transmute(id, winner_name, loser_name = coalesce(loser_name.x, loser_name.y))
Output:
id winner_name loser_name
1 1 Jon Henry
2 2 Bob George
3 3 Lucas Bagel
4 4 Marcus Cat
5 5 Toad Giraffe
6 6 Donkey Monkey
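The same two-join idea can be extended to several value columns without typing each name. The following is only a sketch under assumptions: the loser_country column and its values are made up for illustration, and it relies on dplyr's default .x / .y suffixes after exactly two joins.

```r
library(dplyr)

test_1 <- data.frame(
  id = 1:6,
  winner_name = c("Jon", "Bob", "Lucas", "Marcus", "Toad", "Donkey")
)
# loser_country is a hypothetical extra column, not part of the question's data
test_2 <- data.frame(
  id_1 = c(9L, 2L, 3L, 4L, 5L, 7L),
  id_2 = 1:6,
  loser_name = c("Henry", "George", "Bagel", "Cat", "Giraffe", "Monkey"),
  loser_country = c("UK", "US", "FR", "DE", "KE", "BR")
)

# every non-id column of test_2 should end up coalesced
value_cols <- setdiff(names(test_2), c("id_1", "id_2"))

joined <- test_1 %>%
  left_join(select(test_2, -id_2), by = c("id" = "id_1")) %>%
  left_join(select(test_2, -id_1), by = c("id" = "id_2"))

# each value column now exists twice: <name>.x (id_1 match) and <name>.y (id_2 match)
for (col in value_cols) {
  x_col <- paste0(col, ".x")
  y_col <- paste0(col, ".y")
  joined[[col]] <- coalesce(joined[[x_col]], joined[[y_col]])  # prefer the id_1 match
  joined[[x_col]] <- NULL
  joined[[y_col]] <- NULL
}
joined
```

The loop only needs the list of value columns, so it does not care how many loser_* columns there are, but it still assumes exactly two id columns.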
If there are many 'id_' columns in 'test_2', here is one option:
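library(dplyr)
library(purrr)
library(stringr)
# all id columns of test_2 to match on, in order
nm1 <- grep("id_", names(test_2), value = TRUE)
# one left join per id column, then combine the results into a single wide frame
out <- map(nm1, ~ test_2 %>%
    select(-starts_with('id'), all_of(.x)) %>%
    left_join(test_1, ., by = setNames(.x, 'id'))) %>%
  reduce(left_join, by = c("id", "winner_name"))
# group the suffixed copies by base name and coalesce each group
out %>%
  select(starts_with('loser')) %>%
  split.default(str_remove(names(.), "\\..*")) %>%
  map_dfc(~ do.call(coalesce, unname(as.list(.x)))) %>%
  bind_cols(test_1, .)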
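If the suffixed columns are the main pain point, a different technique avoids them altogether: look up the matching row number in test_2 with base R's match(), trying the id columns in order, and pull all value columns once at the end. This is a sketch of that alternative, not the approach above, and it assumes each id value appears at most once per id column.

```r
test_1 <- data.frame(
  id = 1:6,
  winner_name = c("Jon", "Bob", "Lucas", "Marcus", "Toad", "Donkey")
)
test_2 <- data.frame(
  id_1 = c(9L, 2L, 3L, 4L, 5L, 7L),
  id_2 = 1:6,
  loser_name = c("Henry", "George", "Bagel", "Cat", "Giraffe", "Monkey")
)

id_cols <- grep("^id_", names(test_2), value = TRUE)  # searched in this order
value_cols <- setdiff(names(test_2), id_cols)         # everything to carry over

# for each row of test_1, the row of test_2 that matched first (NA if none)
row_idx <- rep(NA_integer_, nrow(test_1))
for (id_col in id_cols) {
  hit <- match(test_1$id, test_2[[id_col]])
  row_idx <- ifelse(is.na(row_idx), hit, row_idx)  # keep earlier matches
}

# indexing with NA yields a row of NAs, so unmatched ids stay NA
result <- cbind(test_1, test_2[row_idx, value_cols, drop = FALSE])
rownames(result) <- NULL
result
```

Because only one row index per winner is computed, the number of value columns and id columns can grow without any renaming step.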
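You can try joining test_1 on a molten (= long) version of test_2. This only works if the row order of the molten test_2 matches the order in which you want the ids searched.
It then scales to id_1, id_2, ..., id_100.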
library(data.table)
#make them data.tables
setDT(test_1)
setDT(test_2)
#join on molten set
test_1[melt(test_2, id.vars = "loser_name"),
loser_name := i.loser_name,
on = .(id = value)]
# id winner_name loser_name
# 1: 1 Jon Henry
# 2: 2 Bob George
# 3: 3 Lucas Bagel
# 4: 4 Marcus Cat
# 5: 5 Toad Giraffe
# 6: 6 Donkey Monkey
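To make the search order explicit rather than rely on which row an update join applies last, one possible variation is to keep only the first molten row per id before joining. This is a sketch under assumptions: the names id_col, long and first_hit are illustrative, and the id columns are taken to be everything matching "^id_".

```r
library(data.table)

test_1 <- data.table(
  id = 1:6,
  winner_name = c("Jon", "Bob", "Lucas", "Marcus", "Toad", "Donkey")
)
test_2 <- data.table(
  id_1 = c(9L, 2L, 3L, 4L, 5L, 7L),
  id_2 = 1:6,
  loser_name = c("Henry", "George", "Bagel", "Cat", "Giraffe", "Monkey")
)

# long format: all id_1 rows first, then all id_2 rows
long <- melt(test_2, measure.vars = patterns("^id_"),
             variable.name = "id_col", value.name = "id")

# keep the first occurrence of each id, so id_1 takes priority over id_2
first_hit <- unique(long, by = "id")

# one row per winner, in the order of test_1
result <- first_hit[test_1, on = "id", .(id, winner_name, loser_name)]
result
```

Using measure.vars instead of id.vars also means the hundreds of loser_* columns do not have to be listed.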