In dplyr, how do I join data frames by columns that may or may not exist?

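I have two data frames that I want to join. I always have one main mutual column to join by, but sometimes the data may contain another column that I would like to join by in addition to the main one.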

How can I specify possible columns to join by?

Example

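I use two datasets derived from mtcars to demonstrate my problem. Both have a "main" column (cars) that I will always join by, and sometimes one or both datasets may have another mutual column (some_letters).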

library(tidyverse)

create_df <- function(columns_to_include) {
  
  mtcars %>%
    rownames_to_column("cars") %>%
    select(cars, {{ columns_to_include }}) %>%
    slice_sample(n = 15) %>%
    {if (sample(c(TRUE, FALSE), size = 1)) add_column(., some_letters = letters[1:15]) else .}
}

# both dataframes have "some_letters"
set.seed(123)
df_a1 <- create_df(carb)
df_a2 <- create_df(gear)
scenario_a <- inner_join(df_a1, df_a2, by = c("cars", "some_letters"))
scenario_a
#>             cars carb some_letters gear
#> 1 Ford Pantera L    4            l    5

# neither dataframe has "some_letters"
set.seed(111)
df_b1 <- create_df(carb)
df_b2 <- create_df(gear)
scenario_b <- inner_join(df_b1, df_b2, by = c("cars", "some_letters"))
#> Error: Join columns must be present in data.
#> x Problem with `some_letters`.

# one dataframe has "some_letters" but the other doesn't
set.seed(737)
df_c1 <- create_df(carb)
df_c2 <- create_df(gear)
scenario_c <- inner_join(df_c1, df_c2, by = c("cars", "some_letters"))
#> Error: Join columns must be present in data.
#> x Problem with `some_letters`.

Created on 2021-02-20 by the reprex package (v0.3.0)

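We can see that in scenario_a the join works because both df_a1 and df_a2 contain some_letters. In scenario_b, however, the join fails because some_letters does not exist in either data frame. Similarly, scenario_c shows the case where some_letters appears in one dataset but not the other, so the join fails as well.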

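When joining the data, can I specify that some_letters is possible but not guaranteed to be present, so that when it appears in both data frames it is used as an additional join-by column, and otherwise it is ignored in the by argument?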

Desired output

inner_join(df_b1, df_b2, by = c("cars", "some_letters"))

# as if we joined by `cars` only:

##                 cars carb gear
## 1      Porsche 914-2    2    5
## 2 Cadillac Fleetwood    4    3
## 3   Pontiac Firebird    2    3
## 4         Datsun 710    1    4
## 5          Merc 240D    2    4
## 6  Chrysler Imperial    4    3
## 7     Hornet 4 Drive    1    3
## 8         Camaro Z28    4    3

Create a vector of the intersecting names: the candidate join columns that are present in both data frames.

library(dplyr)
library(purrr)
nm1 <- reduce(list(names(df_b1), names(df_b2),
             c("cars", "some_letters")), intersect)

Then do the join

inner_join(df_b1, df_b2, by =  nm1)

-output

#                cars carb gear
#1      Porsche 914-2    2    5
#2 Cadillac Fleetwood    4    3
#3   Pontiac Firebird    2    3
#4         Datsun 710    1    4
#5          Merc 240D    2    4
#6  Chrysler Imperial    4    3
#7     Hornet 4 Drive    1    3
#8         Camaro Z28    4    3
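
If this pattern comes up repeatedly, the intersect step could be wrapped in a small helper. The following is only a sketch: inner_join_available is a hypothetical name (not part of dplyr), it uses base R's Reduce() in place of purrr::reduce(), and the toy data frames stand in for the three scenarios in the question rather than reproducing the sampled mtcars data.

```r
library(dplyr)

# Hypothetical helper (not part of dplyr): join by whichever of the
# candidate columns are present in both data frames.
inner_join_available <- function(x, y, candidates) {
  # Keep only the candidates found in both inputs, in their original order
  by_cols <- Reduce(intersect, list(candidates, names(x), names(y)))
  if (length(by_cols) == 0) {
    stop("None of the candidate join columns are present in both data frames.")
  }
  inner_join(x, y, by = by_cols)
}

# Toy data standing in for the scenarios in the question (assumed values)
both_x <- data.frame(cars = c("A", "B"), some_letters = c("a", "b"),
                     carb = 1:2, stringsAsFactors = FALSE)
both_y <- data.frame(cars = c("A", "B"), some_letters = c("a", "z"),
                     gear = 3:4, stringsAsFactors = FALSE)
none_x <- select(both_x, -some_letters)
none_y <- select(both_y, -some_letters)

candidates <- c("cars", "some_letters")

# Both have some_letters: joined by cars and some_letters
joined_both <- inner_join_available(both_x, both_y, candidates)

# Neither has some_letters: joined by cars only
joined_none <- inner_join_available(none_x, none_y, candidates)

# Only one has some_letters: joined by cars only, some_letters is carried along
joined_mixed <- inner_join_available(both_x, none_y, candidates)
```

Stopping when no candidate column is shared is a design choice in this sketch; without it, an empty by vector would be handed to inner_join(), which is unlikely to be what the caller wants.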