In dplyr, how do I join dataframes by columns that may or may not exist?
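I have two dataframes that I want to join. While I always have one main mutual column to join by, I sometimes have another column in the data that I want to join by, in addition to the main one.
How can I specify columns that are only possibly available to join by?
Example
I use two datasets derived from mtcars to demonstrate my problem. Both have a "main" column (cars) that I will always join by, and sometimes there is another mutual column (some_letters) in one or both of the datasets.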
library(tidyverse)
create_df <- function(columns_to_include) {
  mtcars %>%
    rownames_to_column("cars") %>%
    select(cars, {{ columns_to_include }}) %>%
    slice_sample(n = 15) %>%
    # randomly decide whether this dataframe gets a "some_letters" column
    {if (sample(c(TRUE, FALSE), size = 1)) add_column(., some_letters = letters[1:15]) else .}
}
# both dataframes have "some_letters"
set.seed(123)
df_a1 <- create_df(carb)
df_a2 <- create_df(gear)
scenario_a <- inner_join(df_a1, df_a2, by = c("cars", "some_letters"))
scenario_a
#>             cars carb some_letters gear
#> 1 Ford Pantera L    4            l    5
# neither dataframe has "some_letters"
set.seed(111)
df_b1 <- create_df(carb)
df_b2 <- create_df(gear)
scenario_b <- inner_join(df_b1, df_b2, by = c("cars", "some_letters"))
#> Error: Join columns must be present in data.
#> x Problem with `some_letters`.
# one dataframe has "some_letters" but the other doesn't
set.seed(737)
df_c1 <- create_df(carb)
df_c2 <- create_df(gear)
scenario_c <- inner_join(df_c1, df_c2, by = c("cars", "some_letters"))
#> Error: Join columns must be present in data.
#> x Problem with `some_letters`.
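Created on 2021-02-20 by the reprex package (v0.3.0)
We can see that in scenario_a the join works, because both df_a1 and df_a2 contain some_letters. In scenario_b, however, the join fails because some_letters does not exist in either dataset. Likewise, scenario_c shows the case where some_letters appears in one dataset but not in the other, so the join fails as well.
When joining the data, can I specify that some_letters is possible but not guaranteed to appear, so that when it appears in both datasets it is used as an additional join-by column, and otherwise it is dropped from the by argument?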
Desired output
inner_join(df_b1, df_b2, by = c("cars", "some_letters"))
# as if we joined by `cars` only:
##                 cars carb gear
## 1      Porsche 914-2    2    5
## 2 Cadillac Fleetwood    4    3
## 3   Pontiac Firebird    2    3
## 4         Datsun 710    1    4
## 5          Merc 240D    2    4
## 6  Chrysler Imperial    4    3
## 7     Hornet 4 Drive    1    3
## 8         Camaro Z28    4    3
Create a vector of the intersecting names, i.e. the candidate join columns that are present in both dataframes:
library(dplyr)
library(purrr)
nm1 <- reduce(list(names(df_b1), names(df_b2),
                   c("cars", "some_letters")), intersect)
Then join:
inner_join(df_b1, df_b2, by = nm1)
Output:
#                cars carb gear
#1      Porsche 914-2    2    5
#2 Cadillac Fleetwood    4    3
#3   Pontiac Firebird    2    3
#4         Datsun 710    1    4
#5          Merc 240D    2    4
#6  Chrysler Imperial    4    3
#7     Hornet 4 Drive    1    3
#8         Camaro Z28    4    3
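The same idea can be wrapped in a small helper so that it covers all three scenarios from the question. This is only a sketch: inner_join_available is a hypothetical name, not a dplyr function, and the toy tibbles are made up for illustration rather than taken from the mtcars example above.

```r
library(dplyr)

# Hypothetical helper (not part of dplyr): join by whichever of the
# candidate columns are present in BOTH dataframes.
inner_join_available <- function(x, y, candidates) {
  by_cols <- Reduce(base::intersect, list(candidates, names(x), names(y)))
  if (length(by_cols) == 0) {
    stop("None of the candidate join columns exist in both dataframes.")
  }
  inner_join(x, y, by = by_cols)
}

# Toy data, invented for this sketch
left_both <- tibble(cars = c("A", "B", "C"), carb = c(1, 2, 4),
                    some_letters = c("x", "y", "z"))
right_both <- tibble(cars = c("A", "B", "C"), gear = c(3, 4, 5),
                     some_letters = c("x", "y", "q"))
right_main_only <- tibble(cars = c("A", "B", "C"), gear = c(3, 4, 5))

# Both sides have some_letters: joined by cars AND some_letters
res_both <- inner_join_available(left_both, right_both,
                                 c("cars", "some_letters"))

# Only the left side has some_letters: joined by cars alone
res_one <- inner_join_available(left_both, right_main_only,
                                c("cars", "some_letters"))
```

Note that when only one dataframe has some_letters, the column is not used as a key but is still carried into the result as an ordinary column. The explicit stop() is a design choice of this sketch, so that an empty set of keys is reported directly instead of being passed on to inner_join().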