Coalescing duplicated variables after left_join using vector of variable names

Question

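I have a recurring problem with coalescing the non-NA values of duplicated columns after a join and then removing the duplicates. It is similar to issues described in other questions. I want to create a small function around coalesce (and potentially including left_join) so that I can deal with this in one line whenever I run into it (the function itself can of course be as long as needed).

In doing so, I ran into the lack of a quo_names equivalent for quos, which has been described elsewhere.

For a reprex, take a data frame with identifying information and join it with other data frames that contain the correct values but often have misspelled IDs.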

library(dplyr)
library(rlang)

iris_identifiers <- iris %>% 
  select(contains("Petal"), Species)

iris_alt_name1 <- iris %>% 
  mutate(Species = recode(Species, "setosa" = "stosa")) 

iris_alt_name2 <- iris %>%
  mutate(Species = recode(Species, "versicolor" = "verscolor"))

This simpler function works:

replace_xy <- function(df, var) {

  x_var <- paste0(var, ".x")
  y_var <- paste0(var, ".y")

  df %>% 
    mutate(!! quo_name(var) := coalesce(!! sym(x_var), !! sym(y_var))) %>% 
    select(-(!! sym(x_var)), -(!! sym(y_var)))

}


iris_full <- iris_identifiers %>% 
  left_join(iris_alt_name1, by = c("Species", "Petal.Length", "Petal.Width")) %>% 
  left_join(iris_alt_name2, by = c("Species", "Petal.Length", "Petal.Width")) %>% 
  replace_xy("Sepal.Length") %>% 
  replace_xy("Sepal.Width")


head(iris_full)
#>   Petal.Length Petal.Width Species Sepal.Length Sepal.Width
#> 1          1.4         0.2  setosa          5.1         3.5
#> 2          1.4         0.2  setosa          4.9         3.0
#> 3          1.4         0.2  setosa          5.0         3.6
#> 4          1.4         0.2  setosa          4.4         2.9
#> 5          1.4         0.2  setosa          5.2         3.4
#> 6          1.4         0.2  setosa          5.5         4.2

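But I am a bit lost as to how to generalise this to multiple variables, which I thought would be the easier part. The snippet below is just a desperate attempt, after trying a number of variations, that roughly captures what I am trying to achieve.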

replace_many_xy <- function(df, vars) {

  x_var <- paste0(vars, ".x")
  y_var <- paste0(vars, ".y")

  df %>% 
    mutate_at(vars(vars), funs(replace_xy(.data, .))) %>% 
    select(-(!!! syms(x_var)), -(!!! syms(y_var)))

}

new_cols <- colnames(iris_alt_name1)
diff_cols <- new_cols[!(new_cols %in% colnames(iris_identifiers))]

iris_full <- iris_identifiers %>% 
  left_join(iris_alt_name1, by = c("Species", "Petal.Length", "Petal.Width")) %>% 
  left_join(iris_alt_name2, by = c("Species", "Petal.Length", "Petal.Width")) %>% 
  replace_many_xy(diff_cols)
#> Warning: Column `Species` joining factors with different levels, coercing
#> to character vector

#> Warning: Column `Species` joining character vector and factor, coercing
#> into character vector
#> Error: Unknown columns `Sepal.Length` and `Sepal.Width`

Any help would be much appreciated!!

Answer 1

We can use {powerjoin}:

library(powerjoin)
iris_full <- iris_identifiers %>%
  left_join(iris_alt_name1, by = c("Species", "Petal.Length", "Petal.Width")) %>%
  power_left_join(iris_alt_name2, by = c("Species", "Petal.Length", "Petal.Width"), conflict  = coalesce_xy) %>%
  head()

iris_full
#   Petal.Length Petal.Width Species Sepal.Length Sepal.Width
# 1          1.4         0.2  setosa          5.1         3.5
# 2          1.4         0.2  setosa          4.9         3.0
# 3          1.4         0.2  setosa          5.0         3.6
# 4          1.4         0.2  setosa          4.4         2.9
# 5          1.4         0.2  setosa          5.2         3.4
# 6          1.4         0.2  setosa          5.5         4.2

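power_left_join is an improved left_join that offers several ways of handling column conflicts through the conflict argument, as we did here.

The conflict argument takes a function that is applied to each pair of conflicting columns in turn. To coalesce from the right-hand side instead, use conflict = coalesce_yx.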
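To illustrate what the coalescing direction means, here is a minimal sketch using plain dplyr::coalesce on two toy vectors. The names x_col and y_col are made up and stand in for one pair of conflicting columns; the sketch does not call powerjoin itself, it only mirrors the left-first versus right-first behaviour.

```r
library(dplyr)

# Hypothetical pair of conflicting columns, as they might look after a join
x_col <- c(5.1, NA, 4.7, NA)
y_col <- c(9.9, 3.0, NA, NA)

# Left-first: keep the value from x where present, otherwise fall back to y
left_first <- coalesce(x_col, y_col)

# Right-first: keep the value from y where present, otherwise fall back to x
right_first <- coalesce(y_col, x_col)

print(left_first)
print(right_first)
```

The two results differ only in rows where both columns hold a non-NA value (the first row here); rows where both are NA stay NA either way.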

Here is a way to make your function work:

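replace_many_xy <- function(tbl, vars){
  for(var in vars){
    # names of the suffixed duplicates created by the join
    x <- paste0(var, ".x")
    y <- paste0(var, ".y")
    # take the first non-NA value from .x, then .y, and drop both duplicates
    tbl <- mutate(tbl, !!sym(var) := coalesce(!!sym(x), !!sym(y))) %>%
      select(-all_of(c(x, y)))
  }
  tbl
}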
iris_full <- iris_identifiers %>%
  left_join(iris_alt_name1, by = c("Species", "Petal.Length", "Petal.Width")) %>%
  left_join(iris_alt_name2, by = c("Species", "Petal.Length", "Petal.Width")) %>%
  replace_many_xy(diff_cols) %>% as_tibble()
# # A tibble: 372 x 5
#    Petal.Length Petal.Width Species Sepal.Length Sepal.Width
#           <dbl>       <dbl> <chr>          <dbl>       <dbl>
#  1          1.4         0.2 setosa           5.1         3.5
#  2          1.4         0.2 setosa           4.9         3  
#  3          1.4         0.2 setosa           5           3.6
#  4          1.4         0.2 setosa           4.4         2.9
#  5          1.4         0.2 setosa           5.2         3.4
#  6          1.4         0.2 setosa           5.5         4.2
#  7          1.4         0.2 setosa           4.6         3.2
#  8          1.4         0.2 setosa           5           3.3
#  9          1.4         0.2 setosa           5.1         3.5
# 10          1.4         0.2 setosa           4.9         3  
# # ... with 362 more rows
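
As a loop-free variation on the same idea, here is a sketch that builds every coalesced column first and then splices them into a single mutate() call. The function name coalesce_suffixed, its suffix argument and the toy data frame are made up for illustration; it assumes every name in vars has both a suffixed .x and .y column in the table.

```r
library(dplyr)

# Hypothetical helper: coalesce each <var>.x / <var>.y pair into <var>
coalesce_suffixed <- function(tbl, vars, suffix = c(".x", ".y")) {
  x_cols <- paste0(vars, suffix[1])
  y_cols <- paste0(vars, suffix[2])

  # One coalesced vector per variable, named after the unsuffixed variable
  merged <- Map(function(x, y) coalesce(tbl[[x]], tbl[[y]]), x_cols, y_cols)
  names(merged) <- vars

  tbl %>%
    mutate(!!!merged) %>%                  # add all coalesced columns at once
    select(-all_of(c(x_cols, y_cols)))     # drop the suffixed duplicates
}

# Toy data standing in for the result of a join with duplicated columns
toy <- tibble(
  id  = 1:3,
  a.x = c(1, NA, 3),
  a.y = c(NA, 2, 30),
  b.x = c("p", NA, NA),
  b.y = c("q", "r", NA)
)

res <- coalesce_suffixed(toy, c("a", "b"))
print(res)
```

With the data from the question, the call would presumably be coalesce_suffixed(diff_cols) at the end of the same pipe of left_join calls.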
