Create R function using dplyr::filter problem

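I have looked at other answers but could not find a solution that makes the code below work. Basically, I am creating a function that inner_joins two data frames and filters on a column passed in as an argument.

The problem is that the filter part of the function does not work. However, it does work if I take the filter out of the function and append it afterwards, as in mydiff("a") %>% filter(ax != ay).

Any suggestions are helpful.

Note that I am quoting the function input.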

library(dplyr)

# fake data
df1<- tibble(id = seq(4,19,2), 
             a = c("a","b","c","d","e","f","g","h"), 
             b = c(rep("foo",3), rep("bar",5)))
df2<- tibble(id = seq(10, 20, 1), 
             a = c("d","a", "e","f","k","m","g","i","h", "a", "b"),
             b = c(rep("bar", 7), rep("foo",4)))

# What I am trying to do
dplyr::inner_join(df1, df2, by = "id") %>% select(id, b.x, b.y) %>% filter(b.x!=b.y)

#> # A tibble: 1 x 3
#>      id b.x   b.y  
#>   <dbl> <chr> <chr>
#> 1    18 bar   foo

# creating a function so that I can filter by difference in column if I have more columns
mydiff <- function(filteron, df_1 = df1, df_2 = df2){
  require(dplyr, warn.conflicts = F)
  col_1 = paste0(quo_name(filteron), "x")
  col_2 = paste0(quo_name(filteron), "y")
  my_df<- inner_join(df_1, df_2, by = "id", suffix = c("x", "y"))
  my_df %>% select(id, col_1, col_2) %>% filter(col_1 != col_2)
}

# the filter part is not working as expected. 
# There is no difference whether i pipe filter or leave it out
mydiff("a")

#> # A tibble: 5 x 3
#>      id ax    ay   
#>   <dbl> <chr> <chr>
#> 1    10 d     d    
#> 2    12 e     e    
#> 3    14 f     k    
#> 4    16 g     g    
#> 5    18 h     h

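This looks like an evaluation issue to me. Try this modified mydiff function, which uses the lazyeval package (note that filter_() has been deprecated since dplyr 0.7.0, so this approach targets older versions):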

mydiff <- function(filteron, df_1 = df1, df_2 = df2){
  require(dplyr, warn.conflicts = F)
  col_1 <- paste0(quo_name(filteron), "x")
  col_2 <- paste0(quo_name(filteron), "y")
  criteria <- lazyeval::interp(~ x != y, .values = list(x = as.name(col_1), y = as.name(col_2)))
  my_df <- inner_join(df_1, df_2, by = "id", suffix = c("x", "y"))
  my_df %>% select(id, col_1, col_2) %>% filter_(criteria)
}

You can check the Functions chapter of Hadley Wickham's book Advanced R for more information.

From https://dplyr.tidyverse.org/articles/programming.html

Most dplyr functions use non-standard evaluation (NSE). This is a catch-all term that means they don't follow the usual R rules of evaluation.

This sometimes causes problems when you try to wrap them in functions. Here is a base R version of the function you created.

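mydiff <- function(filteron, df_1 = df1, df_2 = df2) {
  # build the suffixed column names, e.g. "ax" and "ay"
  col_1 <- paste0(filteron, "x")
  col_2 <- paste0(filteron, "y")

  # join on the function arguments, not the global df1/df2
  my_df <- merge(df_1, df_2, by = "id", suffixes = c("x", "y"))

  # keep the rows where the two columns differ
  my_df[my_df[, col_1] != my_df[, col_2], c("id", col_1, col_2)]
}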

> mydiff("a")
  id ax ay
3 14  f  k
> mydiff("b")
  id  bx  by
5 18 bar foo

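This solves your problem and will probably keep working as expected, now and in the future. By relying less on external packages, you reduce this kind of problem, along with other quirks that may appear as package authors evolve their work.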

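The reason it does not work in your original function is that col_1 is a string, whereas dplyr::filter() expects an "unquoted" input variable on the LHS. So you first need to convert col_1 to a variable with sym(), then unquote it inside filter with !! (bang bang).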
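To make the failure concrete, here is a minimal sketch on a made-up two-row tibble (the name `toy` and its contents are assumptions for illustration, not data from the question). Because col_1 and col_2 are not columns of the data, filter() finds them in the calling environment and compares the two strings themselves, which gives a single TRUE that is recycled to every row.

```r
library(dplyr)

# hypothetical toy data with the same column naming as the joined frame
toy <- tibble(ax = c("d", "f"), ay = c("d", "k"))
col_1 <- "ax"
col_2 <- "ay"

# compares the strings "ax" and "ay": always TRUE, so no row is dropped
kept_all <- toy %>% filter(col_1 != col_2)

# converts the strings to symbols and unquotes them: compares the columns
kept_diff <- toy %>% filter(!!rlang::sym(col_1) != !!rlang::sym(col_2))

nrow(kept_all)
nrow(kept_diff)
```

The first call keeps both rows, while the second keeps only the row where ax and ay differ.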

rlang has a very nice function, qq_show, that shows what actually happens with quoting/unquoting (see the output below).

See also this similar question.

library(rlang)
library(dplyr)

# creating a function that can take either string or symbol as input
mydiff <- function(filteron, df_1 = df1, df_2 = df2) {

  col_1 <- paste0(quo_name(enquo(filteron)), "x")
  col_2 <- paste0(quo_name(enquo(filteron)), "y")

  my_df <- inner_join(df_1, df_2, by = "id", suffix = c("x", "y"))

  cat('\nwithout sym and unquote\n')
  qq_show(col_1 != col_2)

  cat('\nwith sym and unquote\n')
  qq_show(!!sym(col_1) != !!sym(col_2))
  cat('\n')

  my_df %>% 
    select(id, col_1, col_2) %>% 
    filter(!!sym(col_1) != !!sym(col_2))
}

### testing: filteron as a string
mydiff("a")
#> 
#> without sym and unquote
#> col_1 != col_2
#> 
#> with sym and unquote
#> ax != ay
#> 
#> # A tibble: 1 x 3
#>      id ax    ay   
#>   <dbl> <chr> <chr>
#> 1    14 f     k

### testing: filteron as a symbol
mydiff(a)
#> 
#> without sym and unquote
#> col_1 != col_2
#> 
#> with sym and unquote
#> ax != ay
#>  
#> # A tibble: 1 x 3
#>      id ax    ay   
#>   <dbl> <chr> <chr>
#> 1    14 f     k

Created on 2018-09-28 by the reprex package (v0.2.1.9000)

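The suggestion to use base R for simple functions is a good one, but it does not scale to more complex tidyverse functions, and you lose portability to dplyr backends such as databases. If you want to create functions around tidyverse pipelines, you will have to learn a bit about R expressions and the unquoting operator !!. I suggest skimming the first sections of https://tidyeval.tidyverse.org to get a rough idea of the concepts used here.

Since the function you want to create takes bare column names and does not involve complex expressions (like the ones you would pass to mutate() or summarise()), we do not need fancy things like quosures. We can work with symbols. To create a symbol, use as.name() or rlang::sym():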

as.name("mycolumn")
#> mycolumn

rlang::sym("mycolumn")
#> mycolumn

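The latter has the advantage of belonging to a larger family of functions: ensym(), and the plural variants syms() and ensyms(). We will use ensym() to capture a column name, i.e. to delay the execution of the column so that it can be passed to dplyr after a few transformations. Delaying execution is called "quoting".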
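As a small sketch of that family in action (the helper name `capture_name` is hypothetical and only serves the illustration), ensym() accepts either a bare name or a string and returns a symbol in both cases, and syms() turns a character vector into a list of symbols:

```r
library(rlang)

# hypothetical helper: quote the argument and return its name as a string
capture_name <- function(var) {
  var <- ensym(var)  # works for a bare column name or a string
  as_string(var)
}

from_bare   <- capture_name(mycolumn)
from_string <- capture_name("mycolumn")

# plural variant: one symbol per element of the character vector
cols <- syms(c("a.x", "a.y"))
```

Both calls to the helper yield the same string, which is why a function built on ensym() can be called as f(a) or f("a").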

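I have made a few changes to the interface of your function:

  • Take the data frames first, for consistency with dplyr functions.

  • Do not provide default values for the data frames. Those defaults assume too much.

  • Make by and suffix user-configurable, with reasonable defaults.

Here is the code, with explanations inline: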

mydiff <- function(df1, df2, var, by = "id", suffix = c(".x", ".y")) {
  stopifnot(is.character(suffix), length(suffix) == 2)

  # Let's start by the easy task, joining the data frames
  df <- dplyr::inner_join(df1, df2, by = by, suffix = suffix)

  # Now onto dealing with the diff variable. `ensym()` takes a column
  # name and delays its execution:
  var <- rlang::ensym(var)

  # A delayed column name is not a string, it's a symbol. So we need
  # to transform it to a string in order to work with paste() etc.
  # `quo_name()` works in this case but is generally only for
  # providing default names.
  #
  # Better use base::as.character() or rlang::as_string() (the latter
  # works a bit better on Windows with foreign UTF-8 characters):
  var_string <- rlang::as_string(var)

  # Now let's add the suffix to the name:
  col1_string <- paste0(var_string, suffix[[1]])
  col2_string <- paste0(var_string, suffix[[2]])

  # dplyr::select() supports column names as strings but it is an
  # exception in the dplyr API. Generally, dplyr functions take bare
  # column names, i.e. symbols. So let's transform the strings back to
  # symbols:
  col1 <- rlang::sym(col1_string)
  col2 <- rlang::sym(col2_string)

  # The delayed column names now need to be inserted back into the
  # dplyr code. This is accomplished by unquoting with the !!
  # operator:
  df %>%
    dplyr::select(id, !!col1, !!col2) %>%
    dplyr::filter(!!col1 != !!col2)
}

mydiff(df1, df2, b)
#> # A tibble: 1 x 3
#>      id b.x   b.y
#>   <dbl> <chr> <chr>
#> 1    18 bar   foo

mydiff(df1, df2, "a")
#> # A tibble: 1 x 3
#>      id a.x   a.y
#>   <dbl> <chr> <chr>
#> 1    14 f     k

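You can also simplify the function by taking strings instead of bare column names. In this version, I use syms() to create a list of symbols and !!! to pass them all at once to select():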

mydiff2 <- function(df1, df2, var, by = "id", suffix = c(".x", ".y")) {
  stopifnot(
    is.character(suffix), length(suffix) == 2,
    is.character(var), length(var) == 1
  )

  # Create a list of symbols from a character vector:
  cols <- rlang::syms(paste0(var, suffix))

  df <- dplyr::inner_join(df1, df2, by = by, suffix = suffix)

  # Unquote the whole list at once with the big bang !!!
  df %>%
    dplyr::select(id, !!!cols) %>%
    dplyr::filter(!!cols[[1]] != !!cols[[2]])
}

mydiff2(df1, df2, "a")
#> # A tibble: 1 x 3
#>      id a.x   a.y
#>   <dbl> <chr> <chr>
#> 1    14 f     k
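For string inputs, another option is to skip symbols entirely and use the .data pronoun together with all_of(). This is a different technique from the one in this answer, sketched here under the assumption of dplyr 1.0.0 or later; the name mydiff3 is made up for the example, and the data is copied from the question so that the block runs on its own.

```r
library(dplyr)

# data from the question
df1 <- tibble(id = seq(4, 19, 2),
              a = c("a", "b", "c", "d", "e", "f", "g", "h"),
              b = c(rep("foo", 3), rep("bar", 5)))
df2 <- tibble(id = seq(10, 20, 1),
              a = c("d", "a", "e", "f", "k", "m", "g", "i", "h", "a", "b"),
              b = c(rep("bar", 7), rep("foo", 4)))

# hypothetical variant of mydiff2 that works with strings only
mydiff3 <- function(df1, df2, var, by = "id", suffix = c(".x", ".y")) {
  stopifnot(
    is.character(suffix), length(suffix) == 2,
    is.character(var), length(var) == 1
  )

  # suffixed column names, e.g. "a.x" and "a.y"
  cols <- paste0(var, suffix)

  dplyr::inner_join(df1, df2, by = by, suffix = suffix) %>%
    # all_of() selects columns named in a character vector
    dplyr::select(dplyr::all_of(c(by, cols))) %>%
    # .data[[...]] looks the string up as a column of the data
    dplyr::filter(.data[[cols[1]]] != .data[[cols[2]]])
}

res <- mydiff3(df1, df2, "a")
```

Selecting c(by, cols) rather than a hard-coded id also keeps the function consistent with a user-supplied by.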

Finding the indices where col_1 != col_2 first may be enough to solve this.

mydiff <- function(filteron, df_1 = df1, df_2 = df2){
  require(dplyr, warn.conflicts = F)
  col_1 <- paste0(quo_name(filteron), "x")
  col_2 <- paste0(quo_name(filteron), "y")
  my_df <-
    inner_join(df_1, df_2, by = "id", suffix = c("x", "y")) %>%
    select(id, col_1, col_2)
  # logical vector marking the rows where the two columns differ
  # ([[ ]] extracts a plain vector, which is safe for tibbles)
  differs <- my_df[[col_1]] != my_df[[col_2]]
  # return only those rows
  my_df[differs, ]
}
mydiff("a")
#> # A tibble: 1 x 3
#>      id ax    ay   
#>   <dbl> <chr> <chr>
#> 1    14 f     k