Pass multiple column names in function to dplyr::distinct() with Spark

I want to pass an unknown number of column names to a function that calls dplyr::distinct(). My current attempt is:

myFunction <- function(table, id) {
  table %>%
    dplyr::distinct(.data[[id]])
}

I am using .data[[id]] above because the data-masking section of this dplyr blog states:

When you have an env-variable that is a character vector, you need to index into the .data pronoun with [[, like summarise(df, mean = mean(.data[[var]])).

The documentation for dplyr::distinct() describes its second argument as:

<data-masking> Optional variables to use when determining uniqueness. If there are multiple rows for a given combination of inputs, only the first row will be preserved. If omitted, will use all variables.

Spark

More specifically, I am trying to use this function with Spark.

sc <- sparklyr::spark_connect(master = "local")
mtcars_tbl <- sparklyr::copy_to(sc, mtcars, "mtcars_spark")

##### desired return
mtcars_tbl %>% dplyr::distinct(cyl, gear)
# Source: spark<?> [?? x 2]
    cyl  gear
  <dbl> <dbl>
1     6     4
2     4     4
3     6     3
4     8     3
5     4     3
6     4     5
7     8     5
8     6     5

##### myFunction fails
id = c("cyl", "gear")
myFunction(mtcars_tbl, id)
 Error: Can't convert a call to a string
Run `rlang::last_error()` to see where the error occurred. 

Following this comment, I also tried these approaches, which failed as well:

myFunction <- function(table, id) {
    table %>%
        dplyr::distinct(.dots = id)
}

myFunction(mtcars_tbl, id)
# Source: spark<?> [?? x 1]
  .dots           
  <list>          
1 <named list [2]>


#####


myFunction <- function(table, id) {
    table %>%
        dplyr::distinct_(id)
}

myFunction(mtcars_tbl, id)
Error in UseMethod("distinct_") : 
  no applicable method for 'distinct_' applied to an object of class "c('tbl_spark', 'tbl_sql', 'tbl_lazy', 'tbl')"

distinct() is applied to all columns of the table at once. Consider an example table:

A     B
1     4
1     4
2     3
2     3
3     3
3     5

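It is not clear what should be returned if distinct were applied to column A but not to column B. The result below is clearly not a good option, because it breaks the relationship between columns A and B. For example, the original dataset has no row with (A = 2, B = 4).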

A     B
1     4
2     4
3     3
      3
      3
      5
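To make this concrete, here is a minimal sketch that rebuilds the first example table as a local data frame (a stand-in chosen for illustration, so no Spark connection is needed). It shows that dplyr avoids the mismatched result above: with the default .keep_all = FALSE, naming a column in distinct() returns only that column.

```r
library(dplyr)

# Local stand-in for the example table; the name `df` is illustrative.
df <- data.frame(
  A = c(1, 1, 2, 2, 3, 3),
  B = c(4, 4, 3, 3, 3, 5)
)

# With no columns named, distinct() deduplicates whole rows,
# so every remaining (A, B) pair exists in the original data.
all_cols <- distinct(df)

# Naming a column keeps only that column by default (.keep_all = FALSE),
# so A is never paired with a B value it did not originally appear with.
only_a <- distinct(df, A)

print(all_cols)
print(only_a)
```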

So the best approach is to first select only the columns you want, and then take the distinct rows. Something like:

myFunction <- function(table, id) {
  table %>%
    dplyr::select(dplyr::all_of(id)) %>%
    dplyr::distinct()
}
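As a rough usage sketch, the function can be exercised on the local mtcars data frame, used here as a stand-in for mtcars_tbl so that the example runs without a Spark connection. The second function, myFunction2, is a hypothetical alternative that is not part of the answer above: it assumes rlang is installed, converts each name to a symbol, and splices the symbols into distinct(). Both should behave the same way on a tbl_spark, since sparklyr translates these verbs through dbplyr, but that has not been verified here.

```r
library(dplyr)

# The select-then-distinct approach from the answer.
myFunction <- function(table, id) {
  table %>%
    dplyr::select(dplyr::all_of(id)) %>%
    dplyr::distinct()
}

# Hypothetical alternative (assumption, not from the answer):
# turn each column name into a symbol and splice them into distinct().
myFunction2 <- function(table, id) {
  table %>%
    dplyr::distinct(!!!rlang::syms(id))
}

id <- c("cyl", "gear")

# Local data frame used in place of the Spark table.
res1 <- myFunction(mtcars, id)
res2 <- myFunction2(mtcars, id)

print(res1)
print(res2)
```

Unlike .data[[id]], which expects a single string, both versions accept a character vector of any length.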