Why does adding 1 actually add 2 in sparklyr when using dynamic variable names?

Question

When I run the code below, I expect the values in the Sepal_Width_2 column to be Sepal_Width + 1, but they are actually Sepal_Width + 2. What gives?

require(dplyr)
require(sparklyr)

Sys.setenv(SPARK_HOME='/usr/lib/spark')
sc <- spark_connect(master="yarn")

# for this example these variables are hard coded
# but in my actual code these are named dynamically
sw_name <- as.name('Sepal_Width')
sw2 <- "Sepal_Width_2"
sw2_name <- as.name(sw2)

ir <- copy_to(sc, iris)

print(head(ir %>% mutate(!!sw2 := sw_name))) # so far so good
# Source: spark<?> [?? x 6]
# Sepal_Length Sepal_Width Petal_Length Petal_Width Species Sepal_Width_2
# <dbl>       <dbl>        <dbl>       <dbl> <chr>           <dbl>
# 5.1         3.5          1.4         0.2 setosa            3.5
# 4.9         3            1.4         0.2 setosa            3  
# 4.7         3.2          1.3         0.2 setosa            3.2
# 4.6         3.1          1.5         0.2 setosa            3.1
# 5           3.6          1.4         0.2 setosa            3.6
# 5.4         3.9          1.7         0.4 setosa            3.9

print(head(ir %>% mutate(!!sw2 := sw_name) %>% mutate(!!sw2 := sw2_name + 1))) # i guess 2+2 != 4?
# Source: spark<?> [?? x 6]
# Sepal_Length Sepal_Width Petal_Length Petal_Width Species Sepal_Width_2
# <dbl>       <dbl>        <dbl>       <dbl> <chr>           <dbl>
# 5.1         3.5          1.4         0.2 setosa            5.5
# 4.9         3            1.4         0.2 setosa            5  
# 4.7         3.2          1.3         0.2 setosa            5.2
# 4.6         3.1          1.5         0.2 setosa            5.1
# 5           3.6          1.4         0.2 setosa            5.6
# 5.4         3.9          1.7         0.4 setosa            5.9

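My use case requires the dynamic variable naming you see above. In this example it is rather silly (compared with just using the variables directly), but in my use case I run the same function over hundreds of different Spark tables. They all share the same "schema" in terms of the number of columns and what each column contains (the output of some machine learning models), but the names differ because each table holds the output of a different model. The names are predictable, but since they vary, I build them dynamically as you see here rather than hard-coding them.

Spark seems to know how to add 2 and 2 when the names are hard-coded, but it suddenly goes haywire when the names are dynamic.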

Answer 1

You are probably misusing as.name, which causes sparklyr to misinterpret your input.

Note that your code fails when it is run against a local table:

sw_name <- as.name('Sepal.Width') # swap "_" to "." to match variable names
sw2 <- "Sepal_Width_2"
sw2_name <- as.name(sw2)
data(iris)

print(head(iris %>% mutate(!!sw2 := sw_name)))
# Error: Problem with `mutate()` input `Sepal_Width_2`.
# x object 'Sepal.Width' not found
# i Input `Sepal_Width_2` is `sw_name`.

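Note that you are using both the !! operator from rlang and as.name from base R. However, you never apply them together: in the question's code, sw_name and sw2_name are passed to mutate() without being unquoted with !!.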
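To make the difference concrete, here is a minimal sketch using rlang::expr() to capture expressions outside of any data frame. It only illustrates how the two forms are captured; it does not show how sparklyr translates them to SQL, which this answer does not establish.

```r
library(rlang)

sw_name <- as.name("Sepal.Width")

# Without !!, the captured expression refers to the R variable `sw_name` itself.
e_plain <- expr(sw_name + 1)

# With !!, the symbol stored in `sw_name` is inserted into the expression.
e_unquoted <- expr(!!sw_name + 1)

print(e_plain)     # sw_name + 1
print(e_unquoted)  # Sepal.Width + 1
```

Only the second form names a column that actually exists in the table, which is why the local mutate() call above cannot find what it is looking for.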

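I suggest using sym and !! from the rlang package instead of as.name, applying both to the strings that hold the column names. The following works locally and is consistent with the non-standard evaluation guidance, so it should translate to Spark: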

library(dplyr)
data(iris)

sw <- 'Sepal.Width'
sw2 <- paste0(sw, "_2")

head(iris %>% mutate(!!sym(sw2) := !!sym(sw)))
head(iris %>% mutate(!!sym(sw2) := !!sym(sw)) %>% mutate(!!sym(sw2) := !!sym(sw2) + 1))
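Since the question mentions running the same function over hundreds of tables, the same pattern can be wrapped in a helper. This is only a sketch: add_offset_column and its arguments are hypothetical names, and it is exercised here on the local iris data frame, not on a Spark table, so behaviour under sparklyr is assumed rather than verified.

```r
library(dplyr)

# Hypothetical helper: copies column `col` into "<col>_2", then adds `offset`.
# `col` is a string, so the column name can be built dynamically by the caller.
add_offset_column <- function(tbl, col, offset = 1) {
  new_col <- paste0(col, "_2")
  tbl %>%
    mutate(!!sym(new_col) := !!sym(col)) %>%
    mutate(!!sym(new_col) := !!sym(new_col) + offset)
}

res <- add_offset_column(iris, "Sepal.Width")
head(res[, c("Sepal.Width", "Sepal.Width_2")])
```

Passing the column names as strings and converting them with sym() inside the function keeps every reference explicitly unquoted, which is the point of the suggestion above.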

Answer 2

I'm not sure which package was the culprit (sparklyr, dplyr, R, who knows), but the problem was fixed when I upgraded from R 3.6.3/sparklyr 1.5 to R 4.0.2/sparklyr 1.7.0.
