sparklyr: create new column with mutate function

Question

I would be surprised if this kind of problem could not be solved with sparklyr:

iris_tbl <- copy_to(sc, aDataFrame)

# date_vector is a character vector of elements
# in this format: YYYY-MM-DD (year, month, day)
for (d in date_vector) {
   ...
   aDataFrame %>% mutate(newValue=gsub("-","",d))
   ...
}

I get this error:

Error: org.apache.spark.sql.AnalysisException: Undefined function: 'GSUB'. This function is neither a registered temporary function nor a permanent function registered in the database 'default'.; line 2 pos 86
    at org.apache.spark.sql.catalyst.catalog.SessionCatalog.failFunctionLookup(SessionCatalog.scala:787)
    at org.apache.spark.sql.hive.HiveSessionCatalog.lookupFunction0(HiveSessionCatalog.scala:200)
    at org.apache.spark.sql.hive.HiveSessionCatalog.lookupFunction(HiveSessionCatalog.scala:172)
    at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$$anonfun$applyOrElse$$anonfun$applyOrElse.apply(Analyzer.scala:884)
    at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$$anonfun$applyOrElse$$anonfun$applyOrElse.apply(Analyzer.scala:884)
    at org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:48)
    at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$$anonfun

But this line:

aDataFrame %>% mutate(newValue=toupper("hello"))

works fine. Any help?

Answer 1

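I strongly recommend reading the sparklyr documentation before you go any further. In particular, read the section on how R is translated to SQL (http://spark.rstudio.com/dplyr.html#sql_translation). In short, only a very limited subset of R functions is available on sparklyr data frames, and gsub is not one of them (but toupper is). If you really need gsub, you will have to collect the data into a local data frame, apply gsub to it there (you can still use mutate), and then copy_to it back to Spark.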
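A minimal sketch of the local step in that round trip, runnable in plain R without Spark. The helper name strip_dashes, the column name d, and the sample dates are assumptions made for illustration; the collect and copy_to calls appear only as comments because they need a live connection.

```r
# Hypothetical helper: remove every "-" from one character column
# of a local (already collected) data frame.
strip_dashes <- function(df, column) {
  df[[column]] <- gsub("-", "", df[[column]], fixed = TRUE)
  df
}

# Stand-in for the result of collect(); the values are made up.
local_df <- data.frame(d = c("2017-01-31", "2017-02-01"),
                       stringsAsFactors = FALSE)
cleaned <- strip_dashes(local_df, "d")
print(cleaned$d)

# With a connection `sc` and a Spark table `spark_tbl` (both assumed),
# the full round trip would look roughly like:
#   local_df <- dplyr::collect(spark_tbl)
#   cleaned  <- strip_dashes(local_df, "d")
#   new_tbl  <- dplyr::copy_to(sc, cleaned, "cleaned_dates", overwrite = TRUE)
```

Note that collect pulls the whole table into the R session, so this route is only practical when the data fits in local memory.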

Answer 2

It may be worth adding that the available documentation states:

Hive Functions

Many of Hive’s built-in functions (UDF) and built-in aggregate functions (UDAF) can be called inside dplyr’s mutate and summarize. The Language Reference UDF page provides the list of available functions.

Hive

As outlined in the documentation, a workable solution should be achievable with regexp_replace:

Returns the string resulting from replacing all substrings in INITIAL_STRING that match the java regular expression syntax defined in PATTERN with instances of REPLACEMENT. For example, regexp_replace("foobar", "oo|ar", "") returns 'fb.' Note that some care is necessary in using predefined character classes: using '\s' as the second argument will match the letter s; '\\s' is necessary to match whitespace, etc.

`sparklyr` approach

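Given the above, it should be possible to combine a sparklyr pipeline with regexp_replace to get an effect equivalent to applying gsub to the desired column. Test code that removes the - character from the variable d could be built as follows: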

aDataFrame %>% 
  mutate(clnD = regexp_replace(d, "-", "")) %>%
  # ...

where class(aDataFrame) returns: "tbl_spark" ....
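In the question's loop, d is a local R string rather than a column of the Spark table. In that case, another option may be to do the replacement in R before the value reaches mutate, so that no gsub call is ever translated to SQL. The sketch below shows only the local part. The sample dates and the table name spark_tbl are assumptions, and the mutate call is left as a comment because it needs a live connection.

```r
# date_vector mirrors the question: character dates in YYYY-MM-DD format.
# The values themselves are made up for illustration.
date_vector <- c("2017-01-31", "2017-02-01")

# gsub runs here, in the local R session, so Spark never sees it.
clean_dates <- gsub("-", "", date_vector, fixed = TRUE)

for (clean_d in clean_dates) {
  # Assumed usage with a Spark table `spark_tbl` (hypothetical name):
  #   spark_tbl %>% mutate(newValue = clean_d)
  print(clean_d)
}
```

This relies on dplyr passing a plain local value through as a literal, which is worth confirming against the sparklyr and dbplyr versions in use.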
