sparklyr: create new column with mutate function
I would be surprised if this kind of problem could not be solved with sparklyr:
iris_tbl <- copy_to(sc, aDataFrame)
# date_vector is a character vector of elements
# in this format: YYYY-MM-DD (year, month, day)
for (d in date_vector) {
  ...
  aDataFrame %>% mutate(newValue = gsub("-", "", d))
  ...
}
I get this error:
Error: org.apache.spark.sql.AnalysisException: Undefined function: 'GSUB'. This function is neither a registered temporary function nor a permanent function registered in the database 'default'.; line 2 pos 86
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.failFunctionLookup(SessionCatalog.scala:787)
at org.apache.spark.sql.hive.HiveSessionCatalog.lookupFunction0(HiveSessionCatalog.scala:200)
at org.apache.spark.sql.hive.HiveSessionCatalog.lookupFunction(HiveSessionCatalog.scala:172)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$$anonfun$applyOrElse$$anonfun$applyOrElse.apply(Analyzer.scala:884)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$$anonfun$applyOrElse$$anonfun$applyOrElse.apply(Analyzer.scala:884)
at org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:48)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$$anonfun
But this line:
aDataFrame %>% mutate(newValue = toupper("hello"))
works fine. Any help?
I would strongly recommend that you read the sparklyr documentation before going further. In particular, read the section on how R is translated to SQL (http://spark.rstudio.com/dplyr.html#sql_translation). In short, only a very limited subset of R functions is available for use on sparklyr data frames, and gsub is not one of them (but toupper is). If you really need gsub, you will have to collect the data into a local data frame, gsub it there (you can still use mutate), and then copy_to back into Spark.
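Sketched out, that round trip could look roughly like this (a minimal sketch, assuming an open Spark connection sc and an existing tbl_spark named aDataFrame with a character column d; the output table name is illustrative):

library(sparklyr)
library(dplyr)

# Pull the rows down to a local data frame so plain R functions work again.
local_df <- aDataFrame %>%
  collect() %>%
  mutate(newValue = gsub("-", "", d))   # ordinary R gsub on local data

# Push the modified data back to Spark under an illustrative table name.
aDataFrame_clean <- copy_to(sc, local_df, "a_data_frame_clean", overwrite = TRUE)

Keep in mind that collect() brings every row to the driver, so this only makes sense when the data fits in local memory.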
It may be worth adding that the documentation states:
Hive Functions
Many of Hive’s built-in functions (UDF) and built-in aggregate functions (UDAF) can be called inside dplyr’s mutate and summarize. The Language Reference UDF page provides the list of available functions.
Hive
As described in the documentation, a workable solution should be achievable using regexp_replace:
Returns the string resulting from replacing all substrings in INITIAL_STRING that match the java regular expression syntax defined in PATTERN with instances of REPLACEMENT. For example, regexp_replace("foobar", "oo|ar", "") returns 'fb.' Note that some care is necessary in using predefined character classes: using '\s' as the second argument will match the letter s; '\\s' is necessary to match whitespace, etc.
sparklyr approach
Given the above, it should be possible to combine a sparklyr pipeline with regexp_replace to achieve the same effect as applying gsub to the desired column. Test code for removing the - character from a variable d could be constructed as follows:
aDataFrame %>%
  mutate(clnD = regexp_replace(d, "-", "")) %>%
  # ...
where class(aDataFrame) returns "tbl_spark".
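For completeness, here is a self-contained sketch of the same idea with made-up sample data (the table name and values are hypothetical, and a local Spark installation is assumed):

library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")

# Hypothetical sample data with a YYYY-MM-DD character column `d`.
dates_df  <- data.frame(d = c("2017-01-15", "2017-02-20"), stringsAsFactors = FALSE)
dates_tbl <- copy_to(sc, dates_df, "dates_tbl", overwrite = TRUE)

# regexp_replace is passed through to Spark SQL, so the work stays in Spark.
dates_tbl %>%
  mutate(clnD = regexp_replace(d, "-", "")) %>%
  collect()
# clnD should now contain "20170115" and "20170220".

spark_disconnect(sc)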