Is it possible to use native R code or other R package functions with sparklyr?
I've gotten to the point where I can follow the example here (with only a slight modification: adding config=list() to the input arguments).
sc <- spark_connect(master = "yarn-client", config=list())
library(dplyr)
flights_tbl <- copy_to(sc, nycflights13::flights, "flights")
flights_tbl %>% filter(dep_delay == 2)
Source: query [?? x 16]
Database: spark connection master=yarn-client app=sparklyr local=FALSE
year month day dep_time dep_delay arr_time arr_delay carrier tailnum flight origin dest air_time distance hour minute
<int> <int> <int> <int> <dbl> <int> <dbl> <chr> <chr> <int> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 2013 1 1 517 2 830 11 "UA" "N14228" 1545 "EWR" "IAH" 227 1400 5 17
2 2013 1 1 542 2 923 33 "AA" "N619AA" 1141 "JFK" "MIA" 160 1089 5 42
3 2013 1 1 702 2 1058 44 "B6" "N779JB" 671 "JFK" "LAX" 381 2475 7 2
4 2013 1 1 715 2 911 21 "UA" "N841UA" 544 "EWR" "ORD" 156 719 7 15
5 2013 1 1 752 2 1025 -4 "UA" "N511UA" 477 "LGA" "DEN" 249 1620 7 52
6 2013 1 1 917 2 1206 -5 "B6" "N568JB" 41 "JFK" "MCO" 145 944 9 17
7 2013 1 1 932 2 1219 -6 "VX" "N641VA" 251 "JFK" "LAS" 324 2248 9 32
8 2013 1 1 1028 2 1350 11 "UA" "N76508" 1004 "LGA" "IAH" 237 1416 10 28
9 2013 1 1 1042 2 1325 -1 "B6" "N529JB" 31 "JFK" "MCO" 142 944 10 42
10 2013 1 1 1231 2 1523 -6 "UA" "N402UA" 428 "EWR" "FLL" 156 1065 12 31
# ... with more rows
However, when I try to use other R functions the way I use dplyr, things break:
flights_tbl %>% filter(dep_delay == 2 & grepl("A$", tailnum))
Source: query [?? x 16]
Database: spark connection master=yarn-client app=sparklyr local=FALSE
Error: org.apache.spark.sql.AnalysisException: undefined function GREPL; line 4 pos 41
at org.apache.spark.sql.hive.HiveFunctionRegistry$$anonfun$lookupFunction$$anonfun.apply(hiveUDFs.scala:69)
at org.apache.spark.sql.hive.HiveFunctionRegistry$$anonfun$lookupFunction$$anonfun.apply(hiveUDFs.scala:69)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.sql.hive.HiveFunctionRegistry$$anonfun$lookupFunction.apply(hiveUDFs.scala:68)
at org.apache.spark.sql.hive.HiveFunctionRegistry$$anonfun$lookupFunction.apply(hiveUDFs.scala:64)
at scala.util.Try.getOrElse(Try.scala:77)
at org.apache.spark.sql.hive.HiveFunctionRegistry.lookupFunction(hiveUDFs.scala:64)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$$anonfun$applyOrElse$$anonfun$applyOrElse.apply(Analyzer.scala:574)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$$anonfun$applyOrElse$$anonfun$applyOrElse.apply(Analyzer.
Apparently grepl is not supported. My questions are: is there a way to use base R or R package functions? If not, is it on the way? It seems work along these lines is making progress with dapply and gapply in SparkR v2, but it would be great if it worked with sparklyr.
Just saw this issue on sparklyr. The short answer is "not yet". Look for a future release to add this feature.
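In the meantime, two workarounds are worth sketching (both assume the same sc connection and flights_tbl as above; neither is sparklyr-specific magic). dplyr's SQL translation passes unrecognized function calls through to the backend verbatim, so Hive/Spark SQL functions can often be called directly inside filter() — here I'm assuming rlike resolves to Hive's RLIKE regex match, which may vary by Spark/Hive version. Failing that, collect() brings the (ideally pre-filtered) data into local R, where any base R function works:

```r
library(dplyr)

# Workaround 1: unknown functions are passed through untranslated,
# so this becomes `... WHERE rlike(tailnum, 'A$')` in Hive/Spark SQL.
# rlike() here is a SQL function, not an R function.
flights_tbl %>%
  filter(dep_delay == 2 & rlike(tailnum, "A$"))

# Workaround 2: filter on the cluster first, then collect() the
# result into a local tibble and apply base R functions like grepl().
flights_tbl %>%
  filter(dep_delay == 2) %>%
  collect() %>%
  filter(grepl("A$", tailnum))
```

The collect() route is the safe fallback, but only once the Spark-side filters have shrunk the data to something that fits in local memory.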