sparklyr::sdf_quantile() error

I know Spark 1.6.0 is probably outdated by now, but we have it in our stack. I am trying to use sparklyr::sdf_quantile():

mtc <- copy_to(sc, mtcars, "mtcars")
mtc %>% sdf_quantile("hp")
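
where sc comes from a connection along these lines (simplified sketch; the exact master and configuration values are omitted):

# simplified connection sketch; master/config values are placeholders
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "yarn-client", version = "1.6.0")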

I get the following error (with Spark 1.6.0 via YARN):

Error: java.lang.IllegalArgumentException: invalid method approxQuantile for object 168
    at sparklyr.Invoke$.invoke(invoke.scala:122)
    at sparklyr.StreamHandler$.handleMethodCall(stream.scala:97)
    at sparklyr.StreamHandler$.read(stream.scala:62)
    at sparklyr.BackendHandler.channelRead0(handler.scala:52)
    at sparklyr.BackendHandler.channelRead0(handler.scala:14)
    at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
    at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
    at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:244)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
    at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:846)
    at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131)
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
    at io.netty.util.concurrent.SingleThreadEventExecutor.run(SingleThreadEventExecutor.java:111)
    at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137)
    at java.lang.Thread.run(Thread.java:745)

Here is the sessionInfo() of this machine:

sessionInfo()
Oracle Distribution of R version 3.3.0  (--)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Oracle Linux Server 7.2

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8       
 [4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
[10] LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] kudusparklyr_0.1.0  sparklyr_0.7.0      dbplot_0.2.0        rlang_0.1.4        
 [5] bindrcpp_0.2        anytime_0.3.0       jsonlite_1.5        magrittr_1.5       
 [9] ggplot2_2.2.1       DBI_0.7             dtplyr_0.0.2        dplyr_0.7.4        
[13] data.table_1.10.4-3 devtools_1.13.4     httr_1.3.1         

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.14       dbplyr_1.1.0       plyr_1.8.4         bindr_0.1         
 [5] base64enc_0.1-3    tools_3.3.0        digest_0.6.12      lattice_0.20-33   
 [9] nlme_3.1-127       memoise_1.1.0      tibble_1.3.4       gtable_0.2.0      
[13] pkgconfig_2.0.1    psych_1.7.8        shiny_1.0.5        rstudioapi_0.7    
[17] yaml_2.1.15        parallel_3.3.0     stringr_1.2.0      withr_2.1.0       
[21] rprojroot_1.2      grid_3.3.0         glue_1.2.0         R6_2.2.2          
[25] foreign_0.8-66     reshape2_1.4.2     purrr_0.2.4        tidyr_0.7.2       
[29] scales_0.5.0       backports_1.1.1    htmltools_0.3.6    mnormt_1.5-5      
[33] assertthat_0.2.0   xtable_1.8-2       mime_0.5           RApiDatetime_0.0.3
[37] colorspace_1.3-2   httpuv_1.3.5       labeling_0.3       config_0.2        
[41] stringi_1.1.6      openssl_0.9.9      lazyeval_0.2.1     munsell_0.4.3     
[45] broom_0.4.3

On another machine (with Spark 2.2.0 locally) it works:

mtc %>% sdf_quantile("hp")
  0%  25%  50%  75% 100% 
  52   95  123  180  335

with the following session info:

sessionInfo()
R version 3.4.1 (2017-06-30)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=German_Austria.1252  LC_CTYPE=German_Austria.1252   
[3] LC_MONETARY=German_Austria.1252 LC_NUMERIC=C                   
[5] LC_TIME=German_Austria.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] rsparkling_0.2.2    leaflet_1.1.0       dplyr_0.7.4         purrr_0.2.4        
 [5] readr_1.1.1         tidyr_0.6.1         tibble_1.4.1        ggplot2_2.2.1      
 [9] tidyverse_1.1.1     sparklyr_0.7.0-9030

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.12     lubridate_1.6.0  lattice_0.20-35  assertthat_0.2.0 rprojroot_1.2   
 [6] digest_0.6.12    psych_1.7.3.21   mime_0.5         R6_2.2.2         cellranger_1.1.0
[11] plyr_1.8.4       backports_1.0.5  evaluate_0.10    httr_1.2.1       pillar_1.0.1    
[16] rlang_0.1.6      lazyeval_0.2.0   readxl_1.0.0     rstudioapi_0.7   rmarkdown_1.6   
[21] config_0.2       stringr_1.2.0    foreign_0.8-69   htmlwidgets_0.8  RCurl_1.95-4.8  
[26] munsell_0.4.3    shiny_1.0.5      broom_0.4.2      compiler_3.4.1   httpuv_1.3.5    
[31] modelr_0.1.0     pkgconfig_2.0.1  base64enc_0.1-3  mnormt_1.5-5     htmltools_0.3.5 
[36] openssl_0.9.7    withr_2.0.0      dbplyr_1.2.0     rappdirs_0.3.1   bitops_1.0-6    
[41] grid_3.4.1       nlme_3.1-131     jsonlite_1.5     xtable_1.8-2     gtable_0.2.0    
[46] DBI_0.7          magrittr_1.5     scales_0.4.1     stringi_1.1.3    reshape2_1.4.2  
[51] bindrcpp_0.2     xml2_1.1.1       tools_3.4.1      forcats_0.2.0    glue_1.2.0      
[56] hms_0.3          crosstalk_1.0.0  parallel_3.4.1   yaml_2.1.14      colorspace_1.3-2
[61] h2o_3.14.0.2     rvest_0.3.2      knitr_1.15.1     bindr_0.1        haven_1.0.0

Any idea what is going wrong?

approxQuantile was introduced in Spark 2.0 (SPARK-6761). You have to update your Apache Spark installation to use it.
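
Since the failing call is the backend approxQuantile() method, a simple guard on the Spark version avoids the error (a sketch; the fallback can be the Hive route below):

# sketch: only call sdf_quantile() when the backend approxQuantile() exists (Spark >= 2.0)
if (spark_version(sc) >= "2.0.0") {
  mtc %>% sdf_quantile("hp")
} else {
  message("approxQuantile needs Spark 2.0+; use a percentile_approx fallback instead")
}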

If you have Hive support enabled, you can try the percentile_approx Hive function:

df <- copy_to(sc, iris)

sc %>% spark_session() %>%
  invoke("sql", "SELECT percentile_approx(Sepal_Length, 0.5) FROM iris") %>% 
  sdf_register("median")

# # Source:   table<median> [?? x 1]
# # Database: spark_connection
#   `_c0`
#   <dbl>
# 1  5.73
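
Note that spark_session() maps to SparkSession, which is also a 2.x concept; on Spark 1.6 with Hive support you may need hive_context(sc) as the object to invoke("sql", ...) on instead. As a further sketch (again assuming Hive support, since percentile_approx is a Hive UDAF), the same function can be reached through dplyr's SQL translation on the registered table from the question:

# sketch, assuming Hive support: dbplyr passes percentile_approx() through to SQL
library(dplyr)

tbl(sc, "mtcars") %>%
  summarise(
    p25 = percentile_approx(hp, 0.25),
    p50 = percentile_approx(hp, 0.50),
    p75 = percentile_approx(hp, 0.75)
  ) %>%
  collect()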