Start H2O context on Databricks with rsparkling
Question
I want to use H2O's Sparkling Water on a multi-node Azure Databricks cluster, both interactively via RStudio and in jobs via R notebooks. I can start an H2O cluster and a Sparkling Water context in rocker/verse:4.0.3 and databricksruntime/rbase:latest (as well as databricksruntime/standard) Docker containers on my local machine, but so far not on a Databricks cluster. It looks like a classic classpath problem:
Error : java.lang.ClassNotFoundException: ai.h2o.sparkling.H2OConf
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:419)
at com.databricks.backend.daemon.driver.ClassLoaders$LibraryClassLoader.loadClass(ClassLoaders.scala:151)
at java.lang.ClassLoader.loadClass(ClassLoader.java:352)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:264)
at sparklyr.StreamHandler.handleMethodCall(stream.scala:106)
at sparklyr.StreamHandler.read(stream.scala:61)
at sparklyr.BackendHandler.$anonfun$channelRead0(handler.scala:58)
at scala.util.control.Breaks.breakable(Breaks.scala:42)
at sparklyr.BackendHandler.channelRead0(handler.scala:39)
at sparklyr.BackendHandler.channelRead0(handler.scala:14)
at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:99)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:321)
at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:295)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:163)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:714)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:650)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:576)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493)
at io.netty.util.concurrent.SingleThreadEventExecutor.run(SingleThreadEventExecutor.java:989)
at io.netty.util.internal.ThreadExecutorMap.run(ThreadExecutorMap.java:74)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.lang.Thread.run(Thread.java:748)
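One way to confirm that this really is a classpath problem (a hedged diagnostic sketch, not something from my original attempts): once any Spark connection exists, ask the JVM directly whether the Sparkling Water entry class is loadable.

library(sparklyr)
# connect however works in your environment; on Databricks, method = "databricks"
sc <- sparklyr::spark_connect(method = "databricks")
# throws the same ClassNotFoundException if the Sparkling Water jar is missing
# from the driver classpath; returns a reference to the Class object if present
sparklyr::invoke_static(sc, "java.lang.Class", "forName", "ai.h2o.sparkling.H2OConf")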
What I tried
Setup: single-node Azure Databricks cluster, 7.6 ML (includes Apache Spark 3.0.1, Scala 2.12) with a "Standard_F4s" driver (my use case is multi-node, but I am trying to keep things simple).
- Setting options(), e.g. options(rsparkling.sparklingwater.version = "2.3.11") or options(rsparkling.sparklingwater.version = "3.0.1")
- Setting config, e.g. conf$`sparklyr.shell.jars` <- c("/databricks/spark/R/lib/h2o/java/h2o.jar") or sc <- sparklyr::spark_connect(method = "databricks", version = "3.0.1", config = conf, jars = c("/databricks/spark/R/lib/h2o/java/h2o.jar")) (or "~/R/x86_64-pc-linux-gnu-library/3.6/h2o/java/h2o.jar" or "~/R/x86_64-pc-linux-gnu-library/3.6/rsparkling/java/sparkling_water_assembly.jar" as the .jar location in RStudio on Databricks); a consolidated sketch of these attempts appears after this list
- Following the instructions here: http://docs.h2o.ai/sparkling-water/3.0/latest-stable/doc/deployment/rsparkling_azure_dbc.html ("For Sparkling Water 3.32.1.1-1-3.0 select Spark 3.0.2"). Spark 3.0.2 was not available as a cluster runtime, so I selected 3.0.1 for the rest of my attempts. This attempt ended with:
Error in h2o_context(sc) : could not find function "h2o_context"
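For reference, a consolidated sketch of the config-based attempts above (every variant still failed with the ClassNotFoundException at the top of this question; the paths are from my setup and may differ on yours):

library(sparklyr)
conf <- sparklyr::spark_config()
# point the sparklyr shell at the jar shipped inside the installed h2o R package
conf$`sparklyr.shell.jars` <- c("/databricks/spark/R/lib/h2o/java/h2o.jar")
sc <- sparklyr::spark_connect(
  method  = "databricks",
  version = "3.0.1",
  config  = conf,
  jars    = c("/databricks/spark/R/lib/h2o/java/h2o.jar")
)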
Dockerfile that works on my local machine
# get the base image (https://hub.docker.com/r/databricksruntime/standard; https://github.com/databricks/containers/blob/master/ubuntu/standard/Dockerfile)
FROM databricksruntime/standard
# not needed if using `FROM databricksruntime/r-base:latest` at top
ENV DEBIAN_FRONTEND noninteractive
# install Linux dependencies
RUN . /etc/environment \
&& apt-get update \
&& apt-get install libcurl4-openssl-dev libxml2-dev libssl-dev -y \
# not needed if using `FROM databricksruntime/r-base:latest` at top
&& apt-get install r-base -y
# install specific R packages
RUN R -e 'install.packages(c("httr", "xml2"))'
# sparklyr and Spark
RUN R -e 'install.packages(c("sparklyr"))'
# h2o
# RSparkling 3.32.0.5-1-3.0 requires H2O of version 3.32.0.5.
RUN R -e 'install.packages(c("statmod", "RCurl"))'
RUN R -e 'install.packages("h2o", type = "source", repos = "http://h2o-release.s3.amazonaws.com/h2o/rel-zermelo/5/R")'
# rsparkling
# RSparkling 3.32.0.5-1-3.0 is built for 3.0.
RUN R -e 'install.packages("rsparkling", type = "source", repos = "http://h2o-release.s3.amazonaws.com/sparkling-water/spark-3.0/3.32.0.5-1-3.0/R")'
# connect to H2O cluster with Sparkling Water context
RUN R -e 'library(sparklyr); \
sparklyr::spark_install("3.0.1", hadoop_version = "3.2"); \
Sys.setenv(SPARK_HOME = "~/spark/spark-3.0.1-bin-hadoop3.2"); \
library(rsparkling); \
sc <- sparklyr::spark_connect(method = "databricks", version = "3.0.1"); \
sparklyr::spark_version(sc); \
h2oConf <- H2OConf(); \
hc <- H2OContext.getOrCreate(h2oConf)'
In my case, I needed to install a "Library" to my Databricks workspace, cluster, or job. I could either upload it or have Databricks fetch it from Maven coordinates.
In the Databricks workspace (a scripted alternative using the REST API follows this list):
- Click the Home icon
- Click "Shared" > "Create" > "Library"
- Click "Maven" (as the "Library Source")
- Click the "Search Packages" link next to the "Coordinates" box
- Click the dropdown box and select "Maven Central"
- Enter ai.h2o.sparkling-water-package in the "Query" box
- Select the most recent "Artifact Id" and "Release" that matches your rsparkling version; for me that was ai.h2o:sparkling-water-package_2.12:3.32.0.5-1-3.0
- Click "Select" under "Options"
- Click "Create" to create the library
- Thankfully, this required no changes to my Databricks R notebook when run as a Databricks job
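If you prefer to script this instead of clicking through the UI, the same Maven library can be attached with the Databricks Libraries REST API. This is a sketch using httr (already installed above); it assumes the /api/2.0/libraries/install endpoint, and the workspace URL, personal access token, and cluster ID are placeholders you must fill in:

library(httr)

databricks_host  <- "https://<your-workspace>.azuredatabricks.net"  # placeholder
databricks_token <- Sys.getenv("DATABRICKS_TOKEN")                  # personal access token
cluster_id       <- "<your-cluster-id>"                             # placeholder

resp <- httr::POST(
  url  = paste0(databricks_host, "/api/2.0/libraries/install"),
  httr::add_headers(Authorization = paste("Bearer", databricks_token)),
  body = list(
    cluster_id = cluster_id,
    libraries  = list(
      list(maven = list(
        coordinates = "ai.h2o:sparkling-water-package_2.12:3.32.0.5-1-3.0"
      ))
    )
  ),
  encode = "json"
)
httr::status_code(resp)  # expect 200 on success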
# install specific R packages
install.packages(c("httr", "xml2"))
# sparklyr and Spark
install.packages(c("sparklyr"))
# h2o
# RSparkling 3.32.0.5-1-3.0 requires H2O of version 3.32.0.5.
install.packages(c("statmod", "RCurl"))
install.packages("h2o", type = "source", repos = "http://h2o-release.s3.amazonaws.com/h2o/rel-zermelo/5/R")
# rsparkling
# RSparkling 3.32.0.5-1-3.0 is built for 3.0.
install.packages("rsparkling", type = "source", repos = "http://h2o-release.s3.amazonaws.com/sparkling-water/spark-3.0/3.32.0.5-1-3.0/R")
# connect to H2O cluster with Sparkling Water context
library(sparklyr)
sparklyr::spark_install("3.0.1", hadoop_version = "3.2")
Sys.setenv(SPARK_HOME = "~/spark/spark-3.0.1-bin-hadoop3.2")
sparklyr::spark_default_version()
library(rsparkling)
SparkR::sparkR.session()
sc <- sparklyr::spark_connect(method = "databricks", version = "3.0.1")
sparklyr::spark_version(sc)
# next command will not work without adding https://mvnrepository.com/artifact/ai.h2o/sparkling-water-package_2.12/3.32.0.5-1-3.0 file as "Library" to Databricks cluster
h2oConf <- H2OConf()
hc <- H2OContext.getOrCreate(h2oConf)
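As a quick smoke test (my addition, not part of the original notebook; mtcars is purely illustrative data), you can round-trip a Spark DataFrame into H2O once the context is up:

# assumes sc and hc from above; hc$asH2OFrame() converts a Spark DataFrame into an H2OFrame
mtcars_tbl <- dplyr::copy_to(sc, mtcars, overwrite = TRUE)
mtcars_hf  <- hc$asH2OFrame(mtcars_tbl)
h2o::h2o.describe(mtcars_hf)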