Can sparklyr be used with Spark deployed on a YARN-managed Hadoop cluster?
Is the sparklyr R package able to connect to YARN-managed Hadoop clusters? This doesn't seem to be documented in the cluster deployment documentation. Using the SparkR package that ships with Spark, it can be done as follows:
# set R environment variables
Sys.setenv(YARN_CONF_DIR=...)
Sys.setenv(SPARK_CONF_DIR=...)
Sys.setenv(LD_LIBRARY_PATH=...)
Sys.setenv(SPARKR_SUBMIT_ARGS=...)
sparkr_lib_dir <- ... # installation specific
library(SparkR, lib.loc = c(sparkr_lib_dir, .libPaths()))
sc <- sparkR.init(master = "yarn-client")
However, when I swap the last couple of lines above for
library(sparklyr)
sc <- spark_connect(master = "yarn-client")
I get the error:
Error in start_shell(scon, list(), jars, packages) :
Failed to launch Spark shell. Ports file does not exist.
Path: /usr/hdp/2.4.2.0-258/spark/bin/spark-submit
Parameters: '--packages' 'com.databricks:spark-csv_2.11:1.3.0,com.amazonaws:aws-java-sdk-pom:1.10.34' '--jars' '<path to R lib>/3.2/sparklyr/java/rspark_utils.jar' sparkr-shell /tmp/RtmpT31OQT/filecfb07d7f8bfd.out
Ivy Default Cache set to: /home/mpollock/.ivy2/cache
The jars for the packages stored in: /home/mpollock/.ivy2/jars
:: loading settings :: url = jar:file:<path to spark install>/lib/spark-assembly-1.6.1.2.4.2.0-258-hadoop2.7.1.2.4.2.0-258.jar!/org/apache/ivy/core/settings/ivysettings.xml
com.databricks#spark-csv_2.11 added as a dependency
com.amazonaws#aws-java-sdk-pom added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
confs: [default]
:: resolution report :: resolve 480ms :: artifacts dl 0ms
:: modules in use:
-----------------------------------------
Is sparklyr a replacement for the SparkR package, or is it built on top of it?
Yes, sparklyr can be used against a YARN-managed cluster. In order to connect to a YARN-managed cluster one needs to:
- Set the SPARK_HOME environment variable to point at the correct Spark home directory.
- Connect to the Spark cluster using the appropriate master location, for example (a fuller sketch follows below):
sc <- spark_connect(master = "yarn-client")
Are you perhaps using Cloudera Hadoop (CDH)?
I'm asking because I ran into the same issue when using the CDH-provided Spark distribution:
Sys.getenv('SPARK_HOME')
[1] "/usr/lib/spark" # CDH-provided Spark
library(sparklyr)
sc <- spark_connect(master = "yarn-client")
Error in sparkapi::start_shell(master = master, spark_home = spark_home, :
Failed to launch Spark shell. Ports file does not exist.
Path: /usr/lib/spark/bin/spark-submit
Parameters: --jars, '/u01/app/oracle/product/12.1.0.2/dbhome_1/R/library/sparklyr/java/sparklyr.jar', --packages, 'com.databricks:spark-csv_2.11:1.3.0','com.amazonaws:aws-java-sdk-pom:1.10.34', sparkr-shell, /tmp/Rtmp6RwEnV/file307975dc1ea0.out
Ivy Default Cache set to: /home/oracle/.ivy2/cache
The jars for the packages stored in: /home/oracle/.ivy2/jars
:: loading settings :: url = jar:file:/usr/lib/spark/lib/spark-assembly-1.6.0-cdh5.7.0-hadoop2.6.0-cdh5.7.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
com.databricks#spark-csv_2.11 added as a dependency
com.amazonaws#aws-java-sdk-pom added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
confs: [default]
found com.databricks#spark-csv_2.11;1.3.0 in central
found org.apache.commons#commons-csv;1.1 in central
found com.univocity#univocity-parsers;1.5.1 in central
found com.
However, after downloading a pre-built version from Databricks (Spark 1.6.1, Hadoop 2.6) and pointing SPARK_HOME at it, I was able to connect successfully:
Sys.setenv(SPARK_HOME = '/home/oracle/spark-1.6.1-bin-hadoop2.6')
sc <- spark_connect(master = "yarn-client") # OK
library(dplyr)
iris_tbl <- copy_to(sc, iris)
src_tbls(sc)
[1] "iris"
Cloudera does not yet include SparkR in its distribution, and I suspect that sparklyr may still have some subtle dependency on SparkR. Here are the results when trying to work with the CDH-provided Spark, but using the config=list() argument, as suggested in this thread from the sparklyr issues at GitHub:
sc <- spark_connect(master='yarn-client', config=list()) # with CDH-provided Spark
Error in sparkapi::start_shell(master = master, spark_home = spark_home, :
Failed to launch Spark shell. Ports file does not exist.
Path: /usr/lib/spark/bin/spark-submit
Parameters: --jars, '/u01/app/oracle/product/12.1.0.2/dbhome_1/R/library/sparklyr/java/sparklyr.jar', sparkr-shell, /tmp/Rtmpi9KWFt/file22276cf51d90.out
Error: sparkr.zip does not exist for R application in YARN mode.
Also, if you check the rightmost part of the Parameters section of the error (yours and mine), you'll see a reference to sparkr-shell...
(Tested with sparklyr 0.2.28, sparkapi 0.3.15, and an R session from RStudio Server on Oracle Linux.)
For this issue, it is recommended to upgrade to sparklyr version 0.2.30 or newer. Upgrade using devtools::install_github("rstudio/sparklyr") and then restart the R session.
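For reference, a minimal upgrade sketch (assuming devtools is available from CRAN):
# Install devtools from CRAN if it is not already available
install.packages("devtools")

# Install the newer sparklyr from GitHub, then restart the R session
devtools::install_github("rstudio/sparklyr")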
Yes it can, but there is one catch to everything else that has been written, which is very elusive in the blogging literature, and it centers around configuring resources.
The key is this: when you execute it in local mode you do not have to configure the resources declaratively, but when you execute in a YARN cluster, you absolutely must declare those resources. It took me a long time to find the article that shed some light on this issue, but once I tried it, it worked.
Here's an (arbitrary) example with the key inclusions:
config <- spark_config()
config$spark.driver.cores <- 32
config$spark.executor.cores <- 32
config$spark.executor.memory <- "40g"
library(sparklyr)

# Point at the local Spark installation and the Hadoop/YARN configuration directories
Sys.setenv(SPARK_HOME = "/usr/local/spark")
Sys.setenv(HADOOP_CONF_DIR = '/usr/local/hadoop/etc/hadoop/conf')
Sys.setenv(YARN_CONF_DIR = '/usr/local/hadoop/etc/hadoop/conf')

# Declare the YARN resources explicitly
config <- spark_config()
config$spark.executor.instances <- 4
config$spark.executor.cores <- 4
config$spark.executor.memory <- "4G"

sc <- spark_connect(master = "yarn-client", config = config, version = '2.1.0')