spark.master configuration via REST job submission in standalone cluster is ignored
I have a standalone Spark cluster in HA mode (2 masters) with several workers registered to it.
I submitted a Spark job over the REST interface with the following details:
{
  "sparkProperties": {
    "spark.app.name": "TeraGen3",
    "spark.default.parallelism": "40",
    "spark.executor.memory": "512m",
    "spark.driver.memory": "512m",
    "spark.task.maxFailures": "3",
    "spark.jars": "file:///tmp//test//spark-terasort-1.1-SNAPSHOT-jar-with-dependencies.jar",
    "spark.eventLog.enabled": "false",
    "spark.submit.deployMode": "cluster",
    "spark.driver.supervise": "true",
    "spark.master": "spark://spark-hn0:7077,spark-hn1:7077"
  },
  "mainClass": "com.github.ehiggs.spark.terasort.TeraGen",
  "environmentVariables": {
    "SPARK_ENV_LOADED": "1"
  },
  "action": "CreateSubmissionRequest",
  "appArgs": ["4g", "file:///tmp/data/teradata4g/"],
  "appResource": "file:///tmp//test//spark-terasort-1.1-SNAPSHOT-jar-with-dependencies.jar",
  "clientSparkVersion": "2.1.1"
}
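For context, a payload like the one above is POSTed to the active master's REST endpoint. A minimal Python sketch of building and submitting such a request (the hostnames, jar path, and property values are taken from the example; this is my own helper, not part of any Spark client library):

```python
import json
from urllib import request

def build_submission(app_jar, main_class, app_args, masters):
    """Build a CreateSubmissionRequest payload for the standalone REST API."""
    return {
        "action": "CreateSubmissionRequest",
        "clientSparkVersion": "2.1.1",
        "appResource": app_jar,
        "mainClass": main_class,
        "appArgs": app_args,
        "environmentVariables": {"SPARK_ENV_LOADED": "1"},
        "sparkProperties": {
            "spark.app.name": "TeraGen3",
            "spark.submit.deployMode": "cluster",
            "spark.driver.supervise": "true",
            "spark.jars": app_jar,
            # Full HA master list -- this is the value the REST server
            # ends up overwriting, as described below.
            "spark.master": masters,
        },
    }

def submit(rest_url, payload):
    """POST the payload to the active master's REST endpoint,
    e.g. http://spark-hn1:6066/v1/submissions/create."""
    req = request.Request(
        rest_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.load(resp)
```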
This request was submitted over the REST interface to the active Spark master (http://spark-hn1:6066/v1/submissions/create).
When the driver was launched, -Dspark.master was set to "spark://spark-hn1:7077" instead of the value passed in sparkProperties, "spark://spark-hn0:7077,spark-hn1:7077".
Logs from the worker node where the driver ran:
17/12/18 13:29:49 INFO worker.DriverRunner: Launch Command: "/usr/lib/jvm/java-8-openjdk-amd64/bin/java" "-Dhdp.version=2.6.99.200-0" "-cp" "/usr/hdp/current/spark2-client/conf/:/usr/hdp/current/spark2-client/jars/*:/etc/hadoop/conf/" "-Xmx512M" "-Dspark.driver.memory=512m" "-Dspark.master=spark://spark-hn1:7077" "-Dspark.executor.memory=512m" "-Dspark.submit.deployMode=cluster" "-Dspark.app.name=TeraGen3" "-Dspark.default.parallelism=40" "-Dspark.jars=file:///tmp//test//spark-terasort-1.1-SNAPSHOT-jar-with-dependencies.jar" "-Dspark.task.maxFailures=3" "-Dspark.driver.supervise=true" "-Dspark.eventLog.enabled=false" "org.apache.spark.deploy.worker.DriverWrapper" "spark://Worker@172.18.0.4:40803" "/var/spark/work/driver-20171218132949-0001/spark-terasort-1.1-SNAPSHOT-jar-with-dependencies.jar" "com.github.ehiggs.spark.terasort.TeraGen" "4g" "file:///tmp/data/teradata4g/"
This causes a problem when the active master goes down during job execution and the other master becomes active. Because the driver only knows about one master (the old one), it cannot contact the new master to continue the job (which matters because spark.driver.supervise=true).
What is the correct way to pass multiple master URLs over the Spark REST interface?
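For reference, the standalone HA master URL packs all endpoints into a single spark:// URL, which is exactly what is lost when the server substitutes a single host. A hypothetical helper illustrating the format (this is just an illustration of the URL convention, not Spark's internal parsing code):

```python
def parse_masters(master_url):
    """Split an HA master URL like spark://h1:7077,h2:7077 into the
    individual endpoints a driver would need for failover."""
    prefix = "spark://"
    if not master_url.startswith(prefix):
        raise ValueError("not a spark:// URL: " + master_url)
    hosts = master_url[len(prefix):].split(",")
    return [prefix + h for h in hosts]
```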
This looks like a bug in the REST server implementation, where spark.master is overwritten:
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/rest/StandaloneRestServer.scala#L147
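The effect of the linked code can be sketched as follows (a Python paraphrase of the observed behavior, not the actual Scala source): whatever spark.master the client sends, the server replaces it with the single URL of the master that received the request.

```python
def server_side_rewrite(spark_properties, active_master_url):
    """Illustration of the observed rewrite: the client's HA master list
    is clobbered by the single URL of the master handling the request."""
    props = dict(spark_properties)
    props["spark.master"] = active_master_url
    return props
```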
We can still work around this by setting spark.master inside spark.driver.extraJavaOptions when submitting the job over the REST interface, as shown below:
"sparkProperties": {
  "spark.app.name": "TeraGen3",
  ...
  "spark.driver.extraJavaOptions": "-Dspark.master=spark://spark-hn0:7077,spark-hn1:7077"
}
This worked for me.
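The workaround can be applied mechanically before submission. A small sketch (my own helper, assuming the sparkProperties dict from the request above) that injects the full HA master list into the driver JVM options, where the REST server does not touch it:

```python
def add_master_override(spark_properties, masters):
    """Work around the REST server rewriting spark.master by passing the
    full HA master list to the driver JVM via extraJavaOptions, which the
    server forwards untouched."""
    props = dict(spark_properties)
    existing = props.get("spark.driver.extraJavaOptions", "")
    override = "-Dspark.master=" + masters
    props["spark.driver.extraJavaOptions"] = (existing + " " + override).strip()
    return props
```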