Uber jar not found in Kubernetes via spark-submit
I have a very simple Spark job that I cannot get to run in Kubernetes. The error I get is:
> 19/10/03 14:59:51 WARN DependencyUtils: Local jar /opt/spark/work-dir/target/scala-2.11/ScalaTest-assembly-1.0.jar does
> not exist, skipping.
> 19/10/03 14:59:51 WARN SparkSubmit$$anon: Failed to load ScalaTest.
> java.lang.ClassNotFoundException: ScalaTest
> at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
> at java.lang.Class.forName0(Native Method)
> at java.lang.Class.forName(Class.java:348)
> at org.apache.spark.util.Utils$.classForName(Utils.scala:238)
> at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:806)
> at org.apache.spark.deploy.SparkSubmit.doRunMain(SparkSubmit.scala:161)
> at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
> at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
> at org.apache.spark.deploy.SparkSubmit$$anon.doSubmit(SparkSubmit.scala:920)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:929)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Project structure:
project/build.properties
project/plugins.sbt
src/main/scala/ScalaTest.scala
Dockerfile
build.sbt
build.properties
sbt.version=1.2.8
plugins.sbt
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.6")
addSbtPlugin("net.virtual-void" % "sbt-dependency-graph" % "0.10.0-RC1")
ScalaTest.scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.SparkContext
object ScalaTest {
  def main(args: Array[String]) {
    val spark = SparkSession.builder.appName("ScalaTest").config("spark.master", "local[*]").getOrCreate()
    import spark.implicits._
    println("hello")
  }
}
Dockerfile
This is just a wrapper image on top of the one built from the kubernetes folder that ships with the Spark binaries. Before building this image I made sure I had run sbt assembly to generate the uber jar.
FROM spark:latest
WORKDIR /opt/spark/work-dir
COPY target/scala-2.11/ScalaTest-assembly-1.0.jar target/scala-2.11/ScalaTest-assembly-1.0.jar
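For context, the build sequence is roughly the sketch below; spark:latest is assumed to be whatever tag the docker-image-tool.sh build from the Spark distribution produced, and my-custom-image is a placeholder:
# Build the base image from the kubernetes dockerfiles shipped with the Spark binaries
~/spark-2.4.4-bin-hadoop2.7/bin/docker-image-tool.sh -t latest build
# Produce the uber jar, then build the wrapper image defined above
sbt assembly
docker build -t my-custom-image:latest .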
build.sbt
name := "ScalaTest"
version := "1.0"
scalaVersion := "2.11.12"
val sparkVersion = "2.4.4"
libraryDependencies ++= Seq(
"org.apache.spark" % "spark-core_2.11" % sparkVersion % "provided",
"org.apache.spark" % "spark-sql_2.11" % sparkVersion % "provided"
)
Finally, my spark-submit. Before running it I push the image to my registry in ECR so that EKS can pull it, and I point to the location of the uber jar inside my image.
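The push itself is the standard ECR flow, sketched below with placeholder account id, region and repository names (depending on the AWS CLI version, aws ecr get-login may be needed instead of get-login-password):
# Hypothetical registry values, substitute your own
aws ecr get-login-password --region eu-west-1 | docker login --username AWS --password-stdin 123456789012.dkr.ecr.eu-west-1.amazonaws.com
docker tag my-custom-image:latest 123456789012.dkr.ecr.eu-west-1.amazonaws.com/my-custom-image:latest
docker push 123456789012.dkr.ecr.eu-west-1.amazonaws.com/my-custom-image:latest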
~/spark-2.4.4-bin-hadoop2.7/bin/spark-submit \
--master k8s://{K8S_ENDPOINT}:443 \
--deploy-mode cluster \
--name test-job \
--conf spark.kubernetes.container.image={ECR_IMAGE}:latest \
--conf spark.kubernetes.submission.waitAppCompletion=false \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
--conf spark.kubernetes.driver.pod.name=test-job \
--class ScalaTest \
local:///opt/spark/work-dir/target/scala-2.11/ScalaTest-assembly-1.0.jar
Also note that when I run the command below (spark-submit locally, inside my container), it works as expected:
docker run --rm -it my-custom-image ../bin/spark-submit target/scala-2.11/ScalaTest-assembly-1.0.jar
UPDATE
Inspecting the assembled uber jar, I can see that the ScalaTest class is there.
jar tf target/scala-2.11/ScalaTest-assembly-1.0.jar
...
ScalaTest$.class
ScalaTest.class
...
The solution to this was to not keep the jar in the working directory but in the jars folder instead. I haven't checked the docs, but this is probably controlled by an environment variable that can be changed. In any case, the Dockerfile should look like this:
FROM spark:latest
COPY target/scala-2.11/ScalaTest-assembly-1.0.jar /opt/spark/jars/ScalaTest-assembly-1.0.jar
Then change the spark-submit accordingly:
~/spark-2.4.4-bin-hadoop2.7/bin/spark-submit \
--master k8s://{K8S_ENDPOINT}:443 \
--deploy-mode cluster \
--name test-job \
--conf spark.kubernetes.container.image={ECR_IMAGE}:latest \
--conf spark.kubernetes.submission.waitAppCompletion=false \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
--conf spark.kubernetes.driver.pod.name=test-job \
--class ScalaTest \
local:///opt/spark/jars/ScalaTest-assembly-1.0.jar
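A quick sanity check before submitting is to list the jars folder inside the image (a sketch; my-custom-image is the local tag of the image built above):
# Confirm the assembly actually sits in the image's jars folder
docker run --rm my-custom-image ls /opt/spark/jars | grep ScalaTest
As far as I can tell, everything under /opt/spark/jars ends up on the driver's classpath via the container entrypoint, which is why the class resolves from there.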