GCP Dataproc spark.jars.packages issue downloading dependencies
When creating our Dataproc Spark cluster, we pass
--properties spark:spark.jars.packages=mysql:mysql-connector-java:6.0.6
to the gcloud dataproc clusters create command.
This is so that our PySpark scripts can save to CloudSQL.
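For context, here is a minimal sketch of the kind of PySpark write that needs the MySQL connector on the classpath. It is not the asker's script: the helper names, host, database, table and credentials are hypothetical, and the driver class is assumed to be the one shipped with mysql-connector-java 6.x.

```python
def cloudsql_jdbc_options(host, database, table, user, password, port=3306):
    """Build the option map for Spark's JDBC data source (hypothetical helper)."""
    return {
        "url": f"jdbc:mysql://{host}:{port}/{database}",
        "dbtable": table,
        "user": user,
        "password": password,
        # Assumed driver class for mysql-connector-java 6.x.
        "driver": "com.mysql.cj.jdbc.Driver",
    }


def save_to_cloudsql(df, options, mode="append"):
    """Write a Spark DataFrame through the JDBC data source.

    `df` is expected to be a pyspark.sql.DataFrame; nothing here imports
    pyspark, so the function only runs end to end inside a Spark job.
    """
    df.write.format("jdbc").options(**options).mode(mode).save()
```

Inside a job this would be called as save_to_cloudsql(df, cloudsql_jdbc_options(...)). The write only succeeds if the connector jar was actually resolved, which is where the failure described next comes in.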
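Obviously this does nothing at cluster creation time, but on the first spark-submit Spark tries to resolve this dependency.
Technically it appears to resolve and download the necessary jar file, but the first task on the cluster fails because of the warnings emitted by spark-submit: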
Exception in thread "main" java.lang.RuntimeException: [download failed: mysql#mysql-connector-java;6.0.6!mysql-connector-java.jar]
at org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:1177)
at org.apache.spark.deploy.SparkSubmit$.prepareSubmitEnvironment(SparkSubmit.scala:298)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:153)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
The full output is:
Ivy Default Cache set to: /root/.ivy2/cache
The jars for the packages stored in: /root/.ivy2/jars
:: loading settings :: url = jar:file:/usr/lib/spark/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
mysql#mysql-connector-java added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
confs: [default]
found mysql#mysql-connector-java;6.0.6 in central
downloading https://repo1.maven.org/maven2/mysql/mysql-connector-java/6.0.6/mysql-connector-java-6.0.6.jar ...
:: resolution report :: resolve 527ms :: artifacts dl 214ms
:: modules in use:
mysql#mysql-connector-java;6.0.6 from central in [default]
---------------------------------------------------------------------
| | modules || artifacts |
| conf | number| search|dwnlded|evicted|| number|dwnlded|
---------------------------------------------------------------------
| default | 1 | 1 | 1 | 0 || 1 | 0 |
---------------------------------------------------------------------
:: problems summary ::
:::: WARNINGS
[FAILED ] mysql#mysql-connector-java;6.0.6!mysql-connector-java.jar: Downloaded file size doesn't match expected Content Length for https://repo1.maven.org/maven2/mysql/mysql-connector-java/6.0.6/mysql-connector-java-6.0.6.jar. Please retry. (212ms)
[FAILED ] mysql#mysql-connector-java;6.0.6!mysql-connector-java.jar: Downloaded file size doesn't match expected Content Length for https://repo1.maven.org/maven2/mysql/mysql-connector-java/6.0.6/mysql-connector-java-6.0.6.jar. Please retry. (212ms)
==== central: tried
https://repo1.maven.org/maven2/mysql/mysql-connector-java/6.0.6/mysql-connector-java-6.0.6.jar
::::::::::::::::::::::::::::::::::::::::::::::
:: FAILED DOWNLOADS ::
:: ^ see resolution messages for details ^ ::
::::::::::::::::::::::::::::::::::::::::::::::
:: mysql#mysql-connector-java;6.0.6!mysql-connector-java.jar
::::::::::::::::::::::::::::::::::::::::::::::
However, subsequent tasks on the cluster show this output:
Ivy Default Cache set to: /root/.ivy2/cache
The jars for the packages stored in: /root/.ivy2/jars
:: loading settings :: url = jar:file:/usr/lib/spark/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
mysql#mysql-connector-java added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
confs: [default]
found mysql#mysql-connector-java;6.0.6 in central
:: resolution report :: resolve 224ms :: artifacts dl 5ms
:: modules in use:
mysql#mysql-connector-java;6.0.6 from central in [default]
---------------------------------------------------------------------
| | modules || artifacts |
| conf | number| search|dwnlded|evicted|| number|dwnlded|
---------------------------------------------------------------------
| default | 1 | 0 | 0 | 0 || 1 | 0 |
---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent
confs: [default]
0 artifacts copied, 1 already retrieved (0kB/7ms)
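So my questions are:
- What is the cause of this, and can the good people at GCP fix it?
- Is there a temporary workaround, other than running a dummy task that is allowed to fail when the cluster starts?
How consistently can you reproduce this? After trying to reproduce it with different cluster settings, my best theory is that an overloaded server is returning 5xx errors.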
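One way to gauge how consistently the download fails is to fetch the jar several times and count the bad attempts. The sketch below does that with Python's standard library, under the assumption that it runs on a cluster node with the same network path as spark-submit. The function names are hypothetical, and this only approximates Ivy's own size check.

```python
import http.client
import urllib.request

JAR_URL = ("https://repo1.maven.org/maven2/mysql/mysql-connector-java/"
           "6.0.6/mysql-connector-java-6.0.6.jar")


def fetch_and_check(url, opener=urllib.request.urlopen):
    """Download url and return (bytes_received, declared_content_length)."""
    with opener(url) as response:
        declared = response.headers.get("Content-Length")
        body = response.read()
    return len(body), (int(declared) if declared is not None else None)


def count_failures(url, attempts=5, opener=urllib.request.urlopen):
    """Count attempts that raise or deliver a size other than the declared one."""
    failures = 0
    for _ in range(attempts):
        try:
            received, declared = fetch_and_check(url, opener)
        except (OSError, http.client.HTTPException):
            # URLError/HTTPError (for example a 5xx) derive from OSError;
            # a truncated body surfaces as http.client.IncompleteRead.
            failures += 1
            continue
        if declared is not None and received != declared:
            failures += 1
    return failures
```

Calling count_failures(JAR_URL, attempts=20) on a freshly created cluster would give a rough failure rate. A result of zero would not rule out transient server errors.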
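As far as workarounds go:
1) Download the jar from Maven Central and pass it with the --jars option when submitting the job. If you create new clusters often, it is better to stage this file onto the cluster via an initialization action.
2) Provide an alternate ivy settings file via the spark.jars.ivySettings property, pointing to the Google Maven Central mirror (this should reduce or eliminate the chance of 5xx errors).
See this article: https://www.infoq.com/news/2015/11/maven-central-at-google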
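For the second workaround, the settings file could look roughly like what the sketch below generates. Treat it as an illustration under assumptions rather than a tested Dataproc configuration: the mirror URL is my assumption of where the Google-hosted copy of Maven Central lives, and the resolver names and helper functions are made up for the example.

```python
import xml.etree.ElementTree as ET

# Assumed URL of the Google-hosted Maven Central mirror; verify before use.
GOOGLE_MIRROR = "https://maven-central.storage-download.googleapis.com/maven2/"


def build_ivy_settings(mirror_root=GOOGLE_MIRROR):
    """Return ivysettings XML whose resolver chain lists the mirror before Maven Central."""
    root = ET.Element("ivysettings")
    ET.SubElement(root, "settings", defaultResolver="mirror-then-central")
    resolvers = ET.SubElement(root, "resolvers")
    chain = ET.SubElement(resolvers, "chain",
                          name="mirror-then-central", returnFirst="true")
    # Resolvers in a chain are consulted in the order they are declared.
    ET.SubElement(chain, "ibiblio", name="google-mirror",
                  m2compatible="true", root=mirror_root)
    ET.SubElement(chain, "ibiblio", name="central", m2compatible="true")
    return ET.tostring(root, encoding="unicode")


def write_ivy_settings(path, mirror_root=GOOGLE_MIRROR):
    """Write the generated settings to path (hypothetical helper)."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(build_ivy_settings(mirror_root))
```

The generated file has to exist on the node where spark-submit runs, for instance staged by an initialization action. Its path would then go into spark.jars.ivySettings using the same --properties spark:... form shown in the question. That property is only available in Spark versions that support it (2.2 and later, as far as I know).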