Can run code in pyspark shell but the same code fails when submitted with spark-submit
As you will notice from the question, I am a hobbyist. I am trying to run some very basic code on a Spark cluster (created on Dataproc).
- I SSH into the master node
- I create a pyspark shell with pyspark --master yarn and run the code: this succeeds
- I run the exact same code with spark-submit --master yarn code.py: this fails
I have provided some basic details below. Please do let me know of any other details I could provide that would help.
Details:
Code being run:
testing_dep.py
import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# Distribute 0-99 across 10 partitions, then collect the results back to the driver
rdd = sc.parallelize(range(100), numSlices=10).collect()
print(rdd)
Running the pyspark shell:
pyspark --master yarn
Output:
Python 3.9.7 | packaged by conda-forge | (default, Sep 29 2021, 19:20:46)
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
2022-01-07 20:45:54,608 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2022-01-07 20:45:58,195 WARN shortcircuit.DomainSocketFactory: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded.
2022-01-07 20:45:58,357 WARN yarn.Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/__ / .__/\_,_/_/ /_/\_\ version 3.2.0
/_/
Using Python version 3.9.7 (default, Sep 29 2021 19:20:46)
Spark context Web UI available at http://pyspark32-m.us-central1-b.c.monsoon-credittech.internal:4040
Spark context available as 'sc' (master = yarn, app id = application_1641410203571_0040).
SparkSession available as 'spark'.
>>> import pyspark
>>> from pyspark.sql import SparkSession
>>> spark = SparkSession.builder.getOrCreate()
>>> sc = spark.sparkContext
>>> rdd = sc.parallelize(range(100),numSlices=10).collect()
>>> print(rdd)
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30,
31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59,
60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88
, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99]
Running spark-submit:
spark-submit --master yarn gs://monsoon-credittech.appspot.com/testing_dep.py
Output:
2022-01-07 20:48:37,310 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Exception in thread "main" java.lang.RuntimeException: java.lang.ClassNotFoundException: Class com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem not found
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2667)
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:3431)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3466)
at org.apache.hadoop.fs.FileSystem.access0(FileSystem.java:174)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3574)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3521)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:540)
at org.apache.spark.util.Utils$.getHadoopFileSystem(Utils.scala:1938)
at org.apache.spark.util.Utils$.doFetchFile(Utils.scala:780)
at org.apache.spark.util.DependencyUtils$.downloadFile(DependencyUtils.scala:264)
at org.apache.spark.deploy.SparkSubmit.$anonfun$prepareSubmitEnvironment(SparkSubmit.scala:376)
at scala.Option.map(Option.scala:230)
at org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:376)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:898)
at org.apache.spark.deploy.SparkSubmit.doRunMain(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
at org.apache.spark.deploy.SparkSubmit$$anon.doSubmit(SparkSubmit.scala:1043)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1052)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: Class com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem not found
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2571)
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2665)
... 19 more
I think the error message is clear:
Class com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem not found
You need to add the JAR file containing the above class to the SPARK_CLASSPATH, as sketched below.
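A minimal sketch of that fix, assuming a Dataproc image that ships the GCS connector under /usr/lib/hadoop/lib (the exact JAR path and file name are assumptions; locate yours first):

# Find the GCS connector JAR on the master node (name/version vary by image)
ls /usr/lib/hadoop/lib/gcs-connector*.jar
# Put it on the driver classpath so spark-submit itself can resolve the gs:// path.
# SPARK_CLASSPATH is the legacy knob the answer refers to; --driver-class-path is
# the modern spark-submit equivalent.
export SPARK_CLASSPATH=/usr/lib/hadoop/lib/gcs-connector.jar
spark-submit --master yarn gs://monsoon-credittech.appspot.com/testing_dep.py

Note that the stack trace fails inside spark-submit's own dependency download (DependencyUtils.downloadFile), which is why the pyspark shell works while spark-submit with a gs:// script does not: the connector must be visible to the launcher/driver classpath, not just to the executors.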
See Issues Google Cloud Storage connector on Spark or DataProc for the complete solution.
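If editing the classpath is not an option, two workaround sketches using the standard Google Cloud CLI tools (CLUSTER_NAME and REGION are placeholders for your cluster's values):

# Workaround 1: copy the script out of GCS and submit a local path,
# so spark-submit never has to resolve gs:// itself
gsutil cp gs://monsoon-credittech.appspot.com/testing_dep.py .
spark-submit --master yarn testing_dep.py

# Workaround 2: submit through the Dataproc jobs API, which handles gs:// URIs natively
gcloud dataproc jobs submit pyspark gs://monsoon-credittech.appspot.com/testing_dep.py \
    --cluster=CLUSTER_NAME --region=REGION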