
Can run code in pyspark shell but the same code fails when submitted with spark-submit

As you will notice from the question, I am a hobbyist. I am trying to run some very basic code on a Spark cluster (created on Dataproc).

  1. I SSH into the master

I have provided some basic details below. Please do let me know of any other details I can provide to help.

Details:

Code being run:

testing_dep.py

import pyspark
from pyspark.sql import SparkSession

# Get (or create) a SparkSession and grab its SparkContext
spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# Distribute 0..99 across 10 partitions, collect back to the driver, and print
rdd = sc.parallelize(range(100), numSlices=10).collect()
print(rdd)

Running the pyspark shell:

pyspark --master yarn

Output:

Python 3.9.7 | packaged by conda-forge | (default, Sep 29 2021, 19:20:46) 
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
2022-01-07 20:45:54,608 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2022-01-07 20:45:58,195 WARN shortcircuit.DomainSocketFactory: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded.
2022-01-07 20:45:58,357 WARN yarn.Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.2.0
      /_/
Using Python version 3.9.7 (default, Sep 29 2021 19:20:46)
Spark context Web UI available at http://pyspark32-m.us-central1-b.c.monsoon-credittech.internal:4040
Spark context available as 'sc' (master = yarn, app id = application_1641410203571_0040).
SparkSession available as 'spark'.
>>> import pyspark
>>> from pyspark.sql import SparkSession
>>> spark = SparkSession.builder.getOrCreate()
>>> sc = spark.sparkContext
>>> rdd = sc.parallelize(range(100),numSlices=10).collect()
>>> print(rdd)                                                                  
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99]

Running spark-submit:

spark-submit --master yarn gs://monsoon-credittech.appspot.com/testing_dep.py

Output:

2022-01-07 20:48:37,310 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Exception in thread "main" java.lang.RuntimeException: java.lang.ClassNotFoundException: Class com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem not found
        at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2667)
        at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:3431)
        at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3466)
        at org.apache.hadoop.fs.FileSystem.access0(FileSystem.java:174)
        at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3574)
        at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3521)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:540)
        at org.apache.spark.util.Utils$.getHadoopFileSystem(Utils.scala:1938)
        at org.apache.spark.util.Utils$.doFetchFile(Utils.scala:780)
        at org.apache.spark.util.DependencyUtils$.downloadFile(DependencyUtils.scala:264)
        at org.apache.spark.deploy.SparkSubmit.$anonfun$prepareSubmitEnvironment(SparkSubmit.scala:376)
        at scala.Option.map(Option.scala:230)
        at org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:376)
        at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:898)
        at org.apache.spark.deploy.SparkSubmit.doRunMain(SparkSubmit.scala:180)
        at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
        at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
        at org.apache.spark.deploy.SparkSubmit$$anon.doSubmit(SparkSubmit.scala:1043)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1052)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: Class com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem not found
        at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2571)
        at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2665)
        ... 19 more

I think the error message is quite clear:

Class com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem not found

You need to add the JAR file containing the above class to the SPARK_CLASSPATH.
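On a Dataproc cluster the GCS connector JAR normally ships with the image, so one option is to point spark-submit at it explicitly. Below is a minimal sketch; the connector path is an assumption and varies by image version (look for gcs-connector*.jar under /usr/lib/hadoop/lib/ or /usr/local/share/google/dataproc/lib/):

# Path to the GCS connector JAR; an assumed location, adjust for your image
GCS_JAR=/usr/local/share/google/dataproc/lib/gcs-connector.jar

# --driver-class-path puts the connector on the JVM that has to download the
# gs:// application file; --jars also ships it to the executors
spark-submit --master yarn \
  --driver-class-path "$GCS_JAR" \
  --jars "$GCS_JAR" \
  gs://monsoon-credittech.appspot.com/testing_dep.py

Copying the JAR into $SPARK_HOME/jars/, or setting spark.driver.extraClassPath in spark-defaults.conf, makes the same fix permanent.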

See Issues Google Cloud Storage connector on Spark (Dataproc) for the complete solution.
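As a workaround that sidesteps the classpath issue entirely, you can also copy the script onto the master first and submit it from the local filesystem, so that spark-submit never has to read gs:// itself:

gsutil cp gs://monsoon-credittech.appspot.com/testing_dep.py .
spark-submit --master yarn testing_dep.py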