Cannot instantiate GoogleHadoopFileSystem in pyspark even when JAR files are present
The same code works on Linux (Ubuntu) with the same jar files. My Spark is 3.1.2 and Hadoop is 3.2. I have tried every gcs-connector version from Maven.
# df is a Spark DataFrame
val = df.write.format('bigquery') \
    .mode(mode) \
    .option("credentialsFile", "creds.json") \
    .option('table', table) \
    .option("temporaryGcsBucket", bucket) \
    .save()
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
21/09/17 07:41:50 WARN FileSystem: Cannot load filesystem: java.util.ServiceConfigurationError: org.apache.hadoop.fs.FileSystem: Provider com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem could not be instantiated
21/09/17 07:41:50 WARN FileSystem: java.lang.NoSuchMethodError: com.google.common.base.Preconditions.checkState(ZLjava/lang/String;J)V
21/09/17 07:41:50 WARN FileSystem: Cannot load filesystem: java.util.ServiceConfigurationError: org.apache.hadoop.fs.FileSystem: Provider com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem could not be instantiated
21/09/17 07:41:50 WARN FileSystem: java.lang.NoClassDefFoundError: Could not initialize class com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem
Traceback (most recent call last):
  File "c:\sparktest\main.py", line 158, in <module>
    val = df.write.format('bigquery') \
  File "c:\sparktest\vnenv\lib\site-packages\pyspark\sql\readwriter.py", line 828, in save
    self._jwrite.save()
  File "c:\sparktest\vnenv\lib\site-packages\py4j\java_gateway.py", line 1304, in __call__
    return_value = get_return_value(
  File "c:\sparktest\vnenv\lib\site-packages\pyspark\sql\utils.py", line 128, in deco
    return f(*a, **kw)
  File "c:\sparktest\vnenv\lib\site-packages\py4j\protocol.py", line 326, in get_return_value
    raise Py4JJavaError(
py4j.protocol.Py4JJavaError: An error occurred while calling o50.save.
: java.lang.NoClassDefFoundError: Could not initialize class com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem
The old jars in the Spark jars folder were not being instantiated correctly; the NoSuchMethodError on com.google.common.base.Preconditions.checkState in the log points to a Guava version clash between the connector and the jars already on the classpath. I had to delete the old jars and pull fresh ones from the Maven repository instead. Below is the code I used.
from pyspark.sql import SparkSession

# Fetch the connector from Maven at startup instead of relying on the jars folder
spark = SparkSession \
    .builder \
    .appName(appName) \
    .config(conf=spark_conf) \
    .config('spark.jars.packages', 'com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.22.0') \
    .getOrCreate()
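Since the failure above boiled down to the gs:// filesystem provider not loading, it is worth verifying the classpath after recreating the session. The sketch below is not part of the original answer, just a sanity check that assumes only the spark session built above:

# Sanity check (sketch): ask Hadoop which class now backs the "gs" scheme.
# With a healthy classpath this prints
# com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem instead of raising
# the ServiceConfigurationError seen in the logs.
sc = spark.sparkContext
fs_class = sc._jvm.org.apache.hadoop.fs.FileSystem.getFileSystemClass(
    "gs", sc._jsc.hadoopConfiguration())
print(fs_class.getName())

If this still fails, a stale gcs-connector jar is most likely still sitting in the Spark jars folder and should be removed, as described above.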