Spark Avro throws exception on file write: NoSuchMethodError
Any attempt to write a file in Avro format fails, with the stack trace below.
We are running Spark 2.4.3 (user-provided Hadoop) with Scala 2.12, and at runtime we load the Avro package either with spark-shell:
spark-shell --packages org.apache.spark:spark-avro_2.12:2.4.3
or with spark-submit:
spark-submit --packages org.apache.spark:spark-avro_2.12:2.4.3 ...
The Spark session reports that the Avro package has loaded successfully.
... In either case, attempting to write any data in avro format, e.g.:
df.write.format("avro").save("hdfs:///path/to/outputfile.avro")
or with a select:
df.select("recordidstring").write.format("avro").save("hdfs:///path/to/outputfile.avro")
... produces the same stack trace error (this copy is from spark-shell):
java.lang.NoSuchMethodError: org.apache.avro.Schema.createUnion([Lorg/apache/avro/Schema;)Lorg/apache/avro/Schema;
at org.apache.spark.sql.avro.SchemaConverters$.toAvroType(SchemaConverters.scala:185)
at org.apache.spark.sql.avro.SchemaConverters$.$anonfun$toAvroType(SchemaConverters.scala:176)
at scala.collection.Iterator.foreach(Iterator.scala:941)
at scala.collection.Iterator.foreach$(Iterator.scala:941)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
at scala.collection.IterableLike.foreach(IterableLike.scala:74)
at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
at org.apache.spark.sql.types.StructType.foreach(StructType.scala:99)
at org.apache.spark.sql.avro.SchemaConverters$.toAvroType(SchemaConverters.scala:174)
at org.apache.spark.sql.avro.AvroFileFormat.$anonfun$prepareWrite(AvroFileFormat.scala:119)
at scala.Option.getOrElse(Option.scala:138)
at org.apache.spark.sql.avro.AvroFileFormat.prepareWrite(AvroFileFormat.scala:118)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:103)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:170)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute(SparkPlan.scala:131)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery(SparkPlan.scala:155)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
at org.apache.spark.sql.DataFrameWriter.$anonfun$runCommand(DataFrameWriter.scala:676)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId(SQLExecution.scala:78)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676)
at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:290)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:229)
We can write other formats (delimited text, json, ORC, parquet) without any trouble.
We are using HDFS (Hadoop v3.1.2) as file storage.
I have tried different versions of the Avro package (e.g. 2.11, lower versions), which either raise the same error or fail to load entirely due to incompatibility. The error occurs from Python, Scala (using the shell or spark-submit) and Java (using spark-submit).
There appears to be an open issue about this on the apache.org JIRA, but it is a year old now with no resolution. I have raised it there, but would also like to know whether the community has a workaround. Any help is greatly appreciated.
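For what it is worth, a quick diagnostic (just a sketch, not part of the original report) is to ask the JVM in spark-shell where the Schema class was actually loaded from; an older avro 1.7.x jar here, for example one pulled in from the Hadoop classpath, would presumably lack the createUnion(Schema...) overload named in the stack trace:
// In spark-shell: print the jar that org.apache.avro.Schema was loaded from on the driver.
classOf[org.apache.avro.Schema].getProtectionDomain.getCodeSource.getLocation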
According to a comment in the linked bug, you should specify an avro version of at least 1.8.0, like so:
spark-submit --packages org.apache.spark:spark-avro_2.12:2.4.3,org.apache.avro:avro:1.9.2 ...
(You may also want to try a different ordering of the packages.)
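Once the session is up, one rough check (a sketch, not from the linked bug) is to call the missing overload directly from spark-shell; if the newer avro jar won the classpath race, this returns a union schema instead of throwing NoSuchMethodError:
import org.apache.avro.Schema
// Calls the varargs createUnion(Schema...) overload that the stack trace reports as missing.
Schema.createUnion(Schema.create(Schema.Type.STRING), Schema.create(Schema.Type.NULL))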
I ran into the same error as you, but after I updated my Spark build to 2.4.4 (Scala 2.11), the problem went away.
This issue appears to be specific to the configuration of our on-premise cluster - single-node builds of HDFS (local Windows, other Linux, etc.) write avro without problems. We will rebuild the problem cluster; I believe the fault is a misconfiguration on that cluster only, so the solution is to rebuild it.
I hit the same exception on the latest Spark. It went away once I added the following dependencies to my pom.
<properties>
    ....
    <spark.version>3.1.2</spark.version>
    <avro.version>1.10.2</avro.version>
</properties>
<dependencies>
    ....
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.12</artifactId>
        <version>${spark.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-hive_2.12</artifactId>
        <version>${spark.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-avro_2.12</artifactId>
        <version>${spark.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.avro</groupId>
        <artifactId>avro</artifactId>
        <version>${avro.version}</version>
    </dependency>
</dependencies>
It looks like you are simply missing the required dependency on the classpath used to launch the application.
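As an end-to-end check once the dependencies line up, a small round-trip from spark-shell should now succeed (a rough smoke test; the output path is just a placeholder):
// Write a tiny DataFrame as avro and read it back.
val testDf = spark.range(5).selectExpr("cast(id as string) as recordidstring")
testDf.write.format("avro").mode("overwrite").save("hdfs:///tmp/avro_smoke_test")
spark.read.format("avro").load("hdfs:///tmp/avro_smoke_test").show()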