Apache Hudi example from spark-shell throws error for Spark 2.3.0
I am trying to run the quick start example (https://hudi.apache.org/docs/quick-start-guide.html) using spark-shell. The Apache Hudi documentation says "Hudi works with Spark-2.x versions".
The environment details are:
Platform: HDP 2.6.5.0-292
Spark version: 2.3.0.2.6.5.279-2
Scala version: 2.11.8
I am using the spark-shell command below
(N.B. - the spark-avro version does not match exactly, because I could not find a corresponding spark-avro dependency for Spark 2.3.2)
spark-shell \
--packages org.apache.hudi:hudi-spark-bundle_2.11:0.6.0,org.apache.spark:spark-avro_2.11:2.4.4,org.apache.avro:avro:1.8.2 \
--conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
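For context, the write below assumes the table name, base path and sample DataFrame from the quick start guide have already been set up in the shell, roughly like this (a sketch following the 0.6.0 quick start; the base path shown here is illustrative, my actual path differs):
import org.apache.hudi.QuickstartUtils._
import scala.collection.JavaConversions._
import org.apache.spark.sql.SaveMode._
import org.apache.hudi.DataSourceWriteOptions._
import org.apache.hudi.config.HoodieWriteConfig._

val tableName = "hudi_trips_cow"
val basePath = "file:///tmp/hudi_trips_cow"  // illustrative; I use a path under my home directory
val dataGen = new DataGenerator

// generate a few sample trip records and load them into a DataFrame
val inserts = convertToStringList(dataGen.generateInserts(10))
val df = spark.read.json(spark.sparkContext.parallelize(inserts, 2))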
I get the following error when I try to write the data:
scala> df.write.format("hudi").
| options(getQuickstartWriteConfigs).
| option(PRECOMBINE_FIELD_OPT_KEY, "ts").
| option(RECORDKEY_FIELD_OPT_KEY, "uuid").
| option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
| option(TABLE_NAME, tableName).
| mode(Overwrite).
| save(basePath)
20/12/27 06:21:15 WARN HoodieSparkSqlWriter$: hoodie table at file:/u/users/j0s0j7j/tmp/hudi_trips_cow already exists. Deleting existing data & overwriting with new data.
java.lang.NoSuchMethodError: org.apache.avro.Schema.createUnion([Lorg/apache/avro/Schema;)Lorg/apache/avro/Schema;
at org.apache.hudi.spark.org.apache.spark.sql.avro.SchemaConverters$.toAvroType(SchemaConverters.scala:185)
at org.apache.hudi.spark.org.apache.spark.sql.avro.SchemaConverters$$anonfun.apply(SchemaConverters.scala:176)
at org.apache.hudi.spark.org.apache.spark.sql.avro.SchemaConverters$$anonfun.apply(SchemaConverters.scala:174)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at org.apache.spark.sql.types.StructType.foreach(StructType.scala:99)
at org.apache.hudi.spark.org.apache.spark.sql.avro.SchemaConverters$.toAvroType(SchemaConverters.scala:174)
at org.apache.hudi.AvroConversionUtils$.convertStructTypeToAvroSchema(AvroConversionUtils.scala:77)
at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:132)
at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:125)
at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:46)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute.apply(SparkPlan.scala:131)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute.apply(SparkPlan.scala:127)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery.apply(SparkPlan.scala:155)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand.apply(DataFrameWriter.scala:654)
at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand.apply(DataFrameWriter.scala:654)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:654)
at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:273)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:267)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:225)
... 68 elided
To me it looks like the correct avro version is not being added to the classpath, or is not being picked up.
Can anyone suggest a workaround? I have been stuck on this for quite some time now.
The problem was that the avro jar was being picked up from spark2/jars/avro-1.7.7.jar, and that is what caused the error.
I had to use the --jars, spark.driver.extraClassPath and spark.executor.extraClassPath options to point to the .ivy2/jars location, so that the default avro jar is overridden.
Spark shell command:
spark-shell \
--packages org.apache.hudi:hudi-spark-bundle_2.11:0.6.0,org.apache.spark:spark-avro_2.11:2.4.4,org.apache.avro:avro:1.8.2 \
--jars $HOME/.ivy2/jars/org.apache.avro_avro-1.8.2.jar \
--conf spark.driver.extraClassPath=$HOME/.ivy2/jars/org.apache.avro_avro-1.8.2.jar \
--conf spark.executor.extraClassPath=$HOME/.ivy2/jars/org.apache.avro_avro-1.8.2.jar \
--conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
Use the snippet below to print the classpath of the spark-shell:
import java.lang.ClassLoader
val cl = ClassLoader.getSystemClassLoader
cl.asInstanceOf[java.net.URLClassLoader].getURLs.foreach(println)
Verify that the avro classes are indeed being picked up from the extraClassPath location:
sc.getClass().getResource("/org/apache/avro/generic/GenericData.class")
res3: java.net.URL = jar:file:/users/joyan/.ivy2/jars/org.apache.avro_avro-1.8.2.jar!/org/apache/avro/generic/GenericData.class
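As an extra sanity check (my own addition, not from the guide): the failing method is the array/varargs Schema.createUnion overload, which is not present in avro-1.7.7 (hence the NoSuchMethodError). A quick reflection call in the shell shows which jar the Schema class comes from and whether that overload resolves:
// where the Schema class was actually loaded from
classOf[org.apache.avro.Schema].getProtectionDomain.getCodeSource.getLocation

// resolves on avro 1.8.2; throws NoSuchMethodException when avro-1.7.7 is on the classpath
classOf[org.apache.avro.Schema].getMethod("createUnion", classOf[Array[org.apache.avro.Schema]])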