在 Spark 中将 DataFrame 保存到 TFRecords 时出错
Error with Saving DataFrame to TFRecords in Spark
我正在尝试将数据帧保存到 spark-shell 中的 TFrecord 文件,这需要 spark-tensorflow-connector jar 的依赖,所以我 运行
spark-shell --jars xxx/xxx/spark-tensorflow-connector_2.11-1.11.0.jar
然后 运行 下面的代码在 spark-shell:
scala> import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}
import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}
scala> val df = Seq((8, "bat"),(8, "abc"), (1, "xyz"), (2, "aaa")).toDF("number", "word")
df: org.apache.spark.sql.DataFrame = [number: int, word: string]
scala> df.show
+------+----+
|number|word|
+------+----+
| 8| bat|
| 8| abc|
| 1| xyz|
| 2| aaa|
+------+----+
scala> var s = df.write.mode(SaveMode.Overwrite).format("tfrecords").option("recordType", "Example")
s: org.apache.spark.sql.DataFrameWriter[org.apache.spark.sql.Row] = org.apache.spark.sql.DataFrameWriter@da1382f
scala> s.save("tmp/tfrecords")
java.lang.NoClassDefFoundError: scala/Product$class
at org.tensorflow.spark.datasources.tfrecords.TensorflowRelation.<init>(TensorflowRelation.scala:29)
at org.tensorflow.spark.datasources.tfrecords.DefaultSource.createRelation(DefaultSource.scala:78)
at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:46)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:90)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute(SparkPlan.scala:175)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery(SparkPlan.scala:213)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:210)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:171)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:122)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:121)
at org.apache.spark.sql.DataFrameWriter.$anonfun$runCommand(DataFrameWriter.scala:944)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId(SQLExecution.scala:100)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:160)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId(SQLExecution.scala:87)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:763)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:944)
at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:396)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:380)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:269)
... 47 elided
Caused by: java.lang.ClassNotFoundException: scala.Product$class
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:355)
at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
... 70 more
Spark 版本为 3.0.0,使用 Scala 版本 2.12.10(Java HotSpot(TM) 64 位服务器虚拟机,Java 1.8.0_261)
问题是您使用的是用 Scala 2.11 编译的 Tensorflow 连接器(注意 jar 名称中的 _2.11
部分)和用 Scala 2.12 编译的 Spark 3.0。
截至目前,还没有针对 Spark 3.0 编译的 Tensorflow 连接器,因此您需要使用使用 Scala 2.11 编译的 Spark 2.4.6。
我正在尝试将数据帧保存到 spark-shell 中的 TFrecord 文件,这需要 spark-tensorflow-connector jar 的依赖,所以我 运行
spark-shell --jars xxx/xxx/spark-tensorflow-connector_2.11-1.11.0.jar
然后 运行 下面的代码在 spark-shell:
scala> import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}
import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}
scala> val df = Seq((8, "bat"),(8, "abc"), (1, "xyz"), (2, "aaa")).toDF("number", "word")
df: org.apache.spark.sql.DataFrame = [number: int, word: string]
scala> df.show
+------+----+
|number|word|
+------+----+
| 8| bat|
| 8| abc|
| 1| xyz|
| 2| aaa|
+------+----+
scala> var s = df.write.mode(SaveMode.Overwrite).format("tfrecords").option("recordType", "Example")
s: org.apache.spark.sql.DataFrameWriter[org.apache.spark.sql.Row] = org.apache.spark.sql.DataFrameWriter@da1382f
scala> s.save("tmp/tfrecords")
java.lang.NoClassDefFoundError: scala/Product$class
at org.tensorflow.spark.datasources.tfrecords.TensorflowRelation.<init>(TensorflowRelation.scala:29)
at org.tensorflow.spark.datasources.tfrecords.DefaultSource.createRelation(DefaultSource.scala:78)
at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:46)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:90)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute(SparkPlan.scala:175)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery(SparkPlan.scala:213)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:210)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:171)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:122)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:121)
at org.apache.spark.sql.DataFrameWriter.$anonfun$runCommand(DataFrameWriter.scala:944)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId(SQLExecution.scala:100)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:160)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId(SQLExecution.scala:87)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:763)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:944)
at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:396)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:380)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:269)
... 47 elided
Caused by: java.lang.ClassNotFoundException: scala.Product$class
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:355)
at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
... 70 more
Spark 版本为 3.0.0,使用 Scala 版本 2.12.10(Java HotSpot(TM) 64 位服务器虚拟机,Java 1.8.0_261)
问题是您使用的是用 Scala 2.11 编译的 Tensorflow 连接器(注意 jar 名称中的 _2.11
部分)和用 Scala 2.12 编译的 Spark 3.0。
截至目前,还没有针对 Spark 3.0 编译的 Tensorflow 连接器,因此您需要使用使用 Scala 2.11 编译的 Spark 2.4.6。