Writing a PySpark DataFrame to a TFRecords file

Question

I have a DataFrame with the following schema, and I want to convert it to TFRecords:

root
 |-- col1: string (nullable = true)
 |-- col2: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- col3: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- col4: array (nullable = true)
 |    |-- element: float (containsNull = true)
 |-- col5: array (nullable = true)
 |    |-- element: float (containsNull = true)
 |-- col6: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- col7: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- col8: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- col9: array (nullable = true)
 |    |-- element: string (containsNull = true)

I am using the spark-tensorflow-connector:

df.write.mode("overwrite").format("tfrecords").option("recordType", "Example").save("targetpath.tf")

Saving the data to TFRecords fails with this error:

java.lang.NoSuchMethodError: scala.Predef$.refArrayOps([Ljava/lang/Object;)Lscala/collection/mutable/ArrayOps

I tried a similar approach on Databricks Community Edition and got a similar error.

Can anyone help?

Answer 1

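The most likely cause (judging by the Maven Central information) is that you are using a connector compiled for Scala 2.11 on a Databricks runtime that uses Scala 2.12. A NoSuchMethodError on scala.Predef$.refArrayOps is a typical symptom of this kind of mismatch, because Scala minor versions such as 2.11 and 2.12 are not binary compatible.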

Either use DBR 6.4 (which is built on Scala 2.11) for this conversion, or compile the connector for Scala 2.12 and use that build.
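
One way to confirm the mismatch before changing anything is to compare the Scala suffix of the connector artifact with the Scala version the cluster reports. The following is only a minimal sketch: the helper names are hypothetical, the artifact id `spark-tensorflow-connector_2.11` is assumed to be the one installed, and `cluster_scala_version` relies on `_jvm`, an internal PySpark attribute that may change between releases.

```python
import re


def scala_binary_version(version_string):
    """Extract the binary version (e.g. '2.12') from a string like 'version 2.12.10'."""
    match = re.search(r"(\d+)\.(\d+)", version_string)
    if match is None:
        raise ValueError(f"no Scala version found in {version_string!r}")
    return f"{match.group(1)}.{match.group(2)}"


def artifact_scala_version(artifact_id):
    """Return the Scala suffix of a Maven artifact id, or None if it has none."""
    match = re.search(r"_(\d+\.\d+)$", artifact_id)
    return match.group(1) if match else None


def is_compatible(artifact_id, runtime_version_string):
    """True when the artifact's Scala suffix matches the runtime's Scala version."""
    return artifact_scala_version(artifact_id) == scala_binary_version(runtime_version_string)


def cluster_scala_version(spark):
    """Ask the driver JVM for its Scala version via py4j (uses the internal `_jvm` handle)."""
    return spark.sparkContext._jvm.scala.util.Properties.versionString()
```

In a notebook with an active `spark` session, `is_compatible("spark-tensorflow-connector_2.11", cluster_scala_version(spark))` returning False would be consistent with the diagnosis above.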
