How to convert spark dataset to scala seq
I have the following case class:
case class Station(id: Long, name: String) extends Node
and a Spark Dataset of stations:
vertices: org.apache.spark.sql.Dataset[Station] = [id: bigint, name: string]
I would like to convert the vertices Dataset into a Seq[Station].
I have found many tutorials on how to create a Dataset from a Seq, but not the other way around. Do you have any hints?
You can use collect, which turns the Dataset into an Array. From there you are free to convert it to a Seq:
val verticesSeq: Seq[Station] = vertices.collect().toSeq
But use it with caution:
Running collect requires moving all the data into the application's driver process, and doing so on a very large dataset can crash the driver process with OutOfMemoryError.
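For reference, here is a minimal end-to-end sketch of the round trip. It assumes a local SparkSession and simplifies Node to an empty trait, since its definition isn't shown in the question; the station names are made up for the demo:

```scala
import org.apache.spark.sql.SparkSession

// Node is assumed to be a plain marker trait here
trait Node
case class Station(id: Long, name: String) extends Node

object CollectExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("collect-example")
      .getOrCreate()
    import spark.implicits._

    // Build a small Dataset[Station] for demonstration
    val vertices = Seq(Station(1L, "Central"), Station(2L, "North")).toDS()

    // collect() pulls every row to the driver as an Array[Station];
    // toSeq then yields the desired Seq[Station]
    val verticesSeq: Seq[Station] = vertices.collect().toSeq
    println(verticesSeq)

    spark.stop()
  }
}
```

If the Dataset is too large to fit in driver memory, vertices.toLocalIterator() lets you stream rows one partition at a time instead of materializing everything at once, at the cost of losing the random-access Seq.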