How to convert a Spark Dataset to a Scala Seq

I have the following case class:

case class Station(id: Long, name: String) extends Node

and a Spark Dataset of stations:

vertices: org.apache.spark.sql.Dataset[Station] = [id: bigint, name: string]

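I would like to convert the vertices Dataset to a Seq[Station]. I found plenty of tutorials on how to create a Dataset from a sequence, but none for the other direction. Do you have any hints?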

You can use collect to turn the Dataset into an Array on the driver. From there you can convert it to a Seq with toSeq:

val verticesSeq: Seq[Station] = vertices.collect().toSeq
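
For context, here is a minimal end-to-end sketch of the round trip, assuming a local SparkSession. The object name, the app name, and the sample station names are made up for illustration, and the case class omits the question's Node parent so the snippet compiles on its own:

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Simplified stand-in for the question's case class (the Node parent is left out here).
case class Station(id: Long, name: String)

object CollectToSeq {
  def main(args: Array[String]): Unit = {
    // Local session for illustration only; a real job would get its master from spark-submit.
    val spark = SparkSession.builder()
      .appName("collect-to-seq")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._ // brings the Encoder for Station and the toDS() syntax into scope

    // Seq -> Dataset (sample data is hypothetical)
    val vertices: Dataset[Station] = Seq(Station(1L, "Central"), Station(2L, "Harbour")).toDS()

    // Dataset -> Array (on the driver) -> Seq
    val verticesSeq: Seq[Station] = vertices.collect().toSeq

    verticesSeq.foreach(println)
    spark.stop()
  }
}
```

Note that collect makes no promise about row order unless the Dataset was explicitly sorted first.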

But use it with caution:

Running collect requires moving all the data into the application's driver process, and doing so on a very large dataset can crash the driver process with OutOfMemoryError.
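
If the Dataset might be too large for the driver, two bounded alternatives are worth considering. The sketch below is one possible way to wrap them; the object and method names are hypothetical, while take and toLocalIterator are existing Dataset methods:

```scala
import org.apache.spark.sql.Dataset
// On Scala 2.13 you may prefer scala.jdk.CollectionConverters._ instead.
import scala.collection.JavaConverters._

object BoundedCollect {
  // take(n) moves at most n rows to the driver, so memory use is capped by n.
  def firstN[T](ds: Dataset[T], n: Int): Seq[T] =
    ds.take(n).toSeq

  // toLocalIterator returns a java.util.Iterator and fetches one partition at a time,
  // so the driver needs roughly as much memory as the largest partition, not the whole Dataset.
  def foreachLocally[T](ds: Dataset[T])(f: T => Unit): Unit =
    ds.toLocalIterator().asScala.foreach(f)
}
```

For example, BoundedCollect.firstN(vertices, 100) would yield a Seq[Station] with at most 100 elements. Neither option helps if you truly need every row as an in-memory Seq; in that case the data has to fit on the driver.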