从 DataFrame 按分区收集集合

Question

我有按列分区的 DataFrame：

val dfDL = spark.read.option("delimiter", ",")
                     .option("header", true)
                     .csv(file.getPath.toUri.getPath)
                     .repartition(col("column_to"))

val structure =  "schema_from" ::
                 "table_from"  ::
                 "column_from" ::
                 "link_type"   ::
                 "schema_to"   ::
                 "table_to"    ::
                 "column_to"   :: Nil

如何按分区获取数组集合？也就是说，对于每个分区我都需要一个集合。例如我需要这个方法：

def getArrays(df: DataFrame): Iterator[Array] = { //Or Iterator[List]
   ???
}

分区的所有值：

val allTargetCol = df.select(col("column_to")).distinct().collect().map(_.getString(0))

Answer 1

如果您知道分区值，则可以遍历每个分区值，调用过滤器然后收集。

伪代码

partitions = []

for partition_value in partition_values_list:
  partitions.append(df.filter(f.col('partiton_column') == partition_value).collect())

否则，您需要先制作一个list/array不同的分区值，然后重复上述步骤。

从 DataFrame 按分区收集集合

Collect collections by partitions from DataFrame

collections

scala

partitioning

dataframe

apache-spark