Spark partitioning of related data into row groups

Question

With Apache Spark we can partition a DataFrame into separate files when saving it in Parquet format.

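In the way Parquet files are written, each partition file contains multiple row groups, and each row group carries column statistics for the rows in that group (for example, min/max values and the number of NULL values).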

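Now, in some situations it seems desirable to organize the Parquet file so that related data appears together in one or more row groups. This would be a secondary level of partitioning inside each partition file (the partition files themselves being the first level).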
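To make the benefit concrete, here is a small pure-Python simulation (not Parquet itself) of per-row-group min/max statistics. The helper names `row_group_stats` and `groups_to_scan` and the sample values are made up for illustration; the point is only that sorting before chunking narrows each group's range, so a reader can skip more groups.

```python
def row_group_stats(values, group_size):
    """Split values into consecutive row groups and return (min, max) per group."""
    groups = [values[i:i + group_size] for i in range(0, len(values), group_size)]
    return [(min(g), max(g)) for g in groups]


def groups_to_scan(stats, target):
    """Count row groups whose [min, max] range could contain the target value."""
    return sum(1 for lo, hi in stats if lo <= target <= hi)


# Illustrative data: 12 values written in 3 row groups of 4 rows each
values = [7, 1, 9, 3, 8, 2, 6, 4, 5, 0, 11, 10]

unsorted_stats = row_group_stats(values, 4)
sorted_stats = row_group_stats(sorted(values), 4)

print(unsorted_stats)  # [(1, 9), (2, 8), (0, 11)]
print(sorted_stats)    # [(0, 3), (4, 7), (8, 11)]

# A filter such as "value == 5" has to read every unsorted group,
# but only one group once the data is sorted.
print(groups_to_scan(unsorted_stats, 5))  # 3
print(groups_to_scan(sorted_stats, 5))    # 1
```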

This can be done with, for example, pyarrow, but how can we do it with a distributed SQL engine such as Spark?

Answer 1

In addition to partitioning, you can also sort the data so that related data is grouped into a limited set of partitions. A statement from Databricks:

Z-Ordering is a technique to colocate related information in the same set of files

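(
    df
    # Sort first: orderBy is a DataFrame method and is not available on the writer
    .orderBy(df.col_1.desc())
    .write.option("header", True)  # "header" only matters for CSV; Parquet does not use it
    .partitionBy("col_2")
    .parquet("/path/to/output")  # placeholder path; a terminal call is needed to actually write
)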
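A variation that is often used for this purpose is to repartition by the partition column and then sort inside each Spark partition, so that every output file is written in sorted order and its row groups cover narrow value ranges. The sketch below is an assumption about how this could be wired up, not an official recipe; `write_clustered` and its parameter names are hypothetical, and `df` is expected to be a PySpark DataFrame.

```python
def write_clustered(df, path, partition_col, sort_col):
    """Write df as Parquet, partitioned by partition_col and sorted by sort_col
    inside each output file (hypothetical helper, for illustration only)."""
    (
        df
        # Send all rows with the same partition value to the same task
        .repartition(partition_col)
        # Sort locally (no global shuffle sort); leading with the partition
        # column keeps each partition value's rows contiguous within the task
        .sortWithinPartitions(partition_col, sort_col)
        .write
        .partitionBy(partition_col)
        .parquet(path)
    )
```

Called as `write_clustered(df, "/path/to/output", "col_2", "col_1")` (placeholder path), this should give one directory per `col_2` value, with rows ordered by `col_1` inside each file. Whether the row-group statistics end up as tight as hoped also depends on the row group size and the Spark version, so it is worth inspecting the written files.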
