What is the difference between partitioning and bucketing in Spark?

I am trying to optimize a join query between two Spark DataFrames, call them df1 and df2 (joined on the common column "SaleId"). df1 is very small (5M), so I broadcast it across the nodes of the Spark cluster. df2 is very large (200M rows), so I tried to bucket/repartition it by "SaleId".
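For reference, the broadcast part of this setup looks roughly like the following sketch; the read paths are placeholders, not real ones:

from pyspark.sql.functions import broadcast

df1 = spark.read.parquet("/path/to/df1")  # small side
df2 = spark.read.parquet("/path/to/df2")  # large side (200M rows)

# broadcast() hints Spark to ship df1 to every executor,
# so df2 does not need to be shuffled for the join itself.
joined = df2.join(broadcast(df1), "SaleId")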

In Spark, what is the difference between partitioning the data by a column and bucketing the data by a column?

For example:

Partitioning:

df2 = df2.repartition(10, "SaleId")

Bucketing:

df2.write.format('parquet').bucketBy(10, 'SaleId').mode("overwrite").saveAsTable('bucketed_table')

After each of these techniques, I joined df2 with df1.

I can't figure out which of these is the right technique to use. Thank you.

repartition is for use as part of an action within the same Spark job: it redistributes the DataFrame in memory by the given column, so later stages of that job (such as the join) find the data already partitioned by the key. Nothing is persisted, so the work is repeated in every new application.
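A minimal sketch of that pattern, assuming df1 and df2 exist as in the question:

# Repartition by the join key in memory; this only benefits the current job.
df2 = df2.repartition(10, "SaleId")
joined = df2.join(df1, "SaleId")
joined.count()  # the action that actually triggers the shuffle and the join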

bucketBy is for output, i.e. for writes. It persists the data pre-hashed into a fixed number of buckets, and thus avoids the shuffle in the next Spark application that reads the table, typically as part of an ETL pipeline. Think of JOINs. See https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/4861715144695760/2994977456373837/5701837197372837/latest.html, which is an excellent, concise read. Note that bucketed tables can currently only be read back by Spark.
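A sketch of the bucketed approach; it assumes a Hive-enabled SparkSession, the table names are illustrative, and sortBy is optional:

df2.write.format('parquet').bucketBy(10, 'SaleId').sortBy('SaleId').mode('overwrite').saveAsTable('bucketed_df2')
df1.write.format('parquet').bucketBy(10, 'SaleId').sortBy('SaleId').mode('overwrite').saveAsTable('bucketed_df1')

# In a later Spark application, both sides are already hash-distributed
# on SaleId with the same number of buckets, so the join needs no shuffle.
t1 = spark.table('bucketed_df1')
t2 = spark.table('bucketed_df2')
joined = t2.join(t1, 'SaleId')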