Adaptive Query Execution in Spark 3

I just learned about the new Adaptive Query Execution (AQE) feature introduced in Spark 3.0.

One thing strikes me as odd, though. Take the following example of switching join strategies:

Stage 1 and stage 2 have already run to completion (including the map-side shuffle) before AQE decides to switch to a broadcast join.

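My question: since both datasets have already been written to disk for the shuffle (the map-side shuffle), isn't it too late to switch to a broadcast? In most cases, is this switch really more efficient than carrying on with the reduce-side shuffle? I assume so, since the Databricks folks made this choice, but I want to make sure I'm not missing anything.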

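"Since both datasets have already been written to disk for the shuffle (the map-side shuffle), isn't it too late to switch to a broadcast?" That's a perfectly valid concern, but "better late than never", right? ;-) The Spark Performance Tuning guide says: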

...This is not as efficient as planning a broadcast hash join in the first place, but it’s better than keep doing the sort-merge join, as we can save the sorting of both the join sides, and read shuffle files locally to save network traffic (if spark.sql.adaptive.localShuffleReader.enabled is true)

spark.sql.adaptive.localShuffleReader.enabled is a runtime configuration that was added in Spark 3.0 and is set to true by default.
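If you want to experiment with this yourself, here is a minimal PySpark-flavoured sketch of the settings involved. The names AQE_CONF and apply_aqe_conf are my own helpers, not part of Spark, and I'm assuming you already have a SparkSession at hand; the only Spark call used is spark.conf.set(key, value).

```python
# Sketch only: AQE_CONF and apply_aqe_conf are hypothetical helper names,
# not Spark APIs. The configuration keys themselves are real Spark SQL settings.
AQE_CONF = {
    # Turn adaptive query execution on (it is off by default in Spark 3.0).
    "spark.sql.adaptive.enabled": "true",
    # Let the converted broadcast join read shuffle files locally
    # (already true by default, set here only to make it explicit).
    "spark.sql.adaptive.localShuffleReader.enabled": "true",
    # Size threshold in bytes under which a join side may be broadcast
    # (10 MB, which is also the default).
    "spark.sql.autoBroadcastJoinThreshold": str(10 * 1024 * 1024),
}


def apply_aqe_conf(spark, conf=AQE_CONF):
    """Apply each setting to an existing SparkSession and return the session."""
    for key, value in conf.items():
        spark.conf.set(key, value)
    return spark


# Usage, assuming pyspark is installed:
#   from pyspark.sql import SparkSession
#   spark = apply_aqe_conf(SparkSession.builder.getOrCreate())
#   df_a.join(df_b, "id").explain()  # df_a and df_b are placeholder DataFrames
```

With AQE enabled, comparing the plan before and after the query runs is a reasonable way to check whether the sort-merge join was actually converted for your data.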

I also think that once executor-side broadcast (SPARK-17556) lands, it could help here, or AQE could build on top of it.
