Why can coalesce lead to too few nodes for processing?
I'm trying to understand Spark partitioning, and in a blog post I came across this passage:
However, you should understand that you can drastically reduce the parallelism of your data processing — coalesce is often pushed up further in the chain of transformation and can lead to fewer nodes for your processing than you would like. To avoid this, you can pass shuffle = true. This will add a shuffle step, but it also means that the reshuffled partitions will be using full cluster resources if possible.
My understanding is that coalesce takes the data from the executors holding the least data and shuffles it to the already existing executors via a hash partitioner. However, I can't work out what the author is trying to say in this paragraph. Could someone explain what this passage means?
As you said in your question, "coalesce means taking the data from the executors holding the least data and shuffling it to the already existing executors via a hash partitioner". In effect this means the following:
- The number of partitions is reduced.
- The main difference between repartition and coalesce is that coalesce moves less data across partitions than repartition, reducing the amount of shuffling and therefore improving efficiency.
- Passing shuffle = true simply distributes the data evenly across the nodes, which is the same as using repartition(). Use shuffle = true if you suspect your data could end up skewed across the nodes after a coalesce; see the sketch below.
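To make the difference concrete, here is a minimal sketch (Scala, RDD API; the sample data and partition counts are assumptions for illustration only):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("coalesce-sketch").getOrCreate()
    // Assume an RDD spread over 1000 partitions.
    val rdd = spark.sparkContext.parallelize(1 to 1000000, 1000)

    val narrow = rdd.coalesce(10)                 // no shuffle: merges co-located partitions, can skew
    val even   = rdd.coalesce(10, shuffle = true) // adds a shuffle step, spreads data evenly
    val same   = rdd.repartition(10)              // repartition(n) is exactly coalesce(n, shuffle = true)

    println(s"${narrow.getNumPartitions} ${even.getNumPartitions} ${same.getNumPartitions}")  // 10 10 10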
Hope this answers your question.
Coalesce has some not so obvious effects due to Spark Catalyst. E.g.:
Let's say you had a parallelism of 1000, but you only wanted to write 10 files at the end. You might think you could do:
load().map(…).filter(…).coalesce(10).save()
However, Spark will effectively push the coalesce operation down to as early a point as possible, so this will execute as:
load().coalesce(10).map(…).filter(…).save()
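A common way to keep the upstream parallelism is to force a shuffle boundary with repartition, the DataFrame counterpart of coalesce(n, shuffle = true). This is a sketch under assumptions, not taken from the article: the DataFrame df, the filter predicate, and the paths are hypothetical.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.col

    val spark = SparkSession.builder.appName("repartition-sketch").getOrCreate()
    val df = spark.read.parquet("/data/in")   // hypothetical input, read with full parallelism

    df.filter(col("status") === "active")     // still executes across all input partitions
      .repartition(10)                        // shuffle boundary: only the write runs with 10 tasks
      .write.parquet("/data/out")             // hypothetical output path; yields ~10 files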
You can read more in the excellent article I am quoting from here, which I stumbled upon some time ago: https://medium.com/airbnb-engineering/on-spark-hive-and-small-files-an-in-depth-look-at-spark-partitioning-strategies-a9a364f908
To summarize: Catalyst's handling of coalesce can reduce concurrency early in the pipeline. I think that is what the quoted passage is alluding to, although of course every case is different, and JOINs and aggregations are generally not affected in this way, since the default of 200 shuffle partitions applies to those Spark operations.
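For reference, that 200 is the spark.sql.shuffle.partitions setting, which controls the post-shuffle partition count for DataFrame joins and aggregations; a quick sketch:

    // Defaults to "200"; governs partition counts after shuffles in joins/aggregations.
    println(spark.conf.get("spark.sql.shuffle.partitions"))
    spark.conf.set("spark.sql.shuffle.partitions", "400")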