"already computed partitions that can short-circuit the computation of a parent RDD" 是什么意思？

What's the meaning of "already computed partitions that can short-circuit the computation of a parent RDD"?

The Spark paper (http://www.eecs.berkeley.edu/Pubs/TechRpts/2014/EECS-2014-12.pdf) shows the figure below.

I don't understand what "already computed partitions that can short-circuit the computation of a parent RDD" means. Could you explain it and give one or two examples?

Imagine you have an RDD and you have called cache() or persist() on it to keep it in memory. After that, you ran some operations on top of this RDD, which caused it to be computed and cached. But:

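The RDD might be too big to be cached in full, so some of its partitions will not be cached in memory. In that case the Spark console shows the percentage of the RDD that is persisted.
The Spark cache is LRU, which means that some partitions of the RDD might be evicted from memory if another RDD accessed afterwards needs the memory.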
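To make the two cases above concrete, here is a minimal sketch in plain Python. It is a toy model, not Spark's API: `PartitionCache`, `fraction_cached`, and the RDD names "a" and "b" are hypothetical, and capacity is counted in partitions rather than bytes. It assumes, as the Spark paper describes, that a new partition never evicts partitions of its own RDD.

```python
from collections import OrderedDict


class PartitionCache:
    """Toy LRU cache keyed by (rdd_name, partition_index). Not Spark's API."""

    def __init__(self, capacity):
        self.capacity = capacity          # max number of partitions held
        self._store = OrderedDict()       # oldest entry first

    def put(self, rdd, index, data):
        key = (rdd, index)
        if key in self._store:
            self._store.move_to_end(key)  # mark as most recently used
            return True
        while len(self._store) >= self.capacity:
            # Least recently used partition that belongs to a different RDD.
            victim = next((k for k in self._store if k[0] != rdd), None)
            if victim is None:
                return False              # no room: this partition stays uncached
            del self._store[victim]
        self._store[key] = data
        return True

    def get(self, rdd, index):
        key = (rdd, index)
        if key not in self._store:
            return None
        self._store.move_to_end(key)
        return self._store[key]

    def fraction_cached(self, rdd, num_partitions):
        cached = sum(1 for (name, _) in self._store if name == rdd)
        return cached / num_partitions


cache = PartitionCache(capacity=3)

# Case 1: RDD "a" has 4 partitions but only 3 fit, so it is partially cached.
for i in range(4):
    cache.put("a", i, [i, i])
fraction_a_before = cache.fraction_cached("a", 4)

# Case 2: another RDD is accessed later and pushes out the oldest partition of "a".
cache.put("b", 0, [9])
fraction_a_after = cache.fraction_cached("a", 4)
```

In this sketch `fraction_a_before` is 0.75 and `fraction_a_after` is 0.5, which is the kind of partially persisted RDD the percentage in the console reflects.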

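That is why some partitions in Figure 2.5 are painted black: they have already been persisted, so no additional work is needed to recompute them. Some RDDs might be cached entirely, others only partially.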

This is what the sentence is telling you: "The boundaries of the stages are the shuffle operations required for wide dependencies, or any already computed partitions that can short-circuit the computation of a parent RDD. The scheduler then launches tasks to compute missing partitions from each stage until it has computed the target RDD." It means that if some partitions of an RDD have already been computed, they are not recomputed in your call, and they act as a boundary for scheduling. Figure 2.5 shows that "Stage 1" is skipped entirely because its result is already cached.
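Since the question asks for an example, the short-circuit itself can be sketched in plain Python. Again this is only a toy model under stated assumptions: `ToyRDD`, its `cached` dictionary, and the `computations` counter are hypothetical names, only a one-to-one (narrow) dependency is modelled, and shuffles are left out.

```python
class ToyRDD:
    """Toy stand-in for an RDD with a one-to-one dependency on its parent."""

    def __init__(self, name, parent=None, fn=None, source=None):
        self.name = name
        self.parent = parent      # parent ToyRDD, or None for a source RDD
        self.fn = fn              # per-element function applied to the parent
        self.source = source      # list of partitions, for a source RDD only
        self.cached = {}          # partition index -> already computed data
        self.computations = 0     # how many partitions were really computed

    def compute(self, index):
        if index in self.cached:
            # Already computed partition: return it without touching the parent.
            return self.cached[index]
        self.computations += 1
        if self.parent is None:
            return list(self.source[index])
        return [self.fn(x) for x in self.parent.compute(index)]


base = ToyRDD("base", source=[[1, 2], [3, 4], [5, 6]])
doubled = ToyRDD("doubled", parent=base, fn=lambda x: x * 2)

# Pretend an earlier job left partitions 0 and 1 of "doubled" in the cache.
doubled.cached[0] = [2, 4]
doubled.cached[1] = [6, 8]

# A new job needs all three partitions of "doubled".
result = [doubled.compute(i) for i in range(3)]
```

Here only partition 2 is missing, so `base.computations` ends up as 1 instead of 3: the two cached partitions cut off the walk up the lineage. If all three partitions had been cached, `base` would not be touched at all, which corresponds to the skipped "Stage 1" in the figure.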

"already computed partitions that can short-circuit the computation of a parent RDD" 是什么意思？

What's the meaning of "already computed partitions that can short-circuit the computation of a parent RDD"?

apache-spark

rdd

apache-spark-sql