Spark-1.6.0+: spark.shuffle.memoryFraction deprecated - When will spill happen?

As of recent Spark versions, the shuffle behavior has changed significantly.

Problem: the Spark UI has stopped showing whether a spill happened (and how much was spilled). In one of my experiments I tried to simulate a situation in which the shuffle writes on an executor would exceed "JVM Heap Size" * spark.shuffle.memoryFraction * spark.shuffle.safetyFraction (based on this article), but I did not see any disk-spill logs. Is there a way to get this information?

PS: Forgive me if this sounds like a purely theoretical question.
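For illustration, the legacy (pre-1.6) spill threshold referenced above can be worked out as follows. This is a sketch using the documented pre-1.6 defaults (spark.shuffle.memoryFraction = 0.2, spark.shuffle.safetyFraction = 0.8) and an assumed 4 GB executor heap, not code from Spark itself:

```scala
// Legacy (pre-Spark-1.6) shuffle memory model: once in-memory shuffle data
// exceeded this threshold, the executor would begin spilling to disk.
val heapBytes      = 4L * 1024 * 1024 * 1024   // example: 4 GB executor heap
val memoryFraction = 0.2                       // spark.shuffle.memoryFraction default
val safetyFraction = 0.8                       // spark.shuffle.safetyFraction default
val spillThreshold = (heapBytes * memoryFraction * safetyFraction).toLong
println(spillThreshold)                        // 687194767 bytes, roughly 655 MB
```

So on a 4 GB heap under the old model, only about 655 MB was available for shuffle before spilling started, which is the ceiling the experiment above was trying to exceed.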

With Spark 1.6.0, the memory management system was updated. In short, there is no longer dedicated cache/shuffle memory; all of the memory is available for either operation. From the release notes:

Automatic memory management: Another area of performance gains in Spark 1.6 comes from better memory management. Before Spark 1.6, Spark statically divided the available memory into two regions: execution memory and cache memory. Execution memory is the region that is used in sorting, hashing, and shuffling, while cache memory is used to cache hot data. Spark 1.6 introduces a new memory manager that automatically tunes the size of different memory regions. The runtime automatically grows and shrinks regions according to the needs of the executing application. For many applications, this will mean a significant increase in available memory that can be used for operators such as joins and aggregations, without any user tuning.
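The sizing of the new unified region can be sketched as follows. This is an illustration using the Spark 1.6 defaults (spark.memory.fraction = 0.75, spark.memory.storageFraction = 0.5, and 300 MB of reserved system memory) and an assumed 4 GB heap, not Spark's actual implementation:

```scala
// Unified memory model (Spark 1.6+): execution and storage share one pool
// carved out of the heap after a fixed reserved portion.
val heapBytes = 4L * 1024 * 1024 * 1024                  // example: 4 GB executor heap
val reserved  = 300L * 1024 * 1024                       // reserved system memory
val unified   = ((heapBytes - reserved) * 0.75).toLong   // shared execution+storage pool
val storageRegion = (unified * 0.5).toLong               // soft boundary; execution can borrow
println(unified)                                         // 2985295872 bytes, roughly 2.8 GB
```

Note that the storage/execution split is a soft boundary: execution can evict cached blocks and borrow from the storage region when it needs more memory, which is why spilling now depends on overall pressure rather than a fixed shuffle fraction.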

This jira ticket gives background reasoning for the change, and this paper discusses the new memory management system in depth.
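As for recovering the spill numbers themselves: task metrics still expose `memoryBytesSpilled` and `diskBytesSpilled`, so one way (a sketch, not from the question or release notes) is to register a `SparkListener` and aggregate those fields per task:

```scala
import java.util.concurrent.atomic.AtomicLong
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

// Aggregates spill metrics across all completed tasks in the application.
class SpillListener extends SparkListener {
  val memorySpilled = new AtomicLong(0)
  val diskSpilled   = new AtomicLong(0)

  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    val metrics = taskEnd.taskMetrics
    if (metrics != null) {             // metrics can be null for failed tasks
      memorySpilled.addAndGet(metrics.memoryBytesSpilled)
      diskSpilled.addAndGet(metrics.diskBytesSpilled)
    }
  }
}

// Usage: register before running the job, read the counters afterwards.
//   val listener = new SpillListener
//   sc.addSparkListener(listener)
//   ... run job ...
//   println(s"spilled to disk: ${listener.diskSpilled.get} bytes")
```

Nonzero values here indicate that spills did occur even when the UI columns stay hidden; the executor logs at INFO level should also contain lines about spilling the in-memory map to disk, though the exact wording varies across versions.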