Spark Web UI shows non-zero Storage Memory numbers even though I don't persist data

My Spark application shows a non-zero amount under Storage Memory even though I never call persist or cache. Does Spark cache my data even when I don't use persist/cache?

Spark's optimizations may be broadcasting the smaller datasets to every worker to save on network usage.

Quoting the Scaladocs:

A broadcast variable. Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks. They can be used, for example, to give every node a copy of a large input dataset in an efficient manner. Spark also attempts to distribute broadcast variables using efficient broadcast algorithms to reduce communication cost.
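
One way to check this yourself is sketched below in Scala. It runs a job that never calls persist or cache but does use a broadcast variable, then asks the SparkContext how much storage memory is in use. The object name `StorageMemoryCheck`, the `lookup` map, and the data sizes are made up for illustration; `getExecutorMemoryStatus` and `getRDDStorageInfo` are existing SparkContext methods (the latter is marked as a developer API). The exact numbers will depend on your Spark version and on when unused blocks get cleaned up, so treat it as an experiment rather than a guaranteed reproduction.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical example object; names and sizes are illustrative only.
object StorageMemoryCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("storage-memory-check")
      .master("local[2]") // assumption: local mode, just for experimenting
      .getOrCreate()
    val sc = spark.sparkContext

    // A read-only lookup table shipped to executors as a broadcast variable.
    val lookup = sc.broadcast((1 to 100000).map(i => i -> s"value-$i").toMap)

    // No persist() or cache() anywhere in this job.
    val hits = sc.parallelize(1 to 1000000)
      .filter(i => lookup.value.contains(i))
      .count()
    println(s"matching elements: $hits")

    // RDDs that are actually cached; none were requested here.
    println(s"cached RDDs: ${sc.getRDDStorageInfo.length}")

    // Per block manager: (max memory for caching, remaining memory for caching).
    sc.getExecutorMemoryStatus.foreach { case (blockManager, (max, remaining)) =>
      println(s"$blockManager: ${max - remaining} bytes of storage memory in use")
    }

    spark.stop()
  }
}
```

If the used amount is non-zero while the list of cached RDDs is empty, the memory is being held by something other than your own persisted data, such as broadcast blocks.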

In addition, in-memory shuffles take up RAM.

Quoting Medium:

Internally, results from individual map tasks are kept in memory until they can’t fit. Then, these are sorted based on the target partition and written to a single file. On the reduce side, tasks read the relevant sorted blocks.