Why does the default persist() store the data in the JVM heap as unserialized objects?

I am learning Apache Spark and trying to clarify the concepts related to caching and persistence of RDDs in Spark.

So, according to the documentation on persistence in the book "Learning Spark":

To avoid computing an RDD multiple times, we can ask Spark to persist the data. When we ask Spark to persist an RDD, the nodes that compute the RDD store their partitions. Spark has many levels of persistence to choose from based on what our goals are.

In Scala and Java, the default persist() will store the data in the JVM heap as unserialized objects. In Python, we always serialize the data that persist stores, so the default is instead stored in the JVM heap as pickled objects. When we write data out to disk or off-heap storage, that data is also always serialized.
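
To make the quoted defaults concrete, here is a minimal Scala sketch (assuming an existing SparkContext named `sc`, as in spark-shell; the RDD contents are arbitrary) that inspects the storage level persist() picks by default and shows how to request a serialized level explicitly:

```scala
import org.apache.spark.storage.StorageLevel

val rdd = sc.parallelize(1 to 1000000)

// Default level in Scala/Java: MEMORY_ONLY -- deserialized objects on the JVM heap.
rdd.persist()
println(rdd.getStorageLevel.useMemory)    // true
println(rdd.getStorageLevel.deserialized) // true

// To cache serialized bytes instead, ask for it explicitly.
// unpersist() first, since an RDD's storage level cannot be changed while set.
rdd.unpersist()
rdd.persist(StorageLevel.MEMORY_ONLY_SER)
println(rdd.getStorageLevel.deserialized) // false
```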

But why does the default persist() store the data in the JVM heap as unserialized objects?

Because there is no serialization or deserialization overhead: the operation is cheap, and cached data can be loaded without extra work. SerDe is expensive and significantly increases the overall cost, and keeping both serialized and deserialized objects around (particularly with standard Java serialization) can double memory usage in the worst case.
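
To see the trade-off in practice, here is a rough, illustrative Scala sketch (again assuming a SparkContext `sc`; the dataset size and the `timeMs` helper are made up for illustration, and actual timings will vary by cluster and serializer) that caches the same data both ways and times a repeated read:

```scala
import org.apache.spark.storage.StorageLevel

// MEMORY_ONLY keeps plain JVM objects, so repeated reads pay no SerDe cost;
// MEMORY_ONLY_SER keeps compact serialized buffers, so every read deserializes.

def timeMs(body: => Unit): Double = {
  val t0 = System.nanoTime()
  body
  (System.nanoTime() - t0) / 1e6
}

val base = sc.parallelize(1 to 5000000).map(i => (i, i.toString))

val plain = base.persist(StorageLevel.MEMORY_ONLY)
plain.count()                       // materialize the cache
val plainMs = timeMs(plain.count()) // served from deserialized objects

val ser = base.map(identity).persist(StorageLevel.MEMORY_ONLY_SER)
ser.count()
val serMs = timeMs(ser.count())     // pays deserialization on every read

println(f"read from MEMORY_ONLY: $plainMs%.1f ms, from MEMORY_ONLY_SER: $serMs%.1f ms")
```

The Storage tab of the Spark UI shows the in-memory size of each cached RDD, where the serialized level is typically noticeably smaller: the classic space-versus-CPU trade-off the default resolves in favor of CPU.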