What is the difference between the MEMORY_ONLY and MEMORY_AND_DISK caching levels in Spark?
How do the MEMORY_ONLY and MEMORY_AND_DISK caching levels behave differently in Spark?
The documentation says:
Storage Level
Meaning
MEMORY_ONLY
Store RDD as deserialized Java objects in the JVM. If the RDD does not
fit in memory, some partitions will not be cached and will be
recomputed on the fly each time they're needed. This is the default
level.
MEMORY_AND_DISK
Store RDD as deserialized Java objects in the JVM. If the RDD does not
fit in memory, store the partitions that don't fit on disk, and read
them from there when they're needed.
MEMORY_ONLY_SER
Store RDD as serialized Java objects (one byte array per partition).
This is generally more space-efficient than deserialized objects,
especially when using a fast serializer, but more CPU-intensive to
read.
MEMORY_AND_DISK_SER
Similar to MEMORY_ONLY_SER, but spill partitions that don't fit in
memory to disk instead of recomputing them on the fly each time
they're needed.
DISK_ONLY
Store the RDD partitions only on disk.
MEMORY_ONLY_2, MEMORY_AND_DISK_2, etc.
Same as the levels above, but replicate each partition on two cluster
nodes.
OFF_HEAP (experimental)
Store RDD in serialized format in Tachyon. Compared to
MEMORY_ONLY_SER, OFF_HEAP reduces garbage collection overhead and
allows executors to be smaller and to share a pool of memory, making
it attractive in environments with large heaps or multiple concurrent
applications. Furthermore, as the RDDs reside in Tachyon, the crash of
an executor does not lead to losing the in-memory cache. In this mode,
the memory in Tachyon is discardable. Thus, Tachyon does not attempt
to reconstruct a block that it evicts from memory.
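This means that with MEMORY_ONLY, Spark tries to keep partitions in memory only. If some partitions cannot be kept in memory, or are removed from RAM because a node is lost, Spark recomputes them using lineage information. With MEMORY_AND_DISK, Spark always keeps the computed partitions cached. It tries to keep them in RAM, but any partition that does not fit is spilled to disk.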
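The difference described above can be illustrated with a toy model in plain Python. This is only a sketch and not Spark's actual block manager: the `ToyCache` class, its `memory_slots` capacity, and the `compute_partition` function are all hypothetical names invented for illustration, and eviction is ignored entirely. It counts how often a partition has to be recomputed from lineage when only two of four partitions fit in memory.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class ToyCache:
    """Toy partition cache (hypothetical; not Spark's real implementation)."""
    memory_slots: int   # number of partitions that fit in memory (made-up unit)
    use_disk: bool      # False models MEMORY_ONLY, True models MEMORY_AND_DISK
    memory: Dict[int, List[int]] = field(default_factory=dict)
    disk: Dict[int, List[int]] = field(default_factory=dict)
    computations: int = 0  # how many times a partition was computed from lineage

    def get(self, pid: int, compute: Callable[[int], List[int]]) -> List[int]:
        if pid in self.memory:      # cached in RAM
            return self.memory[pid]
        if pid in self.disk:        # spilled earlier, read back from "disk"
            return self.disk[pid]
        data = compute(pid)         # not cached anywhere: run the lineage
        self.computations += 1
        if len(self.memory) < self.memory_slots:
            self.memory[pid] = data
        elif self.use_disk:
            self.disk[pid] = data   # MEMORY_AND_DISK: spill what does not fit
        # MEMORY_ONLY: the partition is simply not cached
        return data


def compute_partition(pid: int) -> List[int]:
    """Stand-in for an expensive transformation chain."""
    return [pid * 10 + i for i in range(3)]


def run(use_disk: bool, passes: int = 3) -> int:
    """Read 4 partitions `passes` times with room for only 2 in memory."""
    cache = ToyCache(memory_slots=2, use_disk=use_disk)
    for _ in range(passes):
        for pid in range(4):
            cache.get(pid, compute_partition)
    return cache.computations


if __name__ == "__main__":
    print("MEMORY_ONLY computations:    ", run(use_disk=False))
    print("MEMORY_AND_DISK computations:", run(use_disk=True))
```

In this model the memory-only variant recomputes the two partitions that do not fit on every pass, while the memory-and-disk variant computes each partition once and reads the spilled ones back. In Spark itself the level is chosen when persisting, for example `rdd.persist(StorageLevel.MEMORY_AND_DISK)` in PySpark after `from pyspark import StorageLevel`.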
As explained in the documentation, the persistence levels compare as follows in terms of efficiency:
Level                 Space used   CPU time   In memory   On disk   Serialized
-------------------------------------------------------------------------------
MEMORY_ONLY           High         Low        Y           N         N
MEMORY_ONLY_SER       Low          High       Y           N         Y
MEMORY_AND_DISK       High         Medium     Some        Some      Some
MEMORY_AND_DISK_SER   Low          High       Some        Some      Y
DISK_ONLY             Low          High       N           Y         Y
MEMORY_AND_DISK and MEMORY_AND_DISK_SER spill to disk if there is too much data to fit in memory.