Does PySpark cache DataFrames by default?

Question

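If I read a file in PySpark:

data = spark.read.csv("file.csv")

then for the life of the Spark session, "data" is available in memory, correct? So if I call data.show() 5 times, it will not read from disk 5 times. Is that correct? If so, why do I need:

data.cache()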

Answer 1

If I read a file in PySpark: data = spark.read.csv("file.csv"), then for the life of the Spark session, "data" is available in memory, correct?

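No. Spark evaluates lazily, so nothing is loaded into memory at this point. The data is read only when an action runs, which in your case is the first call to show().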

So if I call data.show() 5 times, it will not read from disk 5 times. Is that correct?

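No. Each call to show() re-evaluates the DataFrame. Caching the DataFrame prevents that re-evaluation and forces later actions to read the data from the cache. Note that cache() is itself lazy: the data is stored the first time an action runs after the call.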
