Cache() in Pyspark Dataframe

Question

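I have a dataframe to which I need to apply several transformations, and I want to perform all of them on the same dataframe. If I need to use cache, should I cache the dataframe after every operation performed on it?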

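from pyspark.sql import functions as f
from pyspark.sql.functions import regexp_replace, to_date

# Explode the `area` array and flatten the resulting struct into top-level columns
df = df.selectExpr("*", "explode(area)").select("*", "col.*").drop('col', 'area')
df.cache()
# Combine first_name and last_name into a single full_name column
df = df.withColumn('full_name', f.concat(f.col('first_name'), f.lit(' '), f.col('last_name'))).drop('first_name', 'last_name')
df.cache()
# Strip unwanted characters from `date`, then parse the result into a date column
df = df.withColumn("cleaned_map", regexp_replace("date", "[^0-9T]", "")).withColumn("date_type", to_date("cleaned_map", "ddMMyyyy")).drop('date', 'cleaned_map')
df.cache()
# Keep only the rows whose date could be parsed
df = df.filter(df.date_type.isNotNull())
df.show()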

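Should I add it like this, or is caching once enough?

I would also like to know: if I used multiple dataframes instead of one in the code above, should I include a cache at every transformation? Thanks a lot!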

Answer 1

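The answer is simple: whether you write df = df.cache() or df.cache(), both mark the RDD underlying that particular DataFrame for caching. As soon as you apply another operation, a new RDD is created, and that new RDD is not cached, so it is up to you to decide which DF/RDD you want to cache(). Caching only helps when the same DataFrame is reused by more than one action, so there is no need to cache after every step. Also, try to avoid unnecessary caching, because the cached data stays in memory.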

Below is the source code of cache() from the Spark documentation:

def cache(self): 
    """ 
    Persist this RDD with the default storage level (C{MEMORY_ONLY_SER}). 
    """ 
    self.is_cached = True 
    self.persist(StorageLevel.MEMORY_ONLY_SER) 
    return self

Tags: apache-spark, apache-spark-sql, pyspark, pyspark-dataframes