如何使用 Pyspark 缓存增强数据帧

How to cache an augmented dataframe using Pyspark

我有一个大数据框，每次转换都会增加它，我需要优化执行时间。我的问题是在每次转换后制作一个 cache() ？

partitions=100
df = df.repartition(partitions, "uuid").cache()

df_aug = tran_1(df).cache()
df_aug = tran_2(df_aug).cache()
.
.
df_aug = tran_n(df_aug)

缓存不是提高性能的灵丹妙药 - 在您的场景中，它可能会减慢一切。当您多次访问同一数据集时（例如在数据探索中），使用缓存是个好主意。如果对单个数据集进行多次转换，将导致序列化和存储，而它只会被读取一次。

The data will be cached only after an action.
And you are performing cache after all transformations.
This is not required. You can use  cache after first transformation.
Then apply a small action. Then use cached data  in subsequent transformations.

df.cache()
df.count()
df_aug = tran_1(df)
df_aug = tran_2(df_aug)

This approach will be more optimized.