What is the time complexity and memory footprint of dataframe operations in Spark?

time-complexity
memory-consumption
space-complexity
apache-spark-sql

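What are the algorithmic complexity and/or memory consumption of dataframe operations in Spark? I can't find any information in the documentation.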

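One useful example would be an answer about the memory/disk footprint when extending a dataframe with another column (withColumn()): in Python, with automatic garbage collection, is it better to do table = table.withColumn(…), or does extended_table = table.withColumn() take the same memory?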

PS: Assume both tables are persisted with persist().


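Assigning the result to the same variable or to a different one makes no difference. Spark only uses these assignments to build a lineage graph from the operations you specify. Nothing is computed until you call an actual Spark action; at that point the operations in the lineage graph are executed.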

Additional memory is needed only when you cache intermediate results via .cache() or .persist().
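To make the lineage idea concrete, here is a toy model in plain Python. It is not Spark and not how Spark is implemented; the LazyTable class and its with_column, persist and collect methods are hypothetical stand-ins for a DataFrame and for withColumn(), persist() and an action such as collect(). It only sketches the behaviour described above: transformations record a plan, an action runs it, and extra memory is held only once a persisted table has been computed.

```python
class LazyTable:
    """Toy stand-in for a DataFrame: it stores a plan, not data."""

    def __init__(self, parent=None, step=None, source=None):
        self.parent = parent      # previous node in the lineage graph
        self.step = step          # (column_name, function); None for the root
        self.source = source      # list of row dicts; only set on the root
        self.persisted = False
        self.cache = None         # filled only by an action after persist()

    def with_column(self, name, fn):
        # Transformation: returns a new plan node, computes nothing.
        return LazyTable(parent=self, step=(name, fn))

    def persist(self):
        # Only sets a flag; no rows are stored yet.
        self.persisted = True
        return self

    def collect(self):
        # "Action": walks the lineage and actually computes the rows.
        if self.cache is not None:
            return self.cache
        if self.parent is None:
            rows = [dict(r) for r in self.source]
        else:
            name, fn = self.step
            rows = [{**r, name: fn(r)} for r in self.parent.collect()]
        if self.persisted:
            self.cache = rows     # this is where extra memory is used
        return rows


calls = []  # records every time the column function actually runs

def double(row):
    calls.append(row["x"])
    return row["x"] * 2

table = LazyTable(source=[{"x": 1}, {"x": 2}])
extended_table = table.with_column("y", double)  # bind the result to a new name
table = table.with_column("y", double)           # or rebind the old name
calls_before_action = len(calls)                 # no action yet, nothing has run

result_a = extended_table.collect()              # the plan runs now
result_b = table.collect()                       # same plan, same result

table.persist()
table.collect()                  # first action after persist() fills the cache
calls_after_fill = len(calls)
table.collect()                  # served from the cache; double() is not re-run
calls_after_reuse = len(calls)
```

In this sketch the two variable names lead to equivalent plans and identical results, and only the table that was both persisted and computed holds a stored copy of its rows.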
