How persist works on a derived DataFrame in Scala and its performance impact

Question

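Can you explain the effect of persisting and unpersisting DataFrames in Scala, using the example below? What effect does persist/unpersist have on derived DataFrames? In the example below, I unpersist dcRawAll because it is no longer used. However, I have read that we should not unpersist a DataFrame until all actions on the DataFrames derived from it have completed, because otherwise the cache is removed (or is never created). (Assume that every DataFrame has more operations performed on it before it is unpersisted.)

Could you explain the performance impact of the query below? What can be done to optimize it?

Thanks in advance for your help.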

    val dcRawAll = dataframe.select("C1","C2","C3","C4")   //dataframe is persisted
    dcRawAll.persist()

    val statsdcRawAll = dcRawAll.count()

    val dc = dcRawAll.where(col("c1").isNotNull)

    dc.persist()
    dcRawAll.unpersist(false)

    val statsdc = dc.count()

    val dcclean = dc.where(col("c2") === "SomeValue")
    dcclean.persist()
    dc.unpersist()

Answer 1

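Your code, as currently written, gets essentially no benefit from caching. Keep in mind that .persist() is lazy: it only marks the DataFrame to be cached, and nothing is actually stored until an action runs on that DataFrame.

Your call to dcRawAll.persist() does not assign the result. In Spark's Dataset API that still takes effect, because persist() marks the Dataset it is called on and returns that same Dataset, so chaining it onto the definition (as I do below) is a matter of style. Even with that out of the way, the caching does not help in the way you want. Below I have annotated your code to explain in more detail what likely happens during execution.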

    import org.apache.spark.sql.functions.col

    // dcRawAll will contain a DataFrame that is cached after its next action
    val dcRawAll = dataframe.select("C1","C2","C3","C4").persist()

    // after this line, dcRawAll has been computed and then cached
    val statsdcRawAll = dcRawAll.count()

    // dc will contain a DataFrame that is cached after its next action
    val dc = dcRawAll.where(col("c1").isNotNull).persist()

    // at this point you've removed the dcRawAll cache without ever having used it,
    // since dc has not had an action performed on it yet.
    // if you want to make use of this cache, move the unpersist _after_ the
    // dc.count()
    dcRawAll.unpersist(false)

    // dcRawAll is recalculated from scratch, then dc is calculated from that
    // and then cached
    val statsdc = dc.count()

    // dcclean will contain a DataFrame that is cached after its next action
    // (note ===, the Column equality operator; == would yield a plain Boolean)
    val dcclean = dc.where(col("c2") === "SomeValue").persist()

    // at this point you've removed the dc cache without ever having used it.
    // if you perform a dcclean.count() before this, it will use the dc cache
    // and populate the cache for dcclean, to be used by some other dcclean action
    dc.unpersist()

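Basically, you need to make sure that you don't .unpersist() a DataFrame until after the DataFrames that depend on it have had their actions performed. Read this answer (and the linked documentation) to better understand the difference between transformations and actions.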
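
The ordering described above can be sketched as follows. This is a minimal illustration, not a definitive implementation. It assumes Spark 2.4 or later, where unpersist() on a parent is documented not to drop cached data built on top of it, and an input DataFrame with columns C1 to C4. The wrapper object, the function name countsWithOrderedCaching and the extra dcclean.count() are illustrative additions that are not in the original code.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object PersistOrdering {
  // Hypothetical helper: returns (count of all rows, count of rows with
  // non-null C1, the cached and already materialized dcclean DataFrame).
  def countsWithOrderedCaching(dataframe: DataFrame): (Long, Long, DataFrame) = {
    val dcRawAll = dataframe.select("C1", "C2", "C3", "C4").persist()
    val statsdcRawAll = dcRawAll.count() // action: fills the dcRawAll cache

    val dc = dcRawAll.where(col("C1").isNotNull).persist()
    val statsdc = dc.count()             // action: reads dcRawAll's cache, fills dc's
    dcRawAll.unpersist(false)            // only now is the parent cache released

    val dcclean = dc.where(col("C2") === "SomeValue").persist()
    dcclean.count()                      // action: reads dc's cache, fills dcclean's
    dc.unpersist()                       // released after the dependent action has run

    (statsdcRawAll, statsdc, dcclean)
  }
}
```

Each unpersist() now comes after an action on the DataFrame derived from it, so every cache is read at least once before it is dropped. Whether the intermediate caches pay for themselves still depends on how many further actions reuse each DataFrame.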

Tags: performance, scala, persist, dataframe, apache-spark