How persist works on a derived DataFrame in Scala and its performance impact

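Can you explain, using the example below, the effect of persisting and unpersisting a DataFrame in Scala? What effect does persist/unpersist have on derived DataFrames? In the example below, I unpersist dcRawAll because it is no longer used. However, I have read that we should not unpersist a DataFrame until all actions on the DataFrames derived from it have completed, because otherwise the cache is removed (or never created). (Assume that every DataFrame has more operations performed on it before it is unpersisted.)

Could you explain the performance impact of the code below? What can be done to optimize it?

Thanks in advance for your help.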

    val dcRawAll = dataframe.select("C1","C2","C3","C4")   //dataframe is persisted
    dcRawAll.persist()

    val statsdcRawAll = dcRawAll.count()

    val dc = dcRawAll.where(col("c1").isNotNull)

    dc.persist()
    dcRawAll.unpersist(false)

    val statsdc = dc.count()

    val dcclean = dc.where(col("c2") === "SomeValue")
    dcclean.persist()
    dc.unpersist()

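As currently written, your code gets essentially no benefit from caching. Keep in mind that `.persist()` is lazy: it only marks the DataFrame for caching, and nothing is actually stored until an action runs on that DataFrame. It also returns the same DataFrame, so it can be chained onto the expression that builds it.

Your call to `dcRawAll.persist()` does not assign the result, which is harmless because the DataFrame itself is what gets marked. Even so, the caching does not help in the way you want it to. Below I have commented your code to explain in more detail what is likely to happen during execution.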

    import org.apache.spark.sql.functions.col

    // dcRawAll will contain a DataFrame that will be cached after its next action
    val dcRawAll = dataframe.select("C1","C2","C3","C4").persist()

    // after this line, dcRawAll has been calculated and then cached
    val statsdcRawAll = dcRawAll.count()

    // dc will contain a DataFrame that will be cached after its next action
    val dc = dcRawAll.where(col("c1").isNotNull).persist()

    // at this point, you have removed the dcRawAll cache without ever reading from it,
    // since dc has not had an action performed on it yet
    // if you want to make use of this cache, move the unpersist _after_ the
    // dc.count()
    dcRawAll.unpersist(false)

    // dcRawAll is recalculated from scratch, then dc is calculated from that
    // and then cached
    val statsdc = dc.count()

    // dcclean will contain a DataFrame that will be cached after its next action
    // (note `===`: Column comparison; a plain `==` would not build a filter expression)
    val dcclean = dc.where(col("c2") === "SomeValue").persist()

    // at this point, you have removed the dc cache without ever reading from it
    // if you perform a dcclean.count() before this, it will utilize the dc cache
    // and stage the cache for dcclean, to be used on some other dcclean action
    dc.unpersist()

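Basically, you need to make sure you do not `.unpersist()` a DataFrame until every DataFrame that depends on it has had an action performed on it. Read this answer (and the linked documentation) to better understand the difference between transformations and actions.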
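A minimal sketch of that ordering, under some assumptions: the small in-memory `dataframe`, the object name `PersistOrderSketch`, and the `run` helper are all hypothetical stand-ins for illustration, and a local SparkSession is supplied by the caller. Each `unpersist` is moved to after the first action on the DataFrame derived from it, so that action can read from the parent's cache. How unpersisting a parent affects caches derived from it has varied between Spark releases, so it is worth verifying this on your own version.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.col

    object PersistOrderSketch {
      // Returns the three counts so the ordering can be checked from a caller.
      def run(spark: SparkSession): (Long, Long, Long) = {
        import spark.implicits._

        // Hypothetical stand-in for the question's already-persisted `dataframe`.
        val dataframe = Seq(
          (Option("a"), "SomeValue", 1, 1),
          (Option("b"), "Other", 2, 2),
          (Option.empty[String], "SomeValue", 3, 3)
        ).toDF("C1", "C2", "C3", "C4").persist()

        // Marked for caching; the count() below is the action that fills the cache.
        val dcRawAll = dataframe.select("C1", "C2", "C3", "C4").persist()
        val statsdcRawAll = dcRawAll.count()

        // dc.count() can read from the dcRawAll cache and fills the dc cache.
        val dc = dcRawAll.where(col("C1").isNotNull).persist()
        val statsdc = dc.count()
        // Only now is the parent released.
        dcRawAll.unpersist(false)

        // Same pattern one level down: run the action first, then release dc.
        val dcclean = dc.where(col("C2") === "SomeValue").persist()
        val statsdcclean = dcclean.count()
        dc.unpersist()

        (statsdcRawAll, statsdc, statsdcclean)
      }

      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .master("local[*]")
          .appName("persist-order-sketch")
          .getOrCreate()
        println(run(spark))
        spark.stop()
      }
    }

If a DataFrame is only read once after it is built, it may be cheaper not to persist it at all, since filling a cache has its own cost.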