How persist works on a derived DataFrame in Scala and its performance impact
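Can you explain the effect of persisting and unpersisting dataframes in Scala, using the example below? What effect does persist/unpersist have on derived dataframes? In the example below, I unpersist dcRawAll because it is no longer used. However, I have read that we should not unpersist a dataframe until all actions on the dataframes derived from it have completed, because otherwise the cache gets removed (or is never created). (Assume every dataframe has more operations performed on it before it is unpersisted.)
Can you explain the performance impact on the query below? What can be done to optimize it?
Thanks in advance for your help.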
val dcRawAll = dataframe.select("C1","C2","C3","C4") //dataframe is persisted
dcRawAll.persist()
val statsdcRawAll = dcRawAll.count()
val dc = dcRawAll.where(col("c1").isNotNull)
dc.persist()
dcRawAll.unpersist(false)
val statsdc = dc.count()
val dcclean = dc.where(col("c2") === "SomeValue") // === builds a Column; == would yield a Boolean
dcclean.persist()
dc.unpersist()
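As written, your code gets almost no benefit from caching. Keep in mind that .persist() is lazy: it only marks the Dataframe to be cached, and nothing is actually stored until an action runs on that Dataframe.
Calling dcRawAll.persist() as its own statement, without assigning the result, is fine: persist() marks the Dataframe it is called on and returns that same Dataframe, so chaining it onto the definition (as I do below) is only a matter of style. The real problem is the order of your calls: each cache is removed before anything has read from it, so the caching does not help in the way you want. Below I have annotated your code to explain in more detail what is likely happening during execution.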
//dcRawAll will contain a Dataframe that will be cached after its next action
val dcRawAll = dataframe.select("C1","C2","C3","C4").persist()
//after this line, dcRawAll is calculated, then cached
val statsdcRawAll = dcRawAll.count()
//dc will contain a Dataframe that will be cached after its next action
val dc = dcRawAll.where(col("c1").isNotNull).persist()
//at this point, you've removed the dcRawAll cache never having used it
//since dc has never had an action performed yet
//if you want to make use of this cache, move the unpersist _after_ the
//dc.count()
dcRawAll.unpersist(false)
//dcRawAll is recalculated from scratch, and then dc is calculated from that
//and then cached
val statsdc = dc.count()
//dcclean will contain a dataframe that will be cached after its next action
val dcclean = dc.where(col("c2") === "SomeValue").persist()
//at this point, you've removed the dc cache having never used it
//if you perform a dcclean.count() before this, it will utilize the dc cache
//and stage the cache for dcclean, to be used on some other dcclean action
dc.unpersist()
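Basically, you need to make sure you do not .unpersist() a Dataframe until an action has been performed on every Dataframe that depends on it. Read the linked answer (and its linked documentation) to better understand the difference between transformations and actions.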
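One way to apply this ordering to the code from the question is sketched below. It is a minimal sketch rather than a definitive implementation: the object name PersistOrdering, the run helper, the extra dcclean.count() and the three-row sample data are assumptions added for illustration, and it assumes Spark SQL is on the classpath and runs in local mode.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object PersistOrdering {
  // Hypothetical helper: returns (statsdcRawAll, statsdc, statsdcclean).
  def run(dataframe: DataFrame): (Long, Long, Long) = {
    val dcRawAll = dataframe.select("C1", "C2", "C3", "C4").persist()
    val statsdcRawAll = dcRawAll.count() // action: computes dcRawAll and fills its cache

    val dc = dcRawAll.where(col("C1").isNotNull).persist()
    val statsdc = dc.count()             // action: reads the dcRawAll cache, fills the dc cache
    dcRawAll.unpersist(false)            // dc is materialized, so dcRawAll can be released

    val dcclean = dc.where(col("C2") === "SomeValue").persist()
    val statsdcclean = dcclean.count()   // action: reads the dc cache, fills the dcclean cache
    dc.unpersist()                       // dcclean is materialized, so dc can be released

    // dcclean is left cached for later actions; unpersist it once it is no longer needed.
    (statsdcRawAll, statsdc, statsdcclean)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("persist-ordering")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample rows: one null C1, one C2 that does not match the filter.
    val rows: Seq[(Option[String], String, Int, Int)] = Seq(
      (Some("a"), "SomeValue", 1, 1),
      (None, "SomeValue", 2, 2),
      (Some("b"), "Other", 3, 3)
    )
    val sample = rows.toDF("C1", "C2", "C3", "C4")

    println(run(sample))
    spark.stop()
  }
}
```

Each unpersist() now comes after an action on the derived Dataframe, so every cache is read at least once before it is dropped. Whether caching the intermediate Dataframes pays off at all depends on how many actions reuse them; a Dataframe that feeds only one later action gains little from being persisted.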