Spark中使用缓存怎么理解？

Question

在我的 Scala/Spark 应用程序中，我创建了 DataFrame。我计划在整个程序中多次使用这个 Dataframe。这就是为什么我决定对该 DataFrame 使用 .cache() 方法的原因。正如您在循环中看到的那样，我使用不同的值多次过滤 DataFrame。出于某种原因 .count() 方法 return 给我的结果总是一样的。事实上，它必须 return 两个不同的计数值。另外，我注意到 Mesos 中的奇怪行为。感觉.cache()方法没有被执行。创建DataFrame后，程序进入到这部分代码if (!df.head(1).isEmpty)，执行了很长时间。我假设缓存进程会运行很长一段时间，其他进程会使用这个缓存和运行很快。您认为问题是什么？

import org.apache.spark.sql.DataFrame

var df: DataFrame = spark
    .read
    .option("delimiter", "|")
    .csv("/path_to_the_files/")
    .filter(col("col5").isin("XXX", "YYY", "ZZZ"))

df.cache()

var array1 = Array("111", "222")

var array2 = Array("333")

var storage = Array(array1, array2)

if (!df.head(1).isEmpty) {
    for (item <- storage) {
        df.filter(
            col("col1").isin(item:_*)
        )

        println("count: " + df.count())
    }
}

Answer 1

In fact, it must return two different count values.

为什么？您在同一个 df 上调用它。也许你的意思是

val df1 = df.filter(...)
println("count: " + df1.count())

I assumed that the caching process would run for a long time, and the other processes would use this cache and run quickly.

它确实如此，但只有当依赖于此数据帧的第一个 action 被执行时，head 才是那个动作。所以你应该期待

the program goes to this part of code if (!df.head(1).isEmpty) and performs it for a very long time

如果没有缓存，您也会为两个 df.count() 调用获得相同的时间，除非 Spark 检测到它并自行启用缓存。

Spark中使用缓存怎么理解？

How do I understand that caching is used in Spark?

scala

dataframe

mesos

apache-spark