在 Spark 中，对象和变量如何保存在内存中以及跨不同的执行器？

Question

我正在使用：

星火 3.0.0
斯卡拉 2.12

我正在编写一个带有自定义流源的 Spark 结构化流作业。在执行 spark 查询之前，我创建了一堆元数据，供我的 Spark Streaming 作业使用

我想了解这些元数据是如何跨不同的执行程序保存在内存中的？

示例代码：

case class JobConfig(fieldName: String, displayName: String, castTo: String)


val jobConfigs:List[JobConfig] = build(); //build the job configs

  val query = spark
    .readStream
    .format("custom-streaming")
    .load

  query
    .writeStream
    .trigger(Trigger.ProcessingTime(2, TimeUnit.MINUTES))
    .foreachBatch { (batchDF: DataFrame, batchId: Long) => {
      CustomJobExecutor.start(jobConfigs) //CustomJobExecutor does data frame transformations and save the data in PostgreSQL.
    }
    }.outputMode(OutputMode.Append()).start().awaitTermination()

需要帮助来理解以下内容：

在示例代码中，Spark 将如何在不同的执行程序之间将“jobConfigs”保存在内存中？

广播有什么额外的好处吗？

保留无法反序列化的变量的有效方法是什么？

Answer 1

为每个任务复制局部变量，同时仅为每个执行者复制广播变量。来自 docs

Spark actions are executed through a set of stages, separated by distributed “shuffle” operations. Spark automatically broadcasts the common data needed by tasks within each stage. The data broadcasted this way is cached in serialized form and deserialized before running each task. This means that explicitly creating broadcast variables is only useful when tasks across multiple stages need the same data or when caching the data in deserialized form is important.

意思是如果你的jobConfig足够大，使用变量的tasks和stage的数量明显多于executor的数量，或者反序列化比较耗时，那么广播变量可以做一个区别。在其他情况下，他们不会。

在 Spark 中，对象和变量如何保存在内存中以及跨不同的执行器？

In Spark, how objects and variables are kept in memory and across different executors?

scala

apache-spark

spark-structured-streaming