火花流过程中使用的全局 class 变量:它是广播变量吗?

global class variable used in a spark streaming process : is it a broadcasted variable?

我只需要知道在 SparkStreaming 进程中使用的全局 public class 变量是否会被视为广播变量。

至此,我成功地将一个预先设置的变量"inventory"用于JavaDStream转换。

class Foo {

  public static Map<String,String> inventory;

  public static void main(String args) {

    inventory = Inventory.load(); // here i set the variable

    SparkSession sparkSession = ...

    JavaStreamingContext ssc = ... // here i initialize the Spark Streaming Context

    JavaInputDStream<ConsumerRecord<String, String>> records = ...

    JavaDStream<Map<String,Object>> processedRecords = records.flatMap(rawRecord->{
      return f(rawRecord,inventory); // just an example...
    }

  }

}

我的理解是lambda表达式(rawRecord)中的部分是分布式的,然后我假设"inventory"被广播给执行该过程的每个执行者,是吗?

是的,您必须广播该变量以供分布式环境中的所有执行程序使用。

全局 class 变量不同于广播变量。

Using a class variable is fine but this can be inefficient, especially for large variables such as a lookup table or a machine learning model. The reason for this is that when you use a variable in a closure or a class variable in your case , it must be deserialised on the worker nodes many times (one per task). Moreover, if you use the same variable in multiple Spark actions and jobs, it will be re-sent to the workers with every job instead of once.

广播变量是共享的、不可变的变量 被缓存在集群中的每台机器上,而不是与每个任务序列化

您只需

 Broadcast<Map<String,String>> broadcast = ssc.sparkContext().broadcast(inventory);

并访问它

broadcast.value().get(key)