火花流过程中使用的全局 class 变量:它是广播变量吗?
global class variable used in a spark streaming process : is it a broadcasted variable?
我只需要知道在 SparkStreaming 进程中使用的全局 public class 变量是否会被视为广播变量。
至此,我成功地将一个预先设置的变量"inventory"用于JavaDStream转换。
class Foo {
public static Map<String,String> inventory;
public static void main(String args) {
inventory = Inventory.load(); // here i set the variable
SparkSession sparkSession = ...
JavaStreamingContext ssc = ... // here i initialize the Spark Streaming Context
JavaInputDStream<ConsumerRecord<String, String>> records = ...
JavaDStream<Map<String,Object>> processedRecords = records.flatMap(rawRecord->{
return f(rawRecord,inventory); // just an example...
}
}
}
我的理解是lambda表达式(rawRecord)中的部分是分布式的,然后我假设"inventory"被广播给执行该过程的每个执行者,是吗?
是的,您必须广播该变量以供分布式环境中的所有执行程序使用。
全局 class 变量不同于广播变量。
Using a class variable is fine but this can be inefficient, especially
for large variables such as a lookup table or a machine learning
model. The reason for this is that when you use a variable in a
closure or a class variable in your case , it must be deserialised on
the worker nodes many times (one per task). Moreover, if you use the
same variable in multiple Spark actions and jobs, it will be re-sent
to the workers with every job instead of once.
广播变量是共享的、不可变的变量
被缓存在集群中的每台机器上,而不是与每个任务序列化
您只需
Broadcast<Map<String,String>> broadcast = ssc.sparkContext().broadcast(inventory);
并访问它
broadcast.value().get(key)
我只需要知道在 SparkStreaming 进程中使用的全局 public class 变量是否会被视为广播变量。
至此,我成功地将一个预先设置的变量"inventory"用于JavaDStream转换。
class Foo {
public static Map<String,String> inventory;
public static void main(String args) {
inventory = Inventory.load(); // here i set the variable
SparkSession sparkSession = ...
JavaStreamingContext ssc = ... // here i initialize the Spark Streaming Context
JavaInputDStream<ConsumerRecord<String, String>> records = ...
JavaDStream<Map<String,Object>> processedRecords = records.flatMap(rawRecord->{
return f(rawRecord,inventory); // just an example...
}
}
}
我的理解是lambda表达式(rawRecord)中的部分是分布式的,然后我假设"inventory"被广播给执行该过程的每个执行者,是吗?
是的,您必须广播该变量以供分布式环境中的所有执行程序使用。
全局 class 变量不同于广播变量。
Using a class variable is fine but this can be inefficient, especially for large variables such as a lookup table or a machine learning model. The reason for this is that when you use a variable in a closure or a class variable in your case , it must be deserialised on the worker nodes many times (one per task). Moreover, if you use the same variable in multiple Spark actions and jobs, it will be re-sent to the workers with every job instead of once.
广播变量是共享的、不可变的变量 被缓存在集群中的每台机器上,而不是与每个任务序列化
您只需
Broadcast<Map<String,String>> broadcast = ssc.sparkContext().broadcast(inventory);
并访问它
broadcast.value().get(key)