Where are broadcast variables stored in Spark?
According to the official documentation, "Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks".
Suppose I set --num-executors to 10 in my spark-submit command. My cluster has 2 nodes; assume 5 executors are launched on node 1 and the other 5 on node 2.
scala> val broadcastVar = sc.broadcast(Array(1, 2, 3))
broadcastVar: org.apache.spark.broadcast.Broadcast[Array[Int]] = Broadcast(0)
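Going by the documentation, will broadcastVar be available in the storage memory of each executor, meaning that 10 copies of broadcastVar exist?
Or
will broadcastVar be available on each node's disk, so that each of the 2 nodes gets one copy and all executors running on that node read from it?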
Look at how broadcast is implemented in the TorrentBroadcast class:
The driver divides the serialized object into small chunks and
stores those chunks in the BlockManager of the driver.
On each executor, the executor first attempts to fetch the object from its BlockManager. If
it does not exist, it then uses remote fetches to fetch the small chunks from the driver and/or
other executors if available. Once it gets the chunks, it puts the chunks in its own
BlockManager, ready for other executors to fetch from.
From this we can see that broadcast variables are stored in each executor's BlockManager.
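The fetch sequence described in that comment can be sketched in plain Scala. This is only a toy model under stated assumptions, not Spark's implementation: ToyBlockManager, ToyTorrentBroadcast and Demo are hypothetical names made up for illustration, the "remote fetch" is an in-memory lookup, and there is no serialization, networking or eviction.

```scala
import scala.collection.mutable

// Hypothetical stand-in for a BlockManager: one private chunk store per process.
final class ToyBlockManager(val owner: String) {
  private val blocks = mutable.Map.empty[Int, Array[Byte]]
  def get(id: Int): Option[Array[Byte]] = blocks.get(id)
  def put(id: Int, data: Array[Byte]): Unit = blocks.update(id, data)
  def hasAll(numChunks: Int): Boolean = (0 until numChunks).forall(blocks.contains)
}

object ToyTorrentBroadcast {
  // Driver side: split the (already serialized) payload into chunks and
  // store them in the driver's own block store. Returns the chunk count.
  def publish(driver: ToyBlockManager, payload: Array[Byte], chunkSize: Int): Int = {
    val chunks = payload.grouped(chunkSize).toVector
    chunks.zipWithIndex.foreach { case (chunk, i) => driver.put(i, chunk) }
    chunks.size
  }

  // Executor side: try the local store first; otherwise take the chunk from
  // the first peer that has it and keep a local copy for later readers.
  def read(local: ToyBlockManager, peers: Seq[ToyBlockManager], numChunks: Int): Array[Byte] = {
    val chunks = (0 until numChunks).map { i =>
      local.get(i).getOrElse {
        val fetched = peers.iterator
          .map(_.get(i))
          .collectFirst { case Some(chunk) => chunk }
          .getOrElse(throw new NoSuchElementException(s"chunk $i not found"))
        local.put(i, fetched)
        fetched
      }
    }
    Array.concat(chunks: _*)
  }
}

object Demo {
  // Ten executors, as in the question; which node they run on plays no role here.
  def run(): Int = {
    val driver = new ToyBlockManager("driver")
    val executors = (1 to 10).map(i => new ToyBlockManager(s"executor-$i"))
    val payload = Array[Byte](1, 2, 3, 4, 5, 6, 7) // stand-in for a serialized value
    val numChunks = ToyTorrentBroadcast.publish(driver, payload, chunkSize = 3)

    executors.foreach { e =>
      // Other executors are asked before the driver.
      val peers = executors.filter(_ ne e) :+ driver
      val value = ToyTorrentBroadcast.read(e, peers, numChunks)
      assert(value.sameElements(payload))
    }
    // Number of executors that now hold every chunk locally.
    executors.count(_.hasAll(numChunks))
  }

  def main(args: Array[String]): Unit = println(run()) // prints 10
}
```

In this model the first executor finds nothing on its peers and falls back to the driver, while later executors pick the chunks up from an executor that already has them. Every executor ends up holding its own full set of chunks.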
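So each executor has its own copy, managed by its BlockManager. In your example that means 10 copies, one per executor, rather than one per node.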
The same goes for accumulator variables.