Spark Streaming and mutable broadcast variable

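I found this link, https://gist.github.com/BenFradet/c47c5c7247c5d5d0f076, which shows an implementation for updating a broadcast variable in Spark. Is this a valid implementation, meaning that the executors will see the latest value of the broadcast variable?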

The code you are referring to uses the Broadcast.unpersist() method. The Spark API documentation for Broadcast.unpersist() says: "Asynchronously delete cached copies of this broadcast on the executors. If the broadcast is used after this is called, it will need to be re-sent to each executor." There is also an overloaded method, unpersist(boolean blocking), which blocks until unpersisting has completed.

So the answer depends on how you use the broadcast variable in your Spark application. Spark does not automatically re-broadcast a broadcast variable when you mutate it; the driver has to resend it, and executors only see a new value once that has happened.

The Spark documentation says you should not modify a broadcast variable (it is meant to be immutable), to avoid inconsistent processing across executor nodes. However, the unpersist() and destroy() methods are available if you want to control the broadcast variable's life cycle. Please refer to the Spark JIRA issue https://issues.apache.org/jira/browse/SPARK-6404.
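As a rough illustration of "the driver has to resend it", here is a minimal sketch of a driver-side holder that unpersists the old broadcast and creates a new one. `RebroadcastVar`, `refresh`, and `RebroadcastDemo` are hypothetical names made up for this example; they are not part of Spark and not necessarily how the linked gist does it. Only `SparkContext.broadcast`, `Broadcast.unpersist(blocking)` and `Broadcast.value` are Spark API calls. The demo assumes a local master.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.broadcast.Broadcast
import scala.reflect.ClassTag

// Hypothetical helper (not part of Spark): keeps the current Broadcast handle
// on the driver and swaps it for a fresh one when the value changes.
class RebroadcastVar[T: ClassTag](sc: SparkContext, initial: T) {
  @volatile private var current: Broadcast[T] = sc.broadcast(initial)

  // Driver side: the handle to capture in the closures of the next job.
  def get: Broadcast[T] = current

  // Driver side: drop the executors' cached copies of the old broadcast,
  // then broadcast the new value under a new handle.
  def refresh(newValue: T, blocking: Boolean = true): Unit = {
    current.unpersist(blocking)
    current = sc.broadcast(newValue)
  }
}

object RebroadcastDemo {
  def main(args: Array[String]): Unit = {
    // Assumption: local mode with two threads, for illustration only.
    val conf = new SparkConf().setMaster("local[2]").setAppName("rebroadcast-demo")
    val sc = new SparkContext(conf)
    val holder = new RebroadcastVar(sc, Map("threshold" -> 1))
    val data = sc.parallelize(1 to 5)

    // First job: tasks read the value through the handle captured here.
    val b1 = holder.get
    val before = data.filter(_ > b1.value("threshold")).count()

    // The driver replaces the value; later jobs must capture the new handle.
    holder.refresh(Map("threshold" -> 3))
    val b2 = holder.get
    val after = data.filter(_ > b2.value("threshold")).count()

    println(s"before=$before after=$after")
    sc.stop()
  }
}
```

In a streaming job, the same idea would mean calling `holder.get` inside `foreachRDD` or `transform`, since that function body runs on the driver for each batch, so every batch's tasks capture whichever handle is current at that point. A closure that captured an older handle keeps reading the older value.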