What is the difference between passing a local variable and a broadcast variable to a Spark pipeline?
Consider the following code:
val rdd: RDD[String] = domainsRDD()
val backlistDomains: Set[String] = readDomainsBlacklist()
rdd.filter(domain => !backlistDomains.contains(domain))
versus the code that broadcasts the blacklisted domains:
val rdd: RDD[String] = domainsRDD()
val bBacklistDomains: Broadcast[Set[String]] = sc.broadcast(readDomainsBlacklist())
rdd.filter(domain => !bBacklistDomains.value.contains(domain))
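Apart from the fact that a broadcast variable can be removed from the executors (via bBacklistDomains.destroy()), are there any other reasons to use it (performance?)?
(Note that in the first code example the set of domains, backlistDomains, is a local variable, so there are no serialization issues.)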
There is none. Local variables used in a stage are broadcast automatically:
Spark automatically broadcasts the common data needed by tasks within each stage.
The data broadcasted this way is cached in serialized form and deserialized before running each task.
This means that explicitly creating broadcast variables is only useful when tasks across multiple stages need the same data or when caching the data in deserialized form is important.
From the docs: https://spark.apache.org/docs/latest/rdd-programming-guide.html#broadcast-variables
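As a rough illustration of the "multiple stages" case from the quote, here is a minimal sketch in which the same blacklist is used by two separate actions, each of which runs its own stage. The object name `BroadcastSketch`, the helper `countAllowedAndBlocked`, and the sample domains are assumptions made up for this example; only `sc.broadcast`, `Broadcast.value`, and `Broadcast.destroy()` come from the Spark API.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.rdd.RDD

object BroadcastSketch {
  // Hypothetical helper: runs two actions (two jobs, so at least two stages)
  // that both read the same broadcast blacklist.
  def countAllowedAndBlocked(
      domains: RDD[String],
      blacklist: Broadcast[Set[String]]): (Long, Long) = {
    val allowed = domains.filter(d => !blacklist.value.contains(d)).count()
    val blocked = domains.filter(d => blacklist.value.contains(d)).count()
    (allowed, blocked)
  }

  def main(args: Array[String]): Unit = {
    // Local master with two threads, purely for illustration.
    val conf = new SparkConf().setMaster("local[2]").setAppName("broadcast-sketch")
    val sc = new SparkContext(conf)
    try {
      // Sample data invented for this sketch.
      val domains = sc.parallelize(Seq("good.org", "spam.com", "ok.net", "ads.io"))
      val blacklist = sc.broadcast(Set("spam.com", "ads.io"))

      val (allowed, blocked) = countAllowedAndBlocked(domains, blacklist)
      println(s"allowed=$allowed blocked=$blocked")

      // Release the broadcast data on the driver and the executors.
      // The variable must not be used again after this call.
      blacklist.destroy()
    } finally {
      sc.stop()
    }
  }
}
```

With a plain local `Set[String]` captured in the closures, the result would be the same; going by the quoted documentation, the difference should only matter when the set is large and reused across stages, since the explicit broadcast is shipped once and kept in deserialized form.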