How to get a Spark broadcast variable in the executor? [spark-core]
Here I have an application with two jobs. In the first job I want to set a broadcast variable to "true" and access it in the executors. In the second job I want to set the broadcast variable to "false". How can I achieve this?
My code is:
val conf = new SparkConf()
val sc = new SparkContext(conf)
var setCapture = true
sc.broadcast(setCapture)
val file = sc.textFile("file", 2)
val counts = file.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
val report = counts.collect()
setCapture = false
sc.broadcast(setCapture)
val packageResult = sc.parallelize(report).filter(_._1 == "package")
packageResult.collect.foreach(println)
And I want to access the broadcast variable "setCapture" in
org.apache.spark.scheduler.ResultTask,
org.apache.spark.rdd.HadoopRDD,
org.apache.spark.util.collection.ExternalAppendOnlyMap,
org.apache.spark.shuffle.hash.HashShuffleWriter.
What should I do?
From the Spark documentation:
Broadcast variables are created from a variable v by calling SparkContext.broadcast(v). The broadcast variable is a wrapper around v, and its value can be accessed by calling the value method. The code below shows this:
scala> val broadcastVar = sc.broadcast(Array(1, 2, 3))
broadcastVar: org.apache.spark.broadcast.Broadcast[Array[Int]] = Broadcast(0)
scala> broadcastVar.value
res0: Array[Int] = Array(1, 2, 3)
After the broadcast variable is created, it should be used instead of the value v in any functions run on the cluster so that v is not shipped to the nodes more than once. In addition, the object v should not be modified after it is broadcast in order to ensure that all nodes get the same value of the broadcast variable (e.g. if the variable is shipped to a new node later).
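In other words, the value returned by `sc.broadcast` must be kept and read with `.value` inside the closures that run on the executors; discarding the return value (as in the question's code) gives the executors nothing to read. Since a broadcast is immutable after creation, the second job needs a new broadcast rather than a reassigned variable. A minimal sketch of that pattern, reusing the input path `"file"` and the word `"package"` from the question:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object BroadcastExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("broadcast-example"))

    // Job 1: broadcast `true` and keep the handle so closures can read it.
    val captureOn = sc.broadcast(true)
    val counts = sc.textFile("file", 2)
      .flatMap(_.split(" "))
      .filter(_ => captureOn.value) // executor-side read via .value
      .map((_, 1))
      .reduceByKey(_ + _)
    val report = counts.collect()

    // Job 2: broadcasts are immutable, so create a new one for the new value.
    val captureOff = sc.broadcast(false)
    sc.parallelize(report)
      .filter { case (word, _) => word == "package" }
      .map { case (word, n) => (word, n, captureOff.value) } // executor-side read
      .collect()
      .foreach(println)

    sc.stop()
  }
}
```

Note that a broadcast created this way is only visible to your own code through the `Broadcast` handle; Spark's internal classes such as `ResultTask` or `HadoopRDD` do not expose user broadcasts, so the place to read the value is inside the RDD closures you write yourself.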