Has the limitation of a single SparkContext actually been lifted in Spark 2.0?
There has been a lot of discussion about Spark 2.0 supporting multiple SparkContexts. The configuration property that is supposed to enable it has existed for a long time, but it does not actually take effect.
In $SPARK_HOME/conf/spark-defaults.conf:
spark.driver.allowMultipleContexts true
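For reference, the same key can also be set programmatically on a SparkConf before the first context is created. This is only a minimal sketch of that; the master, the app name, and the choice to set it in code rather than in spark-defaults.conf are my assumptions, not part of the original setup:

// Minimal sketch (assumption: setting the flag in code instead of in
// spark-defaults.conf); the property key is the same one shown above.
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setMaster("local[1]")             // hypothetical master, matching the POC below
  .setAppName("allow-multi-ctx")     // hypothetical app name
  .set("spark.driver.allowMultipleContexts", "true")

val sc = new SparkContext(conf)
println(sc.getConf.get("spark.driver.allowMultipleContexts"))  // expected to print "true"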
Let's verify that the property is recognized:
scala> println(s"allowMultiCtx = ${sc.getConf.get("spark.driver.allowMultipleContexts")}")
allowMultiCtx = true
Here is a small proof-of-concept program:
import org.apache.spark._
import org.apache.spark.streaming._

println(s"allowMultiCtx = ${sc.getConf.get("spark.driver.allowMultipleContexts")}")

// Create a separate SparkContext/StreamingContext per input directory
def createAndStartFileStream(dir: String) = {
  val sc = new SparkContext("local[1]", s"Spark-$dir" /*, conf */)
  val ssc = new StreamingContext(sc, Seconds(4))
  val dstream = ssc.textFileStream(dir)
  val valuesCounts = dstream.countByValue()
  ssc.start()
  ssc.awaitTermination()
}

val dirs = Seq("data10m", "data50m", "dataSmall").map { d =>
  s"/shared/demo/data/$d"
}
dirs.foreach { d =>
  createAndStartFileStream(d)
}
But the attempt to use the feature was unsuccessful:
16/08/14 11:38:55 WARN SparkContext: Multiple running SparkContexts detected
in the same JVM!
org.apache.spark.SparkException: Only one SparkContext may be running in
this JVM (see SPARK-2243). To ignore this error,
set spark.driver.allowMultipleContexts = true.
The currently running SparkContext was created at:
org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:814)
org.apache.spark.repl.Main$.createSparkSession(Main.scala:95)
Does anyone know how to use multiple contexts?
According to @LostInOverflow, this feature will not be fixed. Here is the information from that JIRA:
SPARK-2243 Support multiple SparkContexts in the same JVM
https://issues.apache.org/jira/browse/SPARK-2243
Sean Owen added a comment - 16/Jan/16 17:35 You say you're concerned
with over-utilizing a cluster for steps that don't require much
resource. This is what dynamic allocation is for: the number of
executors increases and decreases with load. If one context is already
using all cluster resources, yes, that doesn't do anything. But then,
neither does a second context; the cluster is already fully used. I
don't know what overhead you're referring to, but certainly one
context running N jobs is busier than N contexts running N jobs. Its
overhead is higher, but the total overhead is lower. This is more an
effect than a cause that would make you choose one architecture over
another. Generally, Spark has always assumed one context per JVM and I
don't see that changing, which is why I finally closed this. I don't
see any support for making this happen.
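To illustrate the "one context running N jobs" point from the comment, here is a hedged sketch of how the same three directories from the POC above could be watched with a single SparkContext and a single StreamingContext. The output action (print) and the use of local[*] are my own assumptions; this is just one way to structure it, not an official workaround.

import org.apache.spark.SparkContext
import org.apache.spark.streaming.{Seconds, StreamingContext}

// One driver-side context pair shared by all input streams
val sc  = new SparkContext("local[*]", "Spark-multi-dirs")
val ssc = new StreamingContext(sc, Seconds(4))

val dirs = Seq("data10m", "data50m", "dataSmall").map(d => s"/shared/demo/data/$d")

// Each directory gets its own DStream, but all of them share the single context
dirs.foreach { dir =>
  val valuesCounts = ssc.textFileStream(dir).countByValue()
  valuesCounts.print()   // assumed output action, just to make the job runnable
}

ssc.start()
ssc.awaitTermination()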