Has the limitation of a single SparkContext actually been lifted in Spark 2.0?
There has been a lot of discussion about Spark 2.0 supporting multiple SparkContexts. The configuration property that is supposed to enable it has existed for a long time, but it does not actually take effect.
In $SPARK_HOME/conf/spark-defaults.conf:
spark.driver.allowMultipleContexts true
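For reference, the same key can also be set programmatically on a SparkConf before the first context is created. This is only a minimal sketch of that; the master, the app name, and the choice to set it in code rather than in spark-defaults.conf are my assumptions, not part of the original setup:

// Minimal sketch (assumption: setting the flag in code instead of in
// spark-defaults.conf); the property key is the same one shown above.
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setMaster("local[1]")             // hypothetical master, matching the POC below
  .setAppName("allow-multi-ctx")     // hypothetical app name
  .set("spark.driver.allowMultipleContexts", "true")

val sc = new SparkContext(conf)
println(sc.getConf.get("spark.driver.allowMultipleContexts"))  // expected to print "true"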
Let's verify that the property is recognized:
scala> println(s"allowMultiCtx = ${sc.getConf.get("spark.driver.allowMultipleContexts")}")
allowMultiCtx = true
Here is a small proof-of-concept program:
import org.apache.spark._
import org.apache.spark.streaming._

println(s"allowMultiCtx = ${sc.getConf.get("spark.driver.allowMultipleContexts")}")

// Create a separate SparkContext/StreamingContext per input directory
def createAndStartFileStream(dir: String) = {
  val sc = new SparkContext("local[1]", s"Spark-$dir" /*, conf */)
  val ssc = new StreamingContext(sc, Seconds(4))
  val dstream = ssc.textFileStream(dir)
  val valuesCounts = dstream.countByValue()
  ssc.start()
  ssc.awaitTermination()
}

val dirs = Seq("data10m", "data50m", "dataSmall").map { d =>
  s"/shared/demo/data/$d"
}
dirs.foreach { d =>
  createAndStartFileStream(d)
}
But the attempt to use the feature was unsuccessful:
16/08/14 11:38:55 WARN SparkContext: Multiple running SparkContexts detected
in the same JVM!
org.apache.spark.SparkException: Only one SparkContext may be running in
this JVM (see SPARK-2243). To ignore this error,
set spark.driver.allowMultipleContexts = true.
The currently running SparkContext was created at:
org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:814)
org.apache.spark.repl.Main$.createSparkSession(Main.scala:95)
Does anyone know how to use multiple contexts?
According to @LostInOverflow, this feature will not be fixed. Here is the information from that JIRA:
SPARK-2243 Support multiple SparkContexts in the same JVM
https://issues.apache.org/jira/browse/SPARK-2243
Sean Owen added a comment - 16/Jan/16 17:35 You say you're concerned
with over-utilizing a cluster for steps that don't require much
resource. This is what dynamic allocation is for: the number of
executors increases and decreases with load. If one context is already
using all cluster resources, yes, that doesn't do anything. But then,
neither does a second context; the cluster is already fully used. I
don't know what overhead you're referring to, but certainly one
context running N jobs is busier than N contexts running N jobs. Its
overhead is higher, but the total overhead is lower. This is more an
effect than a cause that would make you choose one architecture over
another. Generally, Spark has always assumed one context per JVM and I
don't see that changing, which is why I finally closed this. I don't
see any support for making this happen.
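To illustrate the "one context running N jobs" point from the comment, here is a hedged sketch of how the same three directories from the POC above could be watched with a single SparkContext and a single StreamingContext. The output action (print) and the use of local[*] are my own assumptions; this is just one way to structure it, not an official workaround.

import org.apache.spark.SparkContext
import org.apache.spark.streaming.{Seconds, StreamingContext}

// One driver-side context pair shared by all input streams
val sc  = new SparkContext("local[*]", "Spark-multi-dirs")
val ssc = new StreamingContext(sc, Seconds(4))

val dirs = Seq("data10m", "data50m", "dataSmall").map(d => s"/shared/demo/data/$d")

// Each directory gets its own DStream, but all of them share the single context
dirs.foreach { dir =>
  val valuesCounts = ssc.textFileStream(dir).countByValue()
  valuesCounts.print()   // assumed output action, just to make the job runnable
}

ssc.start()
ssc.awaitTermination()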