getOrCreate deployment failing randomly

When calling H2OContext.getOrCreate with a valid SparkContext, we keep seeing the deployment fail at random:

17/04/21 17:21:32 ERROR TaskSchedulerImpl: Lost executor 0 on 172.17.0.4: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/04/21 17:21:38 ERROR LiveListenerBus: Listener ExecutorAddNotSupportedListener threw an exception
java.lang.IllegalArgumentException: Executor without H2O instance discovered, killing the cloud!
    at org.apache.spark.listeners.ExecutorAddNotSupportedListener.onExecutorAdded(H2OSparkListener.scala:27)
    at org.apache.spark.scheduler.SparkListenerBus$class.doPostEvent(SparkListenerBus.scala:61)
    at org.apache.spark.scheduler.LiveListenerBus.doPostEvent(LiveListenerBus.scala:36)
    at org.apache.spark.scheduler.LiveListenerBus.doPostEvent(LiveListenerBus.scala:36)
    at org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:63)
    at org.apache.spark.scheduler.LiveListenerBus.postToAll(LiveListenerBus.scala:36)
    at org.apache.spark.scheduler.LiveListenerBus$$anon$$anonfun$run$$anonfun$apply$mcV$sp.apply$mcV$sp(LiveListenerBus.scala:94)
    at org.apache.spark.scheduler.LiveListenerBus$$anon$$anonfun$run$$anonfun$apply$mcV$sp.apply(LiveListenerBus.scala:79)
    at org.apache.spark.scheduler.LiveListenerBus$$anon$$anonfun$run$$anonfun$apply$mcV$sp.apply(LiveListenerBus.scala:79)
    at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
    at org.apache.spark.scheduler.LiveListenerBus$$anon$$anonfun$run.apply$mcV$sp(LiveListenerBus.scala:78)
    at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1252)
    at org.apache.spark.scheduler.LiveListenerBus$$anon.run(LiveListenerBus.scala:77) 

The call to H2OContext.getOrCreate that triggers the error:

Context.spark_session = SparkSession.builder.getOrCreate()
Context.h2o_context = H2OContext.getOrCreate(Context.spark_session)

Any ideas, H2O crew?

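This is currently known behavior of the Sparkling Water internal backend: when an executor is lost and Spark adds a replacement that has no H2O instance, the listener kills the H2O cloud. To avoid this, you can use the external Sparkling Water backend. More information is available here: https://github.com/h2oai/sparkling-water/blob/master/doc/backends.md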
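
As a rough sketch of what switching backends might look like on the command line, the snippet below renders Spark properties as spark-submit `--conf` arguments. The property name `spark.ext.h2o.backend.cluster.mode` is an assumption based on the backends document linked above, so check it against the Sparkling Water version you run; the helper `to_spark_submit_args` is a hypothetical name used only for illustration.

```python
# Assumption: property name as described in the Sparkling Water backends
# document; verify it for your Sparkling Water version.
EXTERNAL_BACKEND_CONF = {
    "spark.ext.h2o.backend.cluster.mode": "external",
}


def to_spark_submit_args(conf):
    """Render a dict of Spark properties as spark-submit --conf arguments."""
    args = []
    for key, value in sorted(conf.items()):
        args.extend(["--conf", f"{key}={value}"])
    return args


if __name__ == "__main__":
    # Print the flags so they can be pasted into a spark-submit command.
    print(" ".join(to_spark_submit_args(EXTERNAL_BACKEND_CONF)))
```

The external backend needs further settings beyond this one switch (for example, how the H2O cluster is started and located), which the linked document covers.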

I am currently working on a JIRA that should also eliminate the behavior described above. It is still in progress; you can track its status here: https://0xdata.atlassian.net/browse/SW-369