Spark / Mesos / Tasks lost, slaves blacklisted, executors removed

Question

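I am running a spark-submit job written in Scala 2.11.11 on Spark 2.2.0, built with SBT, on Mesos 1.4.2.

I am having problems with lost tasks and executors that never register. These are the symptoms:

MesosCoarseGrainedSchedulerBackend launches tasks until spark.cores.max is reached. For example, here it launches 6 tasks: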

18/06/11 12:49:54 DEBUG MesosCoarseGrainedSchedulerBackend: Received 2 resource offers.
18/06/11 12:49:55 INFO MesosCoarseGrainedSchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.0
18/06/11 12:49:55 DEBUG MesosCoarseGrainedSchedulerBackend: Accepting offer: a6031461-f185-424d-940e-b45fb64a2aaf-O585462 with attributes: Map() mem: 423417.0 cpu: 55.5 ports: List((1025,2180), (2182,3887), (3889,5049), (5052,5507), (5509,8079), (8082,8180), (8182,8792), (8794,9177), (9179,12396), (12398,16297), (16299,16839), (16841,18310), (18312,21795), (21797,22269), (22271,32000)).  Launching 2 Mesos tasks.
18/06/11 12:49:55 DEBUG MesosCoarseGrainedSchedulerBackend: Launching Mesos task: 2 with mem: 11264.0 cpu: 20.0 ports: 
18/06/11 12:49:55 DEBUG MesosCoarseGrainedSchedulerBackend: Launching Mesos task: 0 with mem: 11264.0 cpu: 20.0 ports: 
18/06/11 12:49:55 DEBUG MesosCoarseGrainedSchedulerBackend: Accepting offer: a6031461-f185-424d-940e-b45fb64a2aaf-O585463 with attributes: Map() mem: 300665.0 cpu: 71.5 ports: List((1025,2180), (2182,2718), (2721,3887), (3889,5049), (5052,5455), (5457,8079), (8082,8180), (8182,8262), (8264,8558), (8560,8792), (8794,10231), (10233,16506), (16508,18593), (18595,32000)).  Launching 3 Mesos tasks.
18/06/11 12:49:55 DEBUG MesosCoarseGrainedSchedulerBackend: Launching Mesos task: 4 with mem: 11264.0 cpu: 20.0 ports: 
18/06/11 12:49:55 DEBUG MesosCoarseGrainedSchedulerBackend: Launching Mesos task: 3 with mem: 11264.0 cpu: 20.0 ports: 
18/06/11 12:49:55 DEBUG MesosCoarseGrainedSchedulerBackend: Launching Mesos task: 1 with mem: 11264.0 cpu: 20.0 ports: 
18/06/11 12:49:55 DEBUG MesosCoarseGrainedSchedulerBackend: Received 2 resource offers.
18/06/11 12:49:55 DEBUG MesosCoarseGrainedSchedulerBackend: Accepting offer: a6031461-f185-424d-940e-b45fb64a2aaf-O585464 with attributes: Map() mem: 423417.0 cpu: 55.5 ports: List((1025,2180), (2182,3887), (3889,5049), (5052,5507), (5509,8079), (8082,8180), (8182,8792), (8794,9177), (9179,12396), (12398,16297), (16299,16839), (16841,18310), (18312,21795), (21797,22269), (22271,32000)).  Launching 1 Mesos tasks.
18/06/11 12:49:55 DEBUG MesosCoarseGrainedSchedulerBackend: Launching Mesos task: 5 with mem: 11264.0 cpu: 20.0 ports: 
18/06/11 12:49:55 DEBUG MesosCoarseGrainedSchedulerBackend: Declining offer: a6031461-f185-424d-940e-b45fb64a2aaf-O585465 with attributes: Map() mem: 300665.0 cpu: 71.5 port: List((1025,2180), (2182,2718), (2721,3887), (3889,5049), (5052,5455), (5457,8079), (8082,8180), (8182,8262), (8264,8558), (8560,8792), (8794,10231), (10233,16506), (16508,18593), (18595,32000)) for 120 seconds  (reason: reached spark.cores.max)

Immediately afterwards it starts losing tasks and blacklisting slaves, even though I have set spark.blacklist.enabled=false:

18/06/11 12:49:55 INFO MesosCoarseGrainedSchedulerBackend: Mesos task 2 is now TASK_LOST
18/06/11 12:49:55 INFO MesosCoarseGrainedSchedulerBackend: Mesos task 0 is now TASK_LOST
18/06/11 12:49:55 INFO MesosCoarseGrainedSchedulerBackend: Blacklisting Mesos slave a6031461-f185-424d-940e-b45fb64a2aaf-S0 due to too many failures; is Spark installed on it?
18/06/11 12:49:55 INFO MesosCoarseGrainedSchedulerBackend: Mesos task 4 is now TASK_LOST
18/06/11 12:49:55 INFO MesosCoarseGrainedSchedulerBackend: Mesos task 3 is now TASK_LOST
18/06/11 12:49:55 INFO MesosCoarseGrainedSchedulerBackend: Blacklisting Mesos slave a6031461-f185-424d-940e-b45fb64a2aaf-S1 due to too many failures; is Spark installed on it?
18/06/11 12:49:55 INFO MesosCoarseGrainedSchedulerBackend: Mesos task 1 is now TASK_LOST
18/06/11 12:49:55 INFO MesosCoarseGrainedSchedulerBackend: Blacklisting Mesos slave a6031461-f185-424d-940e-b45fb64a2aaf-S1 due to too many failures; is Spark installed on it?

After that, non-existent executors are removed:

18/06/11 12:49:56 DEBUG MesosCoarseGrainedSchedulerBackend: Received 2 resource offers.
18/06/11 12:49:56 DEBUG CoarseGrainedSchedulerBackend$DriverEndpoint: Asked to remove executor 2 with reason Executor finished with state LOST
18/06/11 12:49:56 INFO BlockManagerMaster: Removal of executor 2 requested
18/06/11 12:49:56 DEBUG MesosCoarseGrainedSchedulerBackend: Declining offer: a6031461-f185-424d-940e-b45fb64a2aaf-O585466 with attributes: Map() mem: 300665.0 cpu: 71.5 port: List((1025,2180), (2182,2718), (2721,3887), (3889,5049), (5052,5455), (5457,8079), (8082,8180), (8182,8262), (8264,8558), (8560,8792), (8794,10231), (10233,16506), (16508,18593), (18595,32000)) 
18/06/11 12:49:56 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asked to remove non-existent executor 2
18/06/11 12:49:56 DEBUG MesosCoarseGrainedSchedulerBackend: Declining offer: a6031461-f185-424d-940e-b45fb64a2aaf-O585467 with attributes: Map() mem: 412153.0 cpu: 35.5 port: List((1025,2180), (2182,3887), (3889,5049), (5052,5507), (5509,8079), (8082,8180), (8182,8792), (8794,9177), (9179,12396), (12398,16297), (16299,16839), (16841,18310), (18312,21795), (21797,22269), (22271,32000)) 
18/06/11 12:49:56 DEBUG CoarseGrainedSchedulerBackend$DriverEndpoint: Asked to remove executor 0 with reason Executor finished with state LOST
18/06/11 12:49:56 INFO BlockManagerMaster: Removal of executor 0 requested
18/06/11 12:49:56 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asked to remove non-existent executor 0
18/06/11 12:49:56 DEBUG CoarseGrainedSchedulerBackend$DriverEndpoint: Asked to remove executor 4 with reason Executor finished with state LOST
18/06/11 12:49:59 INFO BlockManagerMaster: Removal of executor 4 requested
18/06/11 12:49:59 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asked to remove non-existent executor 4
18/06/11 12:49:59 DEBUG CoarseGrainedSchedulerBackend$DriverEndpoint: Asked to remove executor 3 with reason Executor finished with state LOST
18/06/11 12:49:59 INFO BlockManagerMaster: Removal of executor 3 requested
18/06/11 12:49:59 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asked to remove non-existent executor 3
18/06/11 12:49:59 DEBUG CoarseGrainedSchedulerBackend$DriverEndpoint: Asked to remove executor 1 with reason Executor finished with state LOST
18/06/11 12:49:59 INFO BlockManagerMaster: Removal of executor 1 requested
18/06/11 12:49:59 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asked to remove non-existent executor 1
18/06/11 12:49:59 INFO MesosCoarseGrainedSchedulerBackend: Mesos task 5 is now TASK_RUNNING
18/06/11 12:49:59 INFO BlockManagerMasterEndpoint: Trying to remove executor 2 from BlockManagerMaster.
18/06/11 12:49:59 INFO BlockManagerMasterEndpoint: Trying to remove executor 0 from BlockManagerMaster.
18/06/11 12:49:59 INFO BlockManagerMasterEndpoint: Trying to remove executor 4 from BlockManagerMaster.
18/06/11 12:49:59 INFO BlockManagerMasterEndpoint: Trying to remove executor 3 from BlockManagerMaster.
18/06/11 12:49:59 INFO BlockManagerMasterEndpoint: Trying to remove executor 1 from BlockManagerMaster.

Note, however, that the single task 5 was not lost and executor 5 was not removed:

18/06/11 12:49:59 INFO MesosCoarseGrainedSchedulerBackend: Mesos task 5 is now TASK_RUNNING
18/06/11 12:50:01 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (SlaveIp:46884) with ID 5
18/06/11 12:50:01 INFO BlockManagerMasterEndpoint: Registering block manager SlaveIP:32840 with 5.2 GB RAM, BlockManagerId(5, SlaveIP, 32840, None)

This is my SparkSession setup:

import org.apache.spark.sql.SparkSession

// numPartitionsShuffle is defined elsewhere in the application.
val spark = SparkSession.builder
  .config("spark.executor.cores", 20)
  .config("spark.executor.memory", "10g")
  .config("spark.sql.shuffle.partitions", numPartitionsShuffle)
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .config("spark.network.timeout", "1200s")
  .config("spark.blacklist.enabled", false)
  .config("spark.blacklist.maxFailedTaskPerExecutor", 100)
  .config("spark.dynamicAllocation.enabled", false)
  .getOrCreate()

This is my spark-submit script:

spark-submit \
  --class MyMainClass \
  --master mesos://masterIP:7077 \
  --total-executor-cores 120 \
  --driver-memory 200g \
  --deploy-mode cluster \
  --name MyMainClass \
  --conf "spark.shuffle.service.enabled=false" \
  --conf "spark.dynamicAllocation.enabled=false" \
  --conf "spark.blacklist.enabled=false" \
  --conf "spark.blacklist.maxFailedTaskPerExecutor=100" \
  --verbose \
  myJar-assembly-0.1.0-SNAPSHOT.jar

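Notes:

I have noticed that if I wait a while between runs, the job usually runs fine. If I launch jobs in quick succession, or right after killing a previous job, the problem above appears.
My cluster has enough resources to run these tasks.
I am duplicating the settings in SparkSession and in spark-submit because the precedence of config versus --conf does not always seem clear.
Running in non-dynamic mode is important.
The lost executors are
I compared the debug logs with those from our retired but still live cluster installation based on Spark 2.0.1. There, the exact same code launches tasks that immediately reach the TASK_RUNNING state.
My Google and Stack Overflow searches turned up nothing useful.
The settings spark.blacklist.maxFailedTaskPerExecutor and spark.blacklist.enabled seem to have no effect.
Related unanswered question: Spark on Mesos (DC/OS) loses tasks before doing anything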
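
On the precedence point in the notes: the Spark configuration documentation states that properties set directly on the SparkConf take highest precedence, then flags passed to spark-submit, then values in spark-defaults.conf. The following is only a sketch of that rule using plain maps; the object and helper names are hypothetical and are not part of the Spark API, and the example values are illustrative.

```scala
// Hypothetical illustration of the documented precedence order; not Spark code.
object ConfPrecedenceSketch {
  // With ++, the right-hand map wins on key collisions, so the sources are
  // combined from lowest to highest precedence.
  def effective(defaultsFile: Map[String, String],
                submitFlags: Map[String, String],
                sparkConfInCode: Map[String, String]): Map[String, String] =
    defaultsFile ++ submitFlags ++ sparkConfInCode

  def main(args: Array[String]): Unit = {
    val merged = effective(
      defaultsFile = Map("spark.blacklist.enabled" -> "true"),
      submitFlags = Map("spark.blacklist.enabled" -> "false", "spark.cores.max" -> "120"),
      sparkConfInCode = Map("spark.executor.cores" -> "20"))
    // Print the resolved settings in a stable order.
    merged.toSeq.sorted.foreach { case (k, v) => println(s"$k=$v") }
  }
}
```

In a running application, the values that were actually applied can be inspected with spark.sparkContext.getConf.toDebugString, which avoids guessing which source won.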

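I have absolutely no idea what is going on here.

Questions:

Do you need more information to help me diagnose this?
Why does the job lose most of its tasks as soon as it starts? I have looked at Task Reasons, but none of the reasons seems to explain it.
Why does it say that it was asked to remove a non-existent executor?
In which direction should I look?
Is this related to the previous job being killed without waiting long enough before starting the next one?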

Answer 1

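I am answering my own question:

We found that our problem was twofold.

Some unidentified problem in the communication/connection between master and worker caused Mesos tasks (executors) to be lost. Nothing in the logs explains this problem.
Whenever a worker loses at least 2 Mesos tasks, it is blacklisted. In Spark 2.2 the limit of 2 is hard-coded and cannot be changed. For details see: Blacklist is always active for MesosCoarseGrainedSchedulerBackend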

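Result:

Sometimes the communication problem did not occur and the job ran normally.
Most of the time, all executors were lost right at the start of the job. With 2 workers in our cluster, we could run only 3 executors at a time. At the start of the job all executors (2 on worker1 and 1 on worker2) would be lost, but only worker1 would be blacklisted; the lost executors would then be relaunched on worker2 and keep running without problems.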
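
The blacklisting rule described above can be pictured as a per-worker counter of lost tasks. The sketch below is a simplified model of that behaviour, not Spark's actual source; the object name, method name, and constant name are hypothetical, and only the threshold of 2 comes from the answer.

```scala
// Hypothetical model of the behaviour described above; not Spark's actual source.
object BlacklistSketch {
  // Threshold reported in the answer as hard-coded in Spark 2.2.
  val MaxSlaveFailures = 2

  // Counts lost tasks per worker and returns the workers at or above the threshold.
  def blacklisted(lostTasksByWorker: Seq[String]): Set[String] =
    lostTasksByWorker
      .groupBy(identity)
      .collect { case (worker, losses) if losses.size >= MaxSlaveFailures => worker }
      .toSet

  def main(args: Array[String]): Unit = {
    // Two executors lost on worker1 and one on worker2, as in the scenario above.
    val lost = Seq("worker1", "worker1", "worker2")
    println(blacklisted(lost)) // prints Set(worker1)
  }
}
```

Under this model, worker2 stays eligible because it lost only one task, which matches the observation that the lost executors were relaunched there.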

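Solution:

I am not sure this is a general solution to this problem, but we searched somewhat blindly for configuration options that govern the various Mesos timeout mechanisms, and we came across this bug in Mesos 1.4: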

Using a failoverTimeout of 0 with Mesos native scheduler client can result in infinite subscribe loop

As a test, we set spark.mesos.driver.failoverTimeout=1.0 in the SparkSession configuration. This seems to have solved our problem: we no longer lose our executors at the start of the job.
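
For reference, a minimal sketch of how that setting might be applied in the SparkSession builder. Only the property name and the value 1.0 come from the test described above; everything else is reduced to the bare minimum, and the other settings from the question would be added alongside it. The value is understood to be in seconds.

```scala
import org.apache.spark.sql.SparkSession

// Sketch only: this is the workaround found by testing, not a confirmed general fix.
val spark = SparkSession.builder
  // Non-zero failover timeout, to stay clear of the Mesos 1.4 bug quoted above.
  .config("spark.mesos.driver.failoverTimeout", "1.0")
  .getOrCreate()
```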
