What conditions cause a Marathon leader election?

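I am using Mesos and Marathon to manage application deployments, and I am running into the bug described in Marathon issue https://github.com/mesosphere/marathon/issues/3783: a leader election during a deployment scales the instances down to 0. Leader elections happen very frequently (roughly every 30 minutes), so I hit this problem often.

I know that every 30 minutes is highly irregular, because since upgrading to Marathon 1.3.10 I have not had a single election in the past 2 days. But what is "normal"? Do leader abdications/elections occur under normal circumstances, or should I expect 0 elections unless there is an underlying problem? A coworker suggested to me that "leader elections are normal" and that a "certain number of elections are normal and to be expected". I just don't believe that, and would like to know for sure.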

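If your Marathon re-elects a leader every 30 minutes, that is not normal. Under normal conditions, Marathon should not abdicate or elect a new leader until maintenance (an update or a restart) takes place. If it happens anyway, it can be caused by 4 main issues (all of them result in timeouts):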

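Marathon performance: when Marathon has performance problems, one of the symptoms is loss of leadership. This happens because Marathon does not respond to Zookeeper within the given time interval and is marked as gone.
Marathon-Zookeeper connection problems: when network latency is too high (e.g., the Zookeeper cluster is in a different DC than Marathon), some updates may time out. This results in loss of leadership.
Zookeeper performance: when Zookeeper has a lot of work to do, it times out some requests, causing Marathon to lose leadership.
Forced abdication: Marathon was made to abdicate with DELETE /v2/leader.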
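To find out how often elections actually happen, one option is to poll the same endpoint with GET /v2/leader and count how often the reported leader changes. The following is only a minimal sketch: it assumes the endpoint returns JSON with a `leader` field, and the function names, polling interval, and base URL are illustrative assumptions rather than part of Marathon.

```python
import json
import time
import urllib.request


def fetch_leader(base_url, timeout=5.0):
    """Return the leader reported by GET /v2/leader, or None if the call fails."""
    try:
        with urllib.request.urlopen(f"{base_url}/v2/leader", timeout=timeout) as resp:
            return json.load(resp).get("leader")
    except (OSError, ValueError):
        # Network errors, HTTP errors (e.g. no leader elected) and bad JSON
        # are all treated as "leader unknown" for this poll.
        return None


def count_leader_changes(observations):
    """Count transitions between distinct known leaders; failed polls (None) are skipped."""
    changes = 0
    previous = None
    for leader in observations:
        if leader is None:
            continue
        if previous is not None and leader != previous:
            changes += 1
        previous = leader
    return changes


def watch(base_url, interval_s=10.0, polls=6):
    """Poll the leader endpoint a fixed number of times and return the change count."""
    observations = []
    for i in range(polls):
        leader = fetch_leader(base_url)
        print(time.strftime("%H:%M:%S"), leader)
        observations.append(leader)
        if i < polls - 1:
            time.sleep(interval_s)
    return count_leader_changes(observations)
```

Calling `watch("http://marathon.example.com:8080")` (a hypothetical address) over a long enough window should return 0 on a healthy cluster that is not under maintenance. Note that a leader that is lost and then re-elected between two polls will not be counted.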

To fix the performance problems, follow the steps below:

Shard your Marathon.

Monitor — enable metrics but remember to configure them.

Update to 1.3.10 or later.

Minimize Zookeeper communication latency and object size.

Tune JVM — add more heap and CPUs :).

Do not use the event bus — if you really need to, use filtered SSE, and accept it is asynchronous and events are delivered at most once.

If you need task life cycle events, use a custom executor.

Prefer batch deployments to many individual ones.
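To check whether Zookeeper latency is a likely culprit, one rough approach is to read the server's own statistics through the `mntr` four-letter command. The sketch below assumes that command is enabled on the server (newer Zookeeper versions require it to be whitelisted) and that the output contains a `zk_avg_latency` line; the function names, host, and the 50 ms threshold are illustrative assumptions, not recommended values.

```python
import socket


def send_four_letter_word(host, port=2181, command="mntr", timeout=5.0):
    """Send a four-letter command to a Zookeeper server and return the raw reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(command.encode("ascii"))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # the server closes the connection after replying
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def parse_mntr(text):
    """Parse tab-separated 'key<TAB>value' lines into a dict of strings."""
    stats = {}
    for line in text.splitlines():
        key, sep, value = line.partition("\t")
        if sep:
            stats[key.strip()] = value.strip()
    return stats


def avg_latency_exceeds(stats, threshold_ms=50.0):
    """Return True if the reported average latency is above the threshold.

    The threshold is an arbitrary example value; pick one that fits your
    own session timeout settings.
    """
    value = stats.get("zk_avg_latency")
    return value is not None and float(value) > threshold_ms
```

For example, `avg_latency_exceeds(parse_mntr(send_four_letter_word("zk1.example.com")))` (hypothetical host) gives a quick yes/no signal that can feed into whatever monitoring you already run.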

marathon

mesos

mesosphere

apache-zookeeper