How does a standby ResourceManager (RM) become active when RM HA is enabled in Hadoop?

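I have gone through the documentation for RM HA. I understand the basics, except for one key part: when the active RM fails, how do the standby RMs know that one of them needs to take over? Here is the relevant part of the documentation: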

The ZooKeeper state store achieves this implicit fencing through ACLs. All the ResourceManagers have shared read-write-admin access to the store, but only the Active has create-delete access. A ResourceManager claims this create-delete access while transitioning to Active. At this point, any other ResourceManager that previously had create-delete access loses access, fails to make changes to the store, and transitions itself to Standby. By having each ResourceManager create a dummy znode every so often (10 seconds, by default), a ResourceManager is always informed of its access to the store.

Does this mean that all RMs periodically send messages to ZooKeeper to gain access to the ZKResourceManagerStateStore, and that whichever RM obtains create-delete access takes on the Active role?

Update: I found this great article that explains in detail how RM HA works. Leaving it here for reference.

ZooKeeper is well known for coordination. Since you have read the documentation linked in your question, I assume you have also read the Automatic Failover section.

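ZooKeeper maintains Active-Standby high availability through an epoch number strategy. Both RMs take part in leader election, but only the one with the smallest epoch number is elected leader. Unlike NameNodes, RMs do not need a separate ZooKeeper Failover Controller: by default, the ActiveStandbyElector is embedded in the RM.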
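To make the election step more concrete, here is a minimal sketch in Python. It is a toy in-memory model, not the Hadoop implementation and not a real ZooKeeper client. It assumes, as far as I understand the embedded elector, that the RM which manages to create a single ephemeral lock znode becomes Active, while the others watch that znode and retry when it disappears. The names `ToyZooKeeper`, `ToyResourceManager`, `try_create_lock`, `watch_lock` and `expire_session` are all hypothetical.

```python
class ToyZooKeeper:
    """In-memory stand-in for a ZooKeeper ensemble (hypothetical, not a real client)."""

    def __init__(self):
        self.lock_owner = None  # session that holds the ephemeral lock znode
        self.watchers = []      # callbacks fired once when the lock znode is deleted

    def try_create_lock(self, session_id):
        """Create the ephemeral lock znode if it is absent; return True on success."""
        if self.lock_owner is None:
            self.lock_owner = session_id
            return True
        return False

    def watch_lock(self, callback):
        """Register a one-shot watch on the lock znode."""
        self.watchers.append(callback)

    def expire_session(self, session_id):
        """Simulate a crashed RM: its session ends, so its ephemeral znode is deleted."""
        if self.lock_owner == session_id:
            self.lock_owner = None
            pending, self.watchers = self.watchers, []
            for callback in pending:
                callback()


class ToyResourceManager:
    def __init__(self, rm_id, zk):
        self.rm_id = rm_id
        self.zk = zk
        self.state = "STANDBY"

    def join_election(self):
        """Try to take the lock; on failure stay Standby and wait for the next chance."""
        if self.zk.try_create_lock(self.rm_id):
            self.state = "ACTIVE"
        else:
            self.state = "STANDBY"
            self.zk.watch_lock(self.join_election)


zk = ToyZooKeeper()
rm1 = ToyResourceManager("rm1", zk)
rm2 = ToyResourceManager("rm2", zk)
rm1.join_election()
rm2.join_election()
print(rm1.state, rm2.state)  # rm1 won the lock, rm2 is watching it

zk.expire_session("rm1")     # rm1 dies; the watch fires and rm2 retries
print(zk.lock_owner, rm2.state)
```

The point of the sketch is that the standby does not poll the active RM directly. It learns about the failure because ZooKeeper removes the dead session's znode and notifies the watcher.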

That is why the leader keeps writing data to ZooKeeper: when a write fails, ZooKeeper considers the Active RM unresponsive and lets another RM become the new leader.
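The write-based check can be sketched the same way. The following is a toy Python model of the implicit fencing described in the documentation quoted in the question: claiming create-delete access revokes it from the previous holder, and an RM that fails its periodic dummy write steps down to Standby. The classes `ToyStateStore` and `ToyRM` and their methods are hypothetical and only illustrate the idea. A real store enforces this through ZooKeeper ACLs.

```python
class ToyStateStore:
    """In-memory stand-in for the ZooKeeper state store (hypothetical)."""

    def __init__(self):
        self.writer = None  # RM that currently holds create-delete access

    def claim_write_access(self, rm_id):
        # Claiming access implicitly revokes it from the previous holder.
        self.writer = rm_id

    def create_dummy_znode(self, rm_id):
        # Only the current holder of create-delete access may write.
        if self.writer != rm_id:
            raise PermissionError(f"{rm_id} has no create-delete access")


class ToyRM:
    def __init__(self, rm_id, store):
        self.rm_id = rm_id
        self.store = store
        self.state = "STANDBY"

    def transition_to_active(self):
        self.store.claim_write_access(self.rm_id)
        self.state = "ACTIVE"

    def verify_access(self):
        """Periodic probe (every 10 seconds by default, per the quoted docs)."""
        if self.state != "ACTIVE":
            return
        try:
            self.store.create_dummy_znode(self.rm_id)
        except PermissionError:
            # Lost access: another RM has taken over, so step down.
            self.state = "STANDBY"


store = ToyStateStore()
rm1, rm2 = ToyRM("rm1", store), ToyRM("rm2", store)
rm1.transition_to_active()
rm1.verify_access()          # still has access, stays Active
rm2.transition_to_active()   # rm2 takes over and fences rm1
rm1.verify_access()          # rm1's next probe fails, so it steps down
print(rm1.state, rm2.state)
```

In this model the old Active never needs to be told that it was replaced. It finds out on its own the next time its probe write is rejected.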