Election of new ZooKeeper leader shuts down the Spark Master

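I found that the Spark master becomes unresponsive when I kill the leader ZooKeeper node (I have, of course, delegated leader election to ZooKeeper). Below is the error log I see on the Spark master node. Do you have any suggestions for resolving this?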

> 15/06/22 10:44:00 INFO ClientCnxn: Unable to read additional data from server sessionid 0x14dd82e22f70ef1, likely server has closed socket, closing socket connection and attempting reconnect
>
> 15/06/22 10:44:00 INFO ClientCnxn: Unable to read additional data from server sessionid 0x24dc5a319b40090, likely server has closed socket, closing socket connection and attempting reconnect
>
> 15/06/22 10:44:01 INFO ConnectionStateManager: State change: SUSPENDED
>
> 15/06/22 10:44:01 INFO ConnectionStateManager: State change: SUSPENDED
>
> 15/06/22 10:44:01 WARN ConnectionStateManager: There are no ConnectionStateListeners registered.
>
> 15/06/22 10:44:01 INFO ZooKeeperLeaderElectionAgent: We have lost leadership
>
> 15/06/22 10:44:01 ERROR Master: Leadership has been revoked -- master shutting down.

This is the expected behavior. You have to set up 'n' masters, and you need to specify the ZooKeeper URL in the env.sh (conf/spark-env.sh) of every master:
SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER -Dspark.deploy.zookeeper.url=zk1:2181,zk2:2181"
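For reference, here is a slightly fuller sketch of that setting as it might appear in conf/spark-env.sh on each master. It assumes a three-node ensemble, so zk3 is a hypothetical third host not mentioned above, and it spells out spark.deploy.zookeeper.dir with its default value /spark purely for illustration.

```shell
# conf/spark-env.sh on every master.
# zk1..zk3 are placeholder host names; zk3 is an assumed third ZooKeeper node.
SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER \
  -Dspark.deploy.zookeeper.url=zk1:2181,zk2:2181,zk3:2181 \
  -Dspark.deploy.zookeeper.dir=/spark"
export SPARK_DAEMON_JAVA_OPTS
```

All masters should point at the same ZooKeeper ensemble and the same directory so that they take part in the same election.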

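Note that ZooKeeper relies on a quorum: a majority of its servers must be reachable. This is why you should run an odd number of ZooKeeper servers, and the ZooKeeper ensemble is only up while that quorum holds. Since Spark depends on ZooKeeper, the Spark cluster will not come up until the ZooKeeper quorum is in place.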
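The quorum arithmetic can be sketched in a few lines of shell. This is only an illustration of the majority rule; the function names quorum_size and tolerated_failures are made up for this example and are not part of ZooKeeper or Spark.

```shell
#!/bin/sh
# Majority quorum for an ensemble of n ZooKeeper servers: floor(n/2) + 1.
quorum_size() {
    echo $(( $1 / 2 + 1 ))
}

# Number of servers that can fail while a majority is still reachable.
tolerated_failures() {
    echo $(( $1 - ($1 / 2 + 1) ))
}

# Compare a few ensemble sizes.
for n in 2 3 4 5; do
    echo "servers=$n quorum=$(quorum_size "$n") tolerated_failures=$(tolerated_failures "$n")"
done
```

By this rule a two-node ensemble such as zk1,zk2 cannot lose either node, and four servers tolerate no more failures than three, which is the usual argument for odd ensemble sizes.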

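When you have set up two (n) masters and you bring down one ZooKeeper node, the current master shuts down, a new master is elected, and all workers attach to the new master.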

You should start your workers by passing the full list of masters:
./start-slave.sh spark://master1:port1,master2:port2

You need to wait 1-2 minutes to notice this failover.