Why isn't curator recovering when zookeeper is back online?

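I have a CuratorFramework client (v5.1.0) running against a Zookeeper server (v3.7.0). If the Zookeeper server goes down while the client is connected to it, I can see the connection state (with a ConnectionStateListener) go to SUSPENDED, then LOST, and then nothing more when the server comes back online.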

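This feels like a very standard use case, so I must be missing something silly, but I can never get the client to connect again once the server is back online.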

I did some googling but found nothing useful about how to handle recovery after the LOST state.

I have a self-contained example of what I am doing, with the example code in the CuratorRecoveryTest class (run in the IDE, not maven). The heart of it is (extracted from the test class):

// setup the server and client
server = new TestingServer();

client = newClient(server.getConnectString(), 60000, 15000, new RetryNTimes(1, 250));
client.start();
client.blockUntilConnected();
            
// add the listener
final var stateListener = new StateListener();
stateListener.stateChanged(client, CONNECTED);

// register the listener
client.getConnectionStateListenable().addListener(stateListener);

// verify connection
assertTrue(client.getZookeeperClient().isConnected());

// let things settle
nap(3, "initial settling");

// stop zk
stopServer();
log.info(">>>>>>>>>> STOPPED ZK SERVER");

// let it bake
nap(3, "letting things bake");

// ensure disconnected
assertFalse(client.getZookeeperClient().isConnected());

nap(3, "disconnecting");

// start zk
server.start();
log.info(">>>>>>>>>> STARTED ZK SERVER");

await().atMost(5, MINUTES).until(() -> stateListener.getCurrentState() == CONNECTED || stateListener.getCurrentState() == RECONNECTED);

// NOTE: it never gets here - no state changes after LOST

assertTrue(client.getZookeeperClient().isConnected());

When this is run, I get the following output:

[Thread-0] INFO org.apache.curator.test.TestingZooKeeperMain - Starting server
[Thread-0] WARN org.apache.zookeeper.server.ServerCnxnFactory - maxCnxns is not configured, using default value 0.
[main] INFO org.apache.curator.framework.imps.CuratorFrameworkImpl - Starting
[main] INFO org.apache.curator.framework.imps.CuratorFrameworkImpl - Default schema
[main] WARN demo.CuratorRecoveryTest - CONNECTION-STATE-CHANGE: null --> CONNECTED
[main] DEBUG demo.CuratorRecoveryTest - Taking a 3s nap for initial settling...
[main] DEBUG demo.CuratorRecoveryTest - Done napping for initial settling...
[Curator-ConnectionStateManager-0] WARN demo.CuratorRecoveryTest - CONNECTION-STATE-CHANGE: CONNECTED --> SUSPENDED
[main] INFO demo.CuratorRecoveryTest - >>>>>>>>>> STOPPED ZK SERVER
[main] DEBUG demo.CuratorRecoveryTest - Taking a 3s nap for letting things bake...
[main] DEBUG demo.CuratorRecoveryTest - Done napping for letting things bake...
[main] DEBUG demo.CuratorRecoveryTest - Taking a 3s nap for disconnecting...
[main] DEBUG demo.CuratorRecoveryTest - Done napping for disconnecting...
[main] INFO demo.CuratorRecoveryTest - >>>>>>>>>> STARTED ZK SERVER
[Curator-ConnectionStateManager-0] WARN org.apache.curator.framework.state.ConnectionStateManager - Session timeout has elapsed while SUSPENDED. Injecting a session expiration. Elapsed ms: 20009. Adjusted session timeout ms: 20000
[main-EventThread] WARN org.apache.curator.ConnectionState - Session expired event received
[Curator-ConnectionStateManager-0] WARN demo.CuratorRecoveryTest - CONNECTION-STATE-CHANGE: SUSPENDED --> LOST

The test then fails because the await condition is never met.

NOTE: This happens on an older version combination of Curator and Zookeeper as well, so this is not a "bleeding edge" issue.

What am I missing?

I had a similar problem and concluded that curator appears to reuse a stale IP when the zookeeper server is restarted.

The approach outlined in this ticket worked for me. In particular, this commit adds a custom ZookeeperFactory that does not reuse the previous stale IP and instead uses the original unresolved hostname.

In short, when creating curator, assign a custom ZookeeperFactory:

CuratorFramework zkClient = CuratorFrameworkFactory
    .builder()
    ...
    .zookeeperFactory(new ZKClientFactory())
    .build();

where ZKClientFactory creates a new Zookeeper from the cached connectString.
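
To make the "stale IP" versus "original unresolved hostname" distinction more concrete, here is a minimal sketch that uses only java.net. It is not Curator or ZooKeeper code, and it is not taken from the commit mentioned above. The class name UnresolvedAddressDemo, its helper methods, and the hostname zk.example.invalid are all made up for illustration. It only shows that an address built with InetSocketAddress.createUnresolved keeps the hostname for later resolution, while the ordinary constructor resolves immediately and stays pinned to that result.

```java
import java.net.InetSocketAddress;

// Hypothetical demo class; not part of Curator or ZooKeeper.
final class UnresolvedAddressDemo {

    // Keeps only the hostname. No DNS lookup happens here, so whoever
    // connects later can resolve the name afresh.
    static InetSocketAddress unresolved(String host, int port) {
        return InetSocketAddress.createUnresolved(host, port);
    }

    // Resolves in the constructor. The object stays tied to whatever IP
    // the name mapped to at construction time.
    static InetSocketAddress resolvedNow(String host, int port) {
        return new InetSocketAddress(host, port);
    }

    public static void main(String[] args) {
        // zk.example.invalid is a placeholder hostname (assumption).
        InetSocketAddress lazy = unresolved("zk.example.invalid", 2181);
        System.out.println(lazy.isUnresolved());   // true
        System.out.println(lazy.getHostString());  // zk.example.invalid
        System.out.println(lazy.getAddress());     // null

        // A literal IP needs no DNS lookup, so this runs offline.
        InetSocketAddress pinned = resolvedNow("127.0.0.1", 2181);
        System.out.println(pinned.isUnresolved()); // false
    }
}
```

If the server comes back under a different IP, anything still holding the eagerly resolved form would keep pointing at the old address, which matches the symptom described above. Whether that is what actually happens inside a given Curator/ZooKeeper version combination is something to verify against the ticket.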