Apache Ignite will not start with 6 node cluster - Failed to resolve nodes topology

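We have two systems: a QA system with 2 nodes and a Prod system with 6 nodes.

The QA system starts up perfectly. We had a working system, so we promoted it to production.

The Prod system starts up, throws the errors below after about 16 seconds, and none of the Ignite caches work.

2 nodes start up; the other 4 nodes never manage to start.

On one of the nodes that did not start:

The Ignite messages begin at: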

2020-11-24 18:30:52 INFO  [] stdout:71 - [18:30:52]    __________  ________________ 
2020-11-24 18:30:52 INFO  [] stdout:71 - [18:30:52]   /  _/ ___/ |/ /  _/_  __/ __/ 
2020-11-24 18:30:52 INFO  [] stdout:71 - [18:30:52]  _/ // (7 7    // /  / / / _/   
2020-11-24 18:30:52 INFO  [] stdout:71 - [18:30:52] /___/\___/_/|_/___/ /_/ /___/  

And at 2020-11-24 18:42:09 we get the following errors (data sanitized):

2020-11-24 18:42:09 INFO  [] GridTcpRestProtocol:285 - Command protocol successfully stopped: TCP binary
2020-11-24 18:42:09 INFO  [] GridDhtPartitionsExchangeFuture:285 - Finish exchange future [startVer=AffinityTopologyVersion [topVer=8, minorTopVer=0], resVer=null, err=class org.apache.ignite.internal.NodeStoppingException: Node is stopping: null, rebalanced=false, wasRebalanced=false]
2020-11-24 18:42:09 INFO  [] GridDhtPartitionsExchangeFuture:285 - Completed partition exchange [localNode=4a0b2901-adc1-4416-8345-82caa6a18cea, exchange=GridDhtPartitionsExchangeFuture [topVer=AffinityTopologyVersion [topVer=8, minorTopVer=0], evt=NODE_LEFT, evtNode=TcpDiscoveryNode [id=7a62d367-a907-43c2-90b4-53d15ec30a91, consistentId=10.10.232.6,127.0.0.1,152.16.11.67:47500, addrs=ArrayList [10.10.232.6, 127.0.0.1, 152.16.11.67], sockAddrs=HashSet [/127.0.0.1:47500, mzitsme1-nick-p1.myarea.example.com/10.10.232.6:47500, itsme1-nick-p1.myarea.example.com/152.16.11.67:47500], discPort=47500, order=5, intOrder=5, lastExchangeTime=1606264503158, loc=false, ver=2.8.1#20200521-sha1:86422096, isClient=false], done=true, newCrdFut=null], topVer=null]
2020-11-24 18:42:09 WARNING [] GridDhtAtomicCache:295 - <MY_CACHE> Failed to update key on backup (local node is stopping): KeyCacheObjectImpl [part=377, val=com.example.MyCache, hasValBytes=true]
2020-11-24 18:42:09 WARNING [] GridDhtAtomicCache:295 - <MY_CACHE> Failed to update key on backup (local node is stopping): KeyCacheObjectImpl [part=377, val=com.example.MyCache, hasValBytes=true]
2020-11-24 18:42:09 WARNING [] GridDhtAtomicCache:295 - <MY_CACHE> Failed to update key on backup (local node is stopping): KeyCacheObjectImpl [part=377, val=com.example.MyCache, hasValBytes=true]
2020-11-24 18:42:09 SEVERE [] GridDhtAtomicCache:310 - <MYCACHE2> Unexpected exception during cache update: class org.apache.ignite.IgniteException: Failed to resolve nodes topology [cacheGrp=CACHE_MY_CACHE, topVer=AffinityTopologyVersion [topVer=9, minorTopVer=0], history=[AffinityTopologyVersion [topVer=4, minorTopVer=0], AffinityTopologyVersion [topVer=5, minorTopVer=0], AffinityTopologyVersion [topVer=6, minorTopVer=0], AffinityTopologyVersion [topVer=7, minorTopVer=0], AffinityTopologyVersion [topVer=8, minorTopVer=0]], snap=Snapshot [topVer=AffinityTopologyVersion [topVer=8, minorTopVer=0]], locNode=TcpDiscoveryNode [id=4a0b2901-adc1-4416-8345-82caa6a18cea, consistentId=10.10.232.14,127.0.0.1,152.16.11.75:47500, addrs=ArrayList [10.10.232.14, 127.0.0.1, 152.16.11.75], sockAddrs=HashSet [/127.0.0.1:47500, mzitsme4-nick.myarea.example.com/10.10.232.14:47500, itsme4-nick.myarea.example.com/152.16.11.75:47500], discPort=47500, order=4, intOrder=4, lastExchangeTime=1606264871091, loc=true, ver=2.8.1#20200521-sha1:86422096, isClient=false]]
    at org.apache.ignite.internal.managers.discovery.GridDiscoveryManager.resolveDiscoCache(GridDiscoveryManager.java:1999)
    at org.apache.ignite.internal.managers.discovery.GridDiscoveryManager.cacheGroupAffinityNodes(GridDiscoveryManager.java:1881)
    at org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtCacheAdapter.needRemap(GridDhtCacheAdapter.java:1297)
    at org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.updateAllAsyncInternal0(GridDhtAtomicCache.java:1850)
    at org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.updateAllAsyncInternal(GridDhtAtomicCache.java:1719)
    at org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.processNearAtomicUpdateRequest(GridDhtAtomicCache.java:3306)
    at org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.access0(GridDhtAtomicCache.java:141)
    at org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.apply(GridDhtAtomicCache.java:273)
    at org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.apply(GridDhtAtomicCache.java:268)
    at org.apache.ignite.internal.processors.cache.GridCacheIoManager.processMessage(GridCacheIoManager.java:1142)
    at org.apache.ignite.internal.processors.cache.GridCacheIoManager.onMessage0(GridCacheIoManager.java:591)
    at org.apache.ignite.internal.processors.cache.GridCacheIoManager.handleMessage(GridCacheIoManager.java:392)
    at org.apache.ignite.internal.processors.cache.GridCacheIoManager.handleMessage(GridCacheIoManager.java:318)
    at org.apache.ignite.internal.processors.cache.GridCacheIoManager.access0(GridCacheIoManager.java:109)
    at org.apache.ignite.internal.processors.cache.GridCacheIoManager.run(GridCacheIoManager.java:288)
    at org.apache.ignite.internal.util.StripedExecutor$Stripe.body(StripedExecutor.java:565)
    at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120)
    at java.lang.Thread.run(Thread.java:748)

Below is my cache configuration code. The default value shown for each property is the value we are currently using.

    @PostConstruct
    public void init() {
        if (!CACHING_ENABLED) {
            LOGGER.warn("Caching is currently disabled because {} is not set to Y in the properties files!!!", Constants.PROPERTY_CACHING_ENABLED);
            return;
        }
        try {
            System.setProperty("IGNITE_UPDATE_NOTIFIER", "false");
            
            igniteConfiguration = new IgniteConfiguration();
            
            int failureDetectionTimeout = Integer.parseInt(getProperty("IGNITE_TCP_DISCOVERY_FAILURE_DETECTION_TIMEOUT", "60000"));
            
            igniteConfiguration.setFailureDetectionTimeout(failureDetectionTimeout);
            String igniteCacheStorageDirectory = getProperty("IGNITE_CACHE_STORAGE_DIRECTORY");
            if (StringUtils.isNotBlank(igniteCacheStorageDirectory)) {
                DataStorageConfiguration dsCfg = new DataStorageConfiguration();
                DataRegionConfiguration dfltDataRegConf = new DataRegionConfiguration();
                dfltDataRegConf.setPersistenceEnabled(true);
                dsCfg.setDefaultDataRegionConfiguration(dfltDataRegConf);
                dsCfg.setStoragePath(igniteCacheStorageDirectory);
                igniteConfiguration.setDataStorageConfiguration(dsCfg); 
            }

            String igniteVmIps = getProperty("IGNITE_VM_IPS");
            List<String> addresses = Arrays.asList("127.0.0.1:47500");
            if (StringUtils.isNotBlank(igniteVmIps)) {
                addresses = Arrays.asList(igniteVmIps.split(","));
            }
            
            int networkTimeout = Integer.parseInt(getProperty("IGNITE_TCP_DISCOVERY_NETWORK_TIMEOUT", "60000"));
            boolean failureDetectionTimeoutEnabled = Boolean.parseBoolean(getProperty("IGNITE_TCP_DISCOVERY_FAILURE_DETECTION_TIMEOUT_ENABLED", "true"));
            
            int tcpDiscoveryLocalPort = Integer.parseInt(getProperty("IGNITE_TCP_DISCOVERY_LOCAL_PORT", "47500"));
            int tcpDiscoveryLocalPortRange = Integer.parseInt(getProperty("IGNITE_TCP_DISCOVERY_LOCAL_PORT_RANGE", "0"));
            
            TcpDiscoverySpi tcpDiscoverySpi = new TcpDiscoverySpi();
            tcpDiscoverySpi.setLocalPort(tcpDiscoveryLocalPort);
            tcpDiscoverySpi.setLocalPortRange(tcpDiscoveryLocalPortRange);
            tcpDiscoverySpi.setNetworkTimeout(networkTimeout);
            tcpDiscoverySpi.failureDetectionTimeoutEnabled(failureDetectionTimeoutEnabled);
            TcpDiscoveryVmIpFinder ipFinder = new TcpDiscoveryVmIpFinder();
            ipFinder.setAddresses(addresses);
            tcpDiscoverySpi.setIpFinder(ipFinder);
            
            int messageQueueLimit = Integer.parseInt(getProperty("IGNITE_TCP_DISCOVERY_MESSAGE_QUEUE_LIMIT", "1000"));
            
            TcpCommunicationSpi tcpCommunicationSpi = new TcpCommunicationSpi();
            tcpCommunicationSpi.setMessageQueueLimit(messageQueueLimit);

            igniteConfiguration.setDiscoverySpi(tcpDiscoverySpi);
            igniteConfiguration.setCommunicationSpi(tcpCommunicationSpi);
            isInit = true;
        } catch (Exception e) {
            LOGGER.error("Could not initialize cache! Cache services will be unavailable!", e);
            isInit = false;
        }
    }

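Unfortunately, I cannot share the full logs. Are there any hints or tips I can look into to get rid of this error?

I have seen people mention setting the ack timeout to a higher value. Other than that, the forums do not offer many hints about what to do here.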

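OK, I think we have resolved this issue. Notice in the logs above how multiple NICs are found during TCP discovery. This is because each of my JBoss servers has 2 network interfaces: one for my LAN (10.10.232.6) and another for the DMZ (152.16.11.67). But the nodes in my cluster can only talk to each other over their LAN IPs.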
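To check which addresses a host exposes before starting a node, a small diagnostic like the one below can help. It is only a sketch using the standard java.net API; the class and method names are made up for illustration, and it says nothing about how Ignite itself enumerates addresses. If it prints both a LAN and a DMZ address, the host is multi-homed in the way described above.

```java
import java.net.Inet4Address;
import java.net.InetAddress;
import java.net.NetworkInterface;
import java.net.SocketException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Enumeration;
import java.util.List;

// Hypothetical helper, not part of Ignite: lists the IPv4 addresses of this host.
public class ListLocalAddresses {

    /** Returns "interfaceName -> address" for each non-loopback IPv4 address on interfaces that are up. */
    static List<String> nonLoopbackIpv4() {
        List<String> result = new ArrayList<>();
        try {
            Enumeration<NetworkInterface> nics = NetworkInterface.getNetworkInterfaces();
            if (nics == null) {
                return result; // no interfaces reported
            }
            for (NetworkInterface nic : Collections.list(nics)) {
                if (!nic.isUp() || nic.isLoopback()) {
                    continue; // skip interfaces that are down, and the loopback interface
                }
                for (InetAddress addr : Collections.list(nic.getInetAddresses())) {
                    if (addr instanceof Inet4Address) {
                        result.add(nic.getName() + " -> " + addr.getHostAddress());
                    }
                }
            }
        } catch (SocketException e) {
            // Interface information is unavailable; return whatever was collected so far.
        }
        return result;
    }

    public static void main(String[] args) {
        nonLoopbackIpv4().forEach(System.out::println);
    }
}
```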

My solution was to call igniteConfiguration.setLocalHost(InetAddress.getLocalHost().getHostAddress()); so that instead of binding to 0.0.0.0, the node binds to the LAN IP 10.10.232.6. This stopped Ignite discovery from trying to use the DMZ NIC.
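One caveat: InetAddress.getLocalHost() returns whatever the machine's host name resolves to, so on a multi-homed host it is not guaranteed to be the LAN address. A more explicit option is to pick the address by subnet prefix and pass that string to the setLocalHost call. The sketch below assumes the LAN addresses all start with "10.10." (taken from the logs above); the class and method names are hypothetical, and the Ignite call is left as a comment so the example runs without the Ignite jar.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

// Hypothetical helper, not part of Ignite: chooses the address to bind to by subnet prefix.
public class LanAddressPicker {

    /** Returns the first candidate address that starts with the given prefix, if any. */
    static Optional<String> pickByPrefix(List<String> candidates, String prefix) {
        return candidates.stream()
                .filter(addr -> addr.startsWith(prefix))
                .findFirst();
    }

    public static void main(String[] args) {
        // Addresses reported for one node in the logs above.
        List<String> addrs = Arrays.asList("10.10.232.6", "127.0.0.1", "152.16.11.67");

        // Assumption: the LAN subnet is identified by the "10.10." prefix.
        String lanIp = pickByPrefix(addrs, "10.10.")
                .orElseThrow(() -> new IllegalStateException("No LAN address found"));

        System.out.println(lanIp); // prints 10.10.232.6

        // In the init() method above, this value would then be used as:
        // igniteConfiguration.setLocalHost(lanIp);
    }
}
```

In a real deployment the candidate list would come from the host's network interfaces or from a per-node property rather than a hard-coded list.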