Apache Ignite will not start with a 6-node cluster - Failed to resolve nodes topology
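We have two systems: a QA system with 2 nodes and a Prod system with 6 nodes.
The QA system starts up perfectly. Since we had a working system, we promoted it to production.
The Prod system starts up and, after about 16 seconds, throws these errors, and none of the Ignite caches work.
2 nodes start up; the other 4 nodes never do.
On one of the nodes that did not start:
The Ignite startup banner appears at: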
2020-11-24 18:30:52 INFO [] stdout:71 - [18:30:52] __________ ________________
2020-11-24 18:30:52 INFO [] stdout:71 - [18:30:52] / _/ ___/ |/ / _/_ __/ __/
2020-11-24 18:30:52 INFO [] stdout:71 - [18:30:52] _/ // (7 7 // / / / / _/
2020-11-24 18:30:52 INFO [] stdout:71 - [18:30:52] /___/\___/_/|_/___/ /_/ /___/
and at 2020-11-24 18:42:09
we get the following errors (data sanitized):
2020-11-24 18:42:09 INFO [] GridTcpRestProtocol:285 - Command protocol successfully stopped: TCP binary
2020-11-24 18:42:09 INFO [] GridDhtPartitionsExchangeFuture:285 - Finish exchange future [startVer=AffinityTopologyVersion [topVer=8, minorTopVer=0], resVer=null, err=class org.apache.ignite.internal.NodeStoppingException: Node is stopping: null, rebalanced=false, wasRebalanced=false]
2020-11-24 18:42:09 INFO [] GridDhtPartitionsExchangeFuture:285 - Completed partition exchange [localNode=4a0b2901-adc1-4416-8345-82caa6a18cea, exchange=GridDhtPartitionsExchangeFuture [topVer=AffinityTopologyVersion [topVer=8, minorTopVer=0], evt=NODE_LEFT, evtNode=TcpDiscoveryNode [id=7a62d367-a907-43c2-90b4-53d15ec30a91, consistentId=10.10.232.6,127.0.0.1,152.16.11.67:47500, addrs=ArrayList [10.10.232.6, 127.0.0.1, 152.16.11.67], sockAddrs=HashSet [/127.0.0.1:47500, mzitsme1-nick-p1.myarea.example.com/10.10.232.6:47500, itsme1-nick-p1.myarea.example.com/152.16.11.67:47500], discPort=47500, order=5, intOrder=5, lastExchangeTime=1606264503158, loc=false, ver=2.8.1#20200521-sha1:86422096, isClient=false], done=true, newCrdFut=null], topVer=null]
2020-11-24 18:42:09 WARNING [] GridDhtAtomicCache:295 - <MY_CACHE> Failed to update key on backup (local node is stopping): KeyCacheObjectImpl [part=377, val=com.example.MyCache, hasValBytes=true]
2020-11-24 18:42:09 WARNING [] GridDhtAtomicCache:295 - <MY_CACHE> Failed to update key on backup (local node is stopping): KeyCacheObjectImpl [part=377, val=com.example.MyCache, hasValBytes=true]
2020-11-24 18:42:09 WARNING [] GridDhtAtomicCache:295 - <MY_CACHE> Failed to update key on backup (local node is stopping): KeyCacheObjectImpl [part=377, val=com.example.MyCache, hasValBytes=true]
2020-11-24 18:42:09 SEVERE [] GridDhtAtomicCache:310 - <MYCACHE2> Unexpected exception during cache update: class org.apache.ignite.IgniteException: Failed to resolve nodes topology [cacheGrp=CACHE_MY_CACHE, topVer=AffinityTopologyVersion [topVer=9, minorTopVer=0], history=[AffinityTopologyVersion [topVer=4, minorTopVer=0], AffinityTopologyVersion [topVer=5, minorTopVer=0], AffinityTopologyVersion [topVer=6, minorTopVer=0], AffinityTopologyVersion [topVer=7, minorTopVer=0], AffinityTopologyVersion [topVer=8, minorTopVer=0]], snap=Snapshot [topVer=AffinityTopologyVersion [topVer=8, minorTopVer=0]], locNode=TcpDiscoveryNode [id=4a0b2901-adc1-4416-8345-82caa6a18cea, consistentId=10.10.232.14,127.0.0.1,152.16.11.75:47500, addrs=ArrayList [10.10.232.14, 127.0.0.1, 152.16.11.75], sockAddrs=HashSet [/127.0.0.1:47500, mzitsme4-nick.myarea.example.com/10.10.232.14:47500, itsme4-nick.myarea.example.com/152.16.11.75:47500], discPort=47500, order=4, intOrder=4, lastExchangeTime=1606264871091, loc=true, ver=2.8.1#20200521-sha1:86422096, isClient=false]]
at org.apache.ignite.internal.managers.discovery.GridDiscoveryManager.resolveDiscoCache(GridDiscoveryManager.java:1999)
at org.apache.ignite.internal.managers.discovery.GridDiscoveryManager.cacheGroupAffinityNodes(GridDiscoveryManager.java:1881)
at org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtCacheAdapter.needRemap(GridDhtCacheAdapter.java:1297)
at org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.updateAllAsyncInternal0(GridDhtAtomicCache.java:1850)
at org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.updateAllAsyncInternal(GridDhtAtomicCache.java:1719)
at org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.processNearAtomicUpdateRequest(GridDhtAtomicCache.java:3306)
at org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.access0(GridDhtAtomicCache.java:141)
at org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.apply(GridDhtAtomicCache.java:273)
at org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.apply(GridDhtAtomicCache.java:268)
at org.apache.ignite.internal.processors.cache.GridCacheIoManager.processMessage(GridCacheIoManager.java:1142)
at org.apache.ignite.internal.processors.cache.GridCacheIoManager.onMessage0(GridCacheIoManager.java:591)
at org.apache.ignite.internal.processors.cache.GridCacheIoManager.handleMessage(GridCacheIoManager.java:392)
at org.apache.ignite.internal.processors.cache.GridCacheIoManager.handleMessage(GridCacheIoManager.java:318)
at org.apache.ignite.internal.processors.cache.GridCacheIoManager.access0(GridCacheIoManager.java:109)
at org.apache.ignite.internal.processors.cache.GridCacheIoManager.run(GridCacheIoManager.java:288)
at org.apache.ignite.internal.util.StripedExecutor$Stripe.body(StripedExecutor.java:565)
at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120)
at java.lang.Thread.run(Thread.java:748)
Below is my cache configuration code. The default values shown for all properties are the ones we currently use.
    @PostConstruct
    public void init() {
        if (!CACHING_ENABLED) {
            LOGGER.warn("Caching is currently disabled because {} is not set to Y in the properties files!!!", Constants.PROPERTY_CACHING_ENABLED);
            return;
        }
        try {
            System.setProperty("IGNITE_UPDATE_NOTIFIER", "false");
            igniteConfiguration = new IgniteConfiguration();
            int failureDetectionTimeout = Integer.parseInt(getProperty("IGNITE_TCP_DISCOVERY_FAILURE_DETECTION_TIMEOUT", "60000"));
            igniteConfiguration.setFailureDetectionTimeout(failureDetectionTimeout);
            String igniteCacheStorageDirectory = getProperty("IGNITE_CACHE_STORAGE_DIRECTORY");
            if (StringUtils.isNotBlank(igniteCacheStorageDirectory)) {
                DataStorageConfiguration dsCfg = new DataStorageConfiguration();
                DataRegionConfiguration dfltDataRegConf = new DataRegionConfiguration();
                dfltDataRegConf.setPersistenceEnabled(true);
                dsCfg.setDefaultDataRegionConfiguration(dfltDataRegConf);
                dsCfg.setStoragePath(igniteCacheStorageDirectory);
                igniteConfiguration.setDataStorageConfiguration(dsCfg);
            }
            String igniteVmIps = getProperty("IGNITE_VM_IPS");
            List<String> addresses = Arrays.asList("127.0.0.1:47500");
            if (StringUtils.isNotBlank(igniteVmIps)) {
                addresses = Arrays.asList(igniteVmIps.split(","));
            }
            int networkTimeout = Integer.parseInt(getProperty("IGNITE_TCP_DISCOVERY_NETWORK_TIMEOUT", "60000"));
            boolean failureDetectionTimeoutEnabled = Boolean.parseBoolean(getProperty("IGNITE_TCP_DISCOVERY_FAILURE_DETECTION_TIMEOUT_ENABLED", "true"));
            int tcpDiscoveryLocalPort = Integer.parseInt(getProperty("IGNITE_TCP_DISCOVERY_LOCAL_PORT", "47500"));
            int tcpDiscoveryLocalPortRange = Integer.parseInt(getProperty("IGNITE_TCP_DISCOVERY_LOCAL_PORT_RANGE", "0"));
            TcpDiscoverySpi tcpDiscoverySpi = new TcpDiscoverySpi();
            tcpDiscoverySpi.setLocalPort(tcpDiscoveryLocalPort);
            tcpDiscoverySpi.setLocalPortRange(tcpDiscoveryLocalPortRange);
            tcpDiscoverySpi.setNetworkTimeout(networkTimeout);
            tcpDiscoverySpi.failureDetectionTimeoutEnabled(failureDetectionTimeoutEnabled);
            TcpDiscoveryVmIpFinder ipFinder = new TcpDiscoveryVmIpFinder();
            ipFinder.setAddresses(addresses);
            tcpDiscoverySpi.setIpFinder(ipFinder);
            int messageQueueLimit = Integer.parseInt(getProperty("IGNITE_TCP_DISCOVERY_MESSAGE_QUEUE_LIMIT", "1000"));
            TcpCommunicationSpi tcpCommunicationSpi = new TcpCommunicationSpi();
            tcpCommunicationSpi.setMessageQueueLimit(messageQueueLimit);
            igniteConfiguration.setDiscoverySpi(tcpDiscoverySpi);
            igniteConfiguration.setCommunicationSpi(tcpCommunicationSpi);
            isInit = true;
        } catch (Exception e) {
            LOGGER.error("Could not initialize cache! Cache services will be unavailable!", e);
            isInit = false;
        }
    }
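Unfortunately, I cannot share the full logs. Are there any hints or tips I can look into to get rid of this error?
I have seen mentions of setting the ack timeout to a higher value. Other than that, the forums offer few hints about what to do here.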
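OK, I think we have solved this. Note above how multiple NICs are found during TCP discovery. That is because my JBoss servers have 2 network interfaces, one for my LAN (10.10.232.6) and another for the DMZ (152.16.11.67). However, the nodes in my cluster can only talk to each other over their LAN IPs.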
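One way to confirm that a node is advertising more than one NIC is to pull the `addrs=ArrayList [...]` fragment out of a `TcpDiscoveryNode` log entry. The following is a minimal sketch using only the JDK; the class name `NodeAddrs` and the method `advertisedAddresses` are hypothetical helpers, not part of Ignite, and the regular expression assumes the log format shown in the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NodeAddrs {
    // Matches the "addrs=ArrayList [a, b, c]" fragment of a TcpDiscoveryNode log entry.
    private static final Pattern ADDRS =
            Pattern.compile("addrs=ArrayList \\[([^\\]]*)\\]");

    /** Returns the non-loopback addresses from the first addrs=... fragment in the line. */
    public static List<String> advertisedAddresses(String logLine) {
        List<String> result = new ArrayList<>();
        Matcher m = ADDRS.matcher(logLine);
        if (m.find()) {
            for (String addr : m.group(1).split(",")) {
                String trimmed = addr.trim();
                // Skip empty tokens and IPv4 loopback entries such as 127.0.0.1.
                if (!trimmed.isEmpty() && !trimmed.startsWith("127.")) {
                    result.add(trimmed);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String line = "addrs=ArrayList [10.10.232.6, 127.0.0.1, 152.16.11.67], discPort=47500";
        // Prints [10.10.232.6, 152.16.11.67]: two routable addresses, so the node is multi-homed.
        System.out.println(advertisedAddresses(line));
    }
}
```

If the returned list has more than one entry, other nodes may try an address they cannot reach, which matches the LAN/DMZ situation described here.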
My solution was to call igniteConfiguration.setLocalHost(InetAddress.getLocalHost().getHostAddress()); so that, instead of binding to 0.0.0.0, the node binds to the LAN IP 10.10.232.6. This stopped Ignite discovery from trying to use the DMZ NIC.
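Because `InetAddress.getLocalHost()` depends on how the host name resolves, it may not return the LAN address on every machine. A more explicit alternative is to enumerate the interfaces and pick the address on the LAN subnet. The sketch below uses only the JDK; `LanAddressPicker`, `pickByPrefix`, `localAddresses`, and the prefix `10.10.232.` (taken from the addresses in the logs) are illustrative assumptions, not the author's code or an Ignite API.

```java
import java.net.InetAddress;
import java.net.NetworkInterface;
import java.net.SocketException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Enumeration;
import java.util.List;
import java.util.Optional;

public class LanAddressPicker {
    /** Returns the first candidate that starts with the given prefix, e.g. "10.10.232.". */
    public static Optional<String> pickByPrefix(List<String> candidates, String prefix) {
        return candidates.stream().filter(a -> a.startsWith(prefix)).findFirst();
    }

    /** Lists the textual addresses of all network interfaces that are currently up. */
    public static List<String> localAddresses() throws SocketException {
        List<String> out = new ArrayList<>();
        Enumeration<NetworkInterface> nics = NetworkInterface.getNetworkInterfaces();
        if (nics == null) {
            return out; // no interfaces reported on this host
        }
        for (NetworkInterface nic : Collections.list(nics)) {
            if (!nic.isUp()) {
                continue;
            }
            for (InetAddress addr : Collections.list(nic.getInetAddresses())) {
                out.add(addr.getHostAddress());
            }
        }
        return out;
    }

    public static void main(String[] args) throws SocketException {
        // Assumed LAN prefix; override it with the first command-line argument.
        String lanPrefix = args.length > 0 ? args[0] : "10.10.232.";
        Optional<String> lan = pickByPrefix(localAddresses(), lanPrefix);
        // The chosen string is what would be passed to igniteConfiguration.setLocalHost(...).
        System.out.println(lan.orElse("no address with prefix " + lanPrefix));
    }
}
```

Selecting by subnet prefix keeps the choice independent of host name resolution, at the cost of one more property to configure per environment.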