Ignite Cluster becomes unresponsive when relaunching client nodes

We intermittently see the errors below in our Kubernetes setup. The problem occurs after we restart the Tomcat pods, which launch new Ignite client nodes.

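I understand that the first stack trace shows Ignite detecting that the TCP communication SPI worker (tcp-comm-worker) has become unresponsive, but I don't understand how it relates to the second stack trace. They look like two completely unrelated errors, yet the thread dump in the second one carries the same timestamp as the first: Thread dump at 2021/10/12 15:57:17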

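Shutting down all the Ignite pods and restarting them resolves the problem, but we would like to understand it better and find a fix that does not require restarting Ignite.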

12-Oct-2021 15:57:17.139 WARNING [grid-timeout-worker-#134%igniteClientInstance%] org.apache.ignite.logger.java.JavaLogger.warning Possible failure suppressed accordingly to a configured handler [hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0, super=AbstractFailureHandler [ignoredFailureTypes=UnmodifiableSet [SYSTEM_WORKER_BLOCKED, SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], failureCtx=FailureContext [type=SYSTEM_WORKER_BLOCKED, err=class o.a.i.IgniteException: GridWorker [name=tcp-comm-worker, igniteInstanceName=igniteClientInstance, finished=false, heartbeatTs=1634054222218]]]
class org.apache.ignite.IgniteException: GridWorker [name=tcp-comm-worker, igniteInstanceName=igniteClientInstance, finished=false, heartbeatTs=1634054222218]
        at java.base/sun.nio.ch.Net.poll(Native Method)
        at java.base/sun.nio.ch.SocketChannelImpl.pollConnected(SocketChannelImpl.java:991)
        at java.base/sun.nio.ch.SocketAdaptor.connect(SocketAdaptor.java:119)
        at org.apache.ignite.spi.communication.tcp.internal.GridNioServerWrapper.createNioSession(GridNioServerWrapper.java:465)
        at org.apache.ignite.spi.communication.tcp.internal.GridNioServerWrapper.createTcpClient(GridNioServerWrapper.java:691)
        at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createTcpClient(TcpCommunicationSpi.java:1255)
        at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi$$Lambda9/0x0000000012e5ffc0.apply(Unknown Source)
        at org.apache.ignite.spi.communication.tcp.internal.GridNioServerWrapper.createTcpClient(GridNioServerWrapper.java:689)
        at org.apache.ignite.spi.communication.tcp.internal.ConnectionClientPool.createCommunicationClient(ConnectionClientPool.java:453)
        at org.apache.ignite.spi.communication.tcp.internal.ConnectionClientPool.reserveClient(ConnectionClientPool.java:228)
        at org.apache.ignite.spi.communication.tcp.internal.CommunicationWorker.processDisconnect(CommunicationWorker.java:374)
        at org.apache.ignite.spi.communication.tcp.internal.CommunicationWorker.body(CommunicationWorker.java:174)
        at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120)
        at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.body(TcpCommunicationSpi.java:923)
        at org.apache.ignite.spi.IgniteSpiThread.run(IgniteSpiThread.java:58)
12-Oct-2021 15:57:17.141 WARNING [grid-timeout-worker-#134%igniteClientInstance%] org.apache.ignite.logger.java.JavaLogger.warning No deadlocked threads detected.
12-Oct-2021 15:57:17.170 WARNING [grid-timeout-worker-#134%igniteClientInstance%] org.apache.ignite.logger.java.JavaLogger.warning Thread dump at 2021/10/12 15:57:17 GMT
Thread [name="main", id=1, state=RUNNABLE, blockCnt=19, waitCnt=416]
        at java.base/java.net.SocketInputStream.socketRead0(Native Method)
        at java.base/java.net.SocketInputStream.socketRead(SocketInputStream.java:115)
        at java.base/java.net.SocketInputStream.read(SocketInputStream.java:168)
        at java.base/java.net.SocketInputStream.read(SocketInputStream.java:140)
        at java.base/java.io.BufferedInputStream.fill(BufferedInputStream.java:252)
        at java.base/java.io.BufferedInputStream.read(BufferedInputStream.java:271)
        - locked java.io.BufferedInputStream@263909ea
        at org.postgresql.core.PGStream.ReceiveChar(PGStream.java:256)
        at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1163)
        at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:188)
        - locked org.postgresql.core.v3.QueryExecutorImpl@1b338a37
        at org.postgresql.jdbc2.AbstractJdbc2Statement.execute(AbstractJdbc2Statement.java:437)
        at org.postgresql.jdbc2.AbstractJdbc2Statement.executeWithFlags(AbstractJdbc2Statement.java:353)
        at org.postgresql.jdbc2.AbstractJdbc2Statement.executeQuery(AbstractJdbc2Statement.java:257)
        at com.mchange.v2.c3p0.impl.NewProxyPreparedStatement.executeQuery(NewProxyPreparedStatement.java:116)
        at org.hibernate.engine.jdbc.internal.ResultSetReturnImpl.extract(ResultSetReturnImpl.java:70)
        at org.hibernate.loader.Loader.getResultSet(Loader.java:2123)
        at org.hibernate.loader.Loader.executeQueryStatement(Loader.java:1911)
        at org.hibernate.loader.Loader.executeQueryStatement(Loader.java:1887)
        at org.hibernate.loader.Loader.doQuery(Loader.java:932)
        at org.hibernate.loader.Loader.doQueryAndInitializeNonLazyCollections(Loader.java:349)
        at org.hibernate.loader.Loader.doList(Loader.java:2615)
        at org.hibernate.loader.Loader.doList(Loader.java:2598)
        at org.hibernate.loader.Loader.listIgnoreQueryCache(Loader.java:2430)
        at org.hibernate.loader.Loader.list(Loader.java:2425)
        at org.hibernate.loader.hql.QueryLoader.list(QueryLoader.java:502)
        at org.hibernate.hql.internal.ast.QueryTranslatorImpl.list(QueryTranslatorImpl.java:370)
        at org.hibernate.engine.query.spi.HQLQueryPlan.performList(HQLQueryPlan.java:216)
        at org.hibernate.internal.SessionImpl.list(SessionImpl.java:1481)
        at org.hibernate.query.internal.AbstractProducedQuery.doList(AbstractProducedQuery.java:1441)
        at org.hibernate.query.internal.AbstractProducedQuery.list(AbstractProducedQuery.java:1410)
        at org.hibernate.Query.getResultList(Query.java:427)
        at com.foo.dao.hibernate.report.FooBarImpl.retrieveFoo(FooBarImpl.java:61)
        at jdk.internal.reflect.GeneratedMethodAccessor513.invoke(Unknown Source)
        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

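When Ignite reports a failure through its FailureHandler, it also generates a thread dump of all threads (for analysis, if needed). Your second stack trace looks like part of that thread dump, so it is not a separate error, which is why it carries the same timestamp as the first.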
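
To see why an unrelated stack (such as the main thread waiting on a PostgreSQL read) shows up in such a dump, here is a minimal sketch that uses only the JDK's ThreadMXBean. It is not Ignite's own code; the class name ThreadDumpSketch, the thread name unrelated-db-reader, and the output format are made up for illustration. It starts a thread that has nothing to do with any failure, takes a dump of all threads, and checks that the unrelated thread is listed.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.CountDownLatch;

public class ThreadDumpSketch {

    /** Builds a text dump of every live thread in the JVM, relevant or not. */
    static String dumpAllThreads() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        StringBuilder sb = new StringBuilder();
        // false, false: skip locked monitor/synchronizer details to keep it simple.
        for (ThreadInfo info : mx.dumpAllThreads(false, false)) {
            sb.append("Thread [name=\"").append(info.getThreadName())
              .append("\", id=").append(info.getThreadId())
              .append(", state=").append(info.getThreadState())
              .append("]\n");
            for (StackTraceElement frame : info.getStackTrace()) {
                sb.append("        at ").append(frame).append('\n');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        CountDownLatch started = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);

        // Hypothetical stand-in for a thread that is busy with unrelated work.
        Thread unrelated = new Thread(() -> {
            started.countDown();
            try {
                release.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "unrelated-db-reader");
        unrelated.setDaemon(true);
        unrelated.start();
        started.await();

        String dump = dumpAllThreads();
        // The unrelated thread appears in the dump simply because it is alive.
        System.out.println("dump lists unrelated thread: "
                + dump.contains("unrelated-db-reader"));

        release.countDown();
    }
}
```

The dump is a snapshot of the whole JVM, so the threads listed after the warning are context for diagnosing the blocked tcp-comm-worker, not additional errors.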