Hazelcast Jet stuck on starting Job
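I'm seeing strange behavior in Hazelcast Jet. I start a lot of jobs at about the same time (around 30; some start slightly before others). However, once the number of Hazelcast Jet jobs reaches 26 (a magic number?), all processing gets stuck.
In the thread dump I see the following: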
"hz._hzInstance_1_jet.cached.thread-1" #37 prio=5 os_prio=0 cpu=1093.29ms elapsed=393.62s tid=0x00007f95dc007000 nid=0x6bfc in Object.wait() [0x00007f95e6af4000]
java.lang.Thread.State: TIMED_WAITING (on object monitor)
at java.lang.Object.wait(java.base@11.0.2/Native Method)
- waiting on <no object reference available>
at com.hazelcast.spi.impl.AbstractCompletableFuture.get(AbstractCompletableFuture.java:229)
- waiting to re-lock in wait() <0x00000007864b7040> (a com.hazelcast.internal.util.SimpleCompletableFuture)
at com.hazelcast.spi.impl.AbstractCompletableFuture.get(AbstractCompletableFuture.java:191)
at com.hazelcast.spi.impl.operationservice.impl.InvokeOnPartitions.invoke(InvokeOnPartitions.java:88)
at com.hazelcast.spi.impl.operationservice.impl.OperationServiceImpl.invokeOnAllPartitions(OperationServiceImpl.java:385)
at com.hazelcast.map.impl.proxy.MapProxySupport.clearInternal(MapProxySupport.java:1016)
at com.hazelcast.map.impl.proxy.MapProxyImpl.clearInternal(MapProxyImpl.java:109)
at com.hazelcast.map.impl.proxy.MapProxyImpl.clear(MapProxyImpl.java:698)
at com.hazelcast.jet.impl.JobRepository.clearSnapshotData(JobRepository.java:464)
at com.hazelcast.jet.impl.MasterJobContext.tryStartJob(MasterJobContext.java:233)
at com.hazelcast.jet.impl.JobCoordinationService.tryStartJob(JobCoordinationService.java:776)
at com.hazelcast.jet.impl.JobCoordinationService.lambda$submitJob$0(JobCoordinationService.java:200)
at com.hazelcast.jet.impl.JobCoordinationService$$Lambda4/0x00000008009ce840.run(Unknown Source)
And:
"hz._hzInstance_1_jet.async.thread-2" #81 prio=5 os_prio=0 cpu=0.00ms elapsed=661.98s tid=0x0000025bb23ef000 nid=0x43bc in Object.wait() [0x0000005d492fe000]
java.lang.Thread.State: TIMED_WAITING (on object monitor)
at java.lang.Object.wait(java.base@11/Native Method)
- waiting on <no object reference available>
at com.hazelcast.spi.impl.AbstractCompletableFuture.get(AbstractCompletableFuture.java:229)
- waiting to re-lock in wait() <0x0000000725600100> (a com.hazelcast.internal.util.SimpleCompletableFuture)
at com.hazelcast.spi.impl.AbstractCompletableFuture.get(AbstractCompletableFuture.java:191)
at com.hazelcast.spi.impl.operationservice.impl.InvokeOnPartitions.invoke(InvokeOnPartitions.java:88)
at com.hazelcast.spi.impl.operationservice.impl.OperationServiceImpl.invokeOnAllPartitions(OperationServiceImpl.java:385)
at com.hazelcast.map.impl.proxy.MapProxySupport.removeAllInternal(MapProxySupport.java:619)
at com.hazelcast.map.impl.proxy.MapProxyImpl.removeAll(MapProxyImpl.java:285)
at com.hazelcast.jet.impl.JobRepository.deleteJob(JobRepository.java:332)
at com.hazelcast.jet.impl.JobRepository.completeJob(JobRepository.java:316)
at com.hazelcast.jet.impl.JobCoordinationService.completeJob(JobCoordinationService.java:576)
at com.hazelcast.jet.impl.MasterJobContext.lambda$finalizeJob(MasterJobContext.java:620)
at com.hazelcast.jet.impl.MasterJobContext$$Lambda3/0x0000000800b26840.run(Unknown Source)
at com.hazelcast.jet.impl.MasterJobContext.finalizeJob(MasterJobContext.java:632)
at com.hazelcast.jet.impl.MasterJobContext.onCompleteExecutionCompleted(MasterJobContext.java:564)
at com.hazelcast.jet.impl.MasterJobContext.lambda$invokeCompleteExecution(MasterJobContext.java:544)
at com.hazelcast.jet.impl.MasterJobContext$$Lambda9/0x0000000800b27840.accept(Unknown Source)
at com.hazelcast.jet.impl.MasterContext.lambda$invokeOnParticipants$1(MasterContext.java:242)
at com.hazelcast.jet.impl.MasterContext$$Lambda6/0x0000000800a1c040.accept(Unknown Source)
at com.hazelcast.jet.impl.util.Util.onResponse(Util.java:172)
at com.hazelcast.spi.impl.AbstractInvocationFuture.run(AbstractInvocationFuture.java:256)
at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11/ThreadPoolExecutor.java:1128)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11/ThreadPoolExecutor.java:628)
at java.lang.Thread.run(java.base@11/Thread.java:834)
at com.hazelcast.util.executor.HazelcastManagedThread.executeRun(HazelcastManagedThread.java:64)
at com.hazelcast.util.executor.HazelcastManagedThread.run(HazelcastManagedThread.java:80)
I don't know how to reproduce this, but I hope someone knows how to solve it, or that my question will help others :)
My setup:
- Java 11
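- Hazelcast 3.12 SNAPSHOT
- Hazelcast Jet 3.0 SNAPSHOT (I can't revert to a previous version, it would break my logic; I need the n:m join, which will be added in 3.1)
- CPU cores: 4
- RAM: 7 GB
- Jet mode: server, connecting as a client to another cluster to insert the final data.
Has anyone run into a similar problem? The trouble is that it can't be reproduced easily, so it's hard to file an issue with the Hazelcast team. Only the thread dumps and the general behavior hint at what is happening.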
This was an issue in 3.0-SNAPSHOT during development; it was fixed in the 3.0 release.
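If anyone hits a similar hang on another build and needs to attach evidence to a bug report, thread dumps like the ones in the question can also be captured from inside the JVM. Below is a minimal sketch using only the standard `java.lang.management` API; the class name `ThreadDumpSketch` and the `dumpThreads` helper are made up for illustration and are not part of Hazelcast or Jet. The `hz.` prefix is simply what the thread names in the question start with.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class ThreadDumpSketch {

    /** Returns the name, state and stack of every live thread whose name starts with namePrefix. */
    public static String dumpThreads(String namePrefix) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        StringBuilder out = new StringBuilder();
        // false, false: skip locked-monitor and synchronizer details to keep the dump cheap
        for (ThreadInfo info : threads.dumpAllThreads(false, false)) {
            if (!info.getThreadName().startsWith(namePrefix)) {
                continue;
            }
            out.append('"').append(info.getThreadName()).append("\" ")
               .append(info.getThreadState()).append('\n');
            for (StackTraceElement frame : info.getStackTrace()) {
                out.append("    at ").append(frame).append('\n');
            }
            out.append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Assumption: pass "hz." to see only Hazelcast-named threads; no argument dumps all threads
        String prefix = args.length > 0 ? args[0] : "";
        System.out.print(dumpThreads(prefix));
    }
}
```

Taking two or three such dumps a few seconds apart and comparing them shows whether the same threads stay parked in the same frames, which is usually more convincing in an issue report than a single snapshot.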