openjdk zgc 上的高 cpu 使用率
high cpu usage on openjdk zgc
我正在尝试在产品环境中使用 zgc,所以我将 jdk 从 jdk8 更新为 openjdk 15,将 tomcat 8 更新为 tomcat 8.5,和 gc 相关的选项,但是在 jvm 启动几个小时后 cpu 使用率达到 1000+%(正常 cpu 使用率应该是 100-300%)。有时,当cpu为高时,日志文件中有很多ICBufferFull安全点:
[2020-11-12T19:00:43.669+0800] Safepoint "ICBufferFull", Time since last: 41374119 ns, Reaching safepoint: 2026134 ns, At safepoint: 85993 ns, Total: 2112127 ns
[2020-11-12T19:00:43.672+0800] Safepoint "ICBufferFull", Time since last: 2521598 ns, Reaching safepoint: 1109894 ns, At safepoint: 57235 ns, Total: 1167129 ns
[2020-11-12T19:00:43.676+0800] Safepoint "ICBufferFull", Time since last: 2892867 ns, Reaching safepoint: 1240834 ns, At safepoint: 59633 ns, Total: 1300467 ns
[2020-11-12T19:00:43.681+0800] Safepoint "ICBufferFull", Time since last: 2870110 ns, Reaching safepoint: 1425837 ns, At safepoint: 54052 ns, Total: 1479889 ns
如果让节点下线30s左右,cpu掉线,然后再上线,会正常工作几个小时,直到cpu再次变高。
我检查了源代码,ICBufferFull 的意思是“内联缓存缓冲区已满”,但我找不到增加它的选项,有人可以帮忙吗?谢谢!
gc 选项是:
export JAVA_OPTS='-Xms10g -Xmx10g -XX:+UseLargePages -XX:ZAllocationSpikeTolerance=5 -XX:ParallelGCThreads=8 -XX:ConcGCThreads=4 -Xss2m -XX:+UseZGC -Xlog:gc,gc+phases,safepoint:file=/logs/gc.log:t:filecount=10,filesize=10m -XX:+HeapDumpOnOutOfMemoryError'
编辑1:
我有另一个主机在jdk8中仍然运行良好运行,以cms为参考,对2个主机的请求几乎相同。
我用async-profile分析了一下,最热的方法是java/lang/ThreadLocal$ThreadLocalMap.getEntryAfterMiss
,出现了50+%,最热的native方法是ZMark::try_mark_object(ZMarkCache*, unsigned long, bool)
,只出现了0.41%。我查看了jdk8和openjdk15中thread local相关的源码,好像没变。
编辑2:
我在JIRA上发现了一个类似的bug,我的应用也是lucene相关的,从gc日志来看,弱引用计数在1m+时,使用率很高。所以问题是如何在 zgc 中更积极地收集弱引用?
edit3:
从以下每 3 秒调用一次 System.gc() 的日志来看,我的应用程序似乎产生了过多的弱引用。但奇怪的是,生产的数量在开始后一直在增加。从 11:00-17:00 开始,请求几乎保持不变。
请注意,cpu 在 GC(9821) 后自动从 600% 下降到 400%,排队的是 ~250K。 GC(10265) 节点脱机,入队约为 770K。
为什么enqueued count 长时间变小,是不是表示对象没有完全回收?
[2020-11-19T11:00:00.245+0800] GC(992) Weak: 155658 encountered, 72334 discovered, 0 enqueued
[2020-11-19T12:00:00.397+0800] GC(2194) Weak: 220462 encountered, 122216 discovered, 1380 enqueued
[2020-11-19T12:00:03.411+0800] GC(2195) Weak: 220598 encountered, 107228 discovered, 677 enqueued
[2020-11-19T13:00:00.497+0800] GC(3395) Weak: 222536 encountered, 82199 discovered, 1713 enqueued
[2020-11-19T14:00:00.647+0800] GC(4613) Weak: 443946 encountered, 291651 discovered, 292 enqueued
[2020-11-19T15:00:01.173+0800] GC(5819) Weak: 338065 encountered, 124351 discovered, 815 enqueued
[2020-11-19T16:00:01.283+0800] GC(7022) Weak: 459070 encountered, 298932 discovered, 353 enqueued
[2020-11-19T17:00:01.426+0800] GC(8222) Weak: 688162 encountered, 519369 discovered, 4648 enqueued
[2020-11-19T16:00:01.283+0800] GC(7022) Weak: 459070 encountered, 298932 discovered, 353 enqueued
[2020-11-19T17:00:01.426+0800] GC(8222) Weak: 688162 encountered, 519369 discovered, 4648 enqueued
[2020-11-19T18:00:01.556+0800] GC(9430) Weak: 1078757 encountered, 928748 discovered, 1691 enqueued
[2020-11-19T18:18:43.595+0800] GC(9821) Weak: 1022080 encountered, 841168 discovered, 247352 enqueued
[2020-11-19T18:18:46.592+0800] GC(9822) Weak: 774253 encountered, 568564 discovered, 3938 enqueued
[2020-11-19T18:40:49.616+0800] GC(10265) Weak: 842081 encountered, 788825 discovered, 767288 enqueued
[2020-11-19T18:40:52.593+0800] GC(10266) Weak: 74876 encountered, 18186 discovered, 1 enqueued
现代 GC 会延迟回收弱可达对象。
由 System.gc()
触发的集合始终处理弱可达对象,并且默认情况下是并发的,因此您可以实现调用该方法的周期性任务。
这个问题终于变成了jdk issue, which supposed to be resolved in jdk16, and could be bypassed by making a thread pool which will terminate old thread and generate new thread periodically. I moved my application to jetty for some reason, the modified jetty thread pool is here。现在它可以完美运行几天,如果有人遇到同样的问题,请采纳它。
我正在尝试在产品环境中使用 zgc,所以我将 jdk 从 jdk8 更新为 openjdk 15,将 tomcat 8 更新为 tomcat 8.5,和 gc 相关的选项,但是在 jvm 启动几个小时后 cpu 使用率达到 1000+%(正常 cpu 使用率应该是 100-300%)。有时,当cpu为高时,日志文件中有很多ICBufferFull安全点:
[2020-11-12T19:00:43.669+0800] Safepoint "ICBufferFull", Time since last: 41374119 ns, Reaching safepoint: 2026134 ns, At safepoint: 85993 ns, Total: 2112127 ns
[2020-11-12T19:00:43.672+0800] Safepoint "ICBufferFull", Time since last: 2521598 ns, Reaching safepoint: 1109894 ns, At safepoint: 57235 ns, Total: 1167129 ns
[2020-11-12T19:00:43.676+0800] Safepoint "ICBufferFull", Time since last: 2892867 ns, Reaching safepoint: 1240834 ns, At safepoint: 59633 ns, Total: 1300467 ns
[2020-11-12T19:00:43.681+0800] Safepoint "ICBufferFull", Time since last: 2870110 ns, Reaching safepoint: 1425837 ns, At safepoint: 54052 ns, Total: 1479889 ns
如果让节点下线30s左右,cpu掉线,然后再上线,会正常工作几个小时,直到cpu再次变高。 我检查了源代码,ICBufferFull 的意思是“内联缓存缓冲区已满”,但我找不到增加它的选项,有人可以帮忙吗?谢谢!
gc 选项是:
export JAVA_OPTS='-Xms10g -Xmx10g -XX:+UseLargePages -XX:ZAllocationSpikeTolerance=5 -XX:ParallelGCThreads=8 -XX:ConcGCThreads=4 -Xss2m -XX:+UseZGC -Xlog:gc,gc+phases,safepoint:file=/logs/gc.log:t:filecount=10,filesize=10m -XX:+HeapDumpOnOutOfMemoryError'
编辑1:
我有另一个主机在jdk8中仍然运行良好运行,以cms为参考,对2个主机的请求几乎相同。
我用async-profile分析了一下,最热的方法是java/lang/ThreadLocal$ThreadLocalMap.getEntryAfterMiss
,出现了50+%,最热的native方法是ZMark::try_mark_object(ZMarkCache*, unsigned long, bool)
,只出现了0.41%。我查看了jdk8和openjdk15中thread local相关的源码,好像没变。
编辑2:
我在JIRA上发现了一个类似的bug,我的应用也是lucene相关的,从gc日志来看,弱引用计数在1m+时,使用率很高。所以问题是如何在 zgc 中更积极地收集弱引用?
edit3:
从以下每 3 秒调用一次 System.gc() 的日志来看,我的应用程序似乎产生了过多的弱引用。但奇怪的是,生产的数量在开始后一直在增加。从 11:00-17:00 开始,请求几乎保持不变。 请注意,cpu 在 GC(9821) 后自动从 600% 下降到 400%,排队的是 ~250K。 GC(10265) 节点脱机,入队约为 770K。 为什么enqueued count 长时间变小,是不是表示对象没有完全回收?
[2020-11-19T11:00:00.245+0800] GC(992) Weak: 155658 encountered, 72334 discovered, 0 enqueued
[2020-11-19T12:00:00.397+0800] GC(2194) Weak: 220462 encountered, 122216 discovered, 1380 enqueued
[2020-11-19T12:00:03.411+0800] GC(2195) Weak: 220598 encountered, 107228 discovered, 677 enqueued
[2020-11-19T13:00:00.497+0800] GC(3395) Weak: 222536 encountered, 82199 discovered, 1713 enqueued
[2020-11-19T14:00:00.647+0800] GC(4613) Weak: 443946 encountered, 291651 discovered, 292 enqueued
[2020-11-19T15:00:01.173+0800] GC(5819) Weak: 338065 encountered, 124351 discovered, 815 enqueued
[2020-11-19T16:00:01.283+0800] GC(7022) Weak: 459070 encountered, 298932 discovered, 353 enqueued
[2020-11-19T17:00:01.426+0800] GC(8222) Weak: 688162 encountered, 519369 discovered, 4648 enqueued
[2020-11-19T16:00:01.283+0800] GC(7022) Weak: 459070 encountered, 298932 discovered, 353 enqueued
[2020-11-19T17:00:01.426+0800] GC(8222) Weak: 688162 encountered, 519369 discovered, 4648 enqueued
[2020-11-19T18:00:01.556+0800] GC(9430) Weak: 1078757 encountered, 928748 discovered, 1691 enqueued
[2020-11-19T18:18:43.595+0800] GC(9821) Weak: 1022080 encountered, 841168 discovered, 247352 enqueued
[2020-11-19T18:18:46.592+0800] GC(9822) Weak: 774253 encountered, 568564 discovered, 3938 enqueued
[2020-11-19T18:40:49.616+0800] GC(10265) Weak: 842081 encountered, 788825 discovered, 767288 enqueued
[2020-11-19T18:40:52.593+0800] GC(10266) Weak: 74876 encountered, 18186 discovered, 1 enqueued
现代 GC 会延迟回收弱可达对象。
由 System.gc()
触发的集合始终处理弱可达对象,并且默认情况下是并发的,因此您可以实现调用该方法的周期性任务。
这个问题终于变成了jdk issue, which supposed to be resolved in jdk16, and could be bypassed by making a thread pool which will terminate old thread and generate new thread periodically. I moved my application to jetty for some reason, the modified jetty thread pool is here。现在它可以完美运行几天,如果有人遇到同样的问题,请采纳它。