Java lock-free performance JMH
I have a JMH multithreaded test:
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.LockSupport;
import org.openjdk.jmh.annotations.*;
import sun.misc.Unsafe;

@State(Scope.Benchmark)
@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@Fork(value = 1, jvmArgsAppend = { "-Xmx512m", "-server", "-XX:+AggressiveOpts", "-XX:+UnlockDiagnosticVMOptions",
        "-XX:+UnlockExperimentalVMOptions", "-XX:+PrintAssembly", "-XX:PrintAssemblyOptions=intel",
        "-XX:+PrintSignatureHandlers" })
@Measurement(iterations = 5, time = 5, timeUnit = TimeUnit.SECONDS)
@Warmup(iterations = 3, time = 2, timeUnit = TimeUnit.SECONDS)
public class LinkedQueueBenchmark {
    // UnsafeProvider is my own helper class that returns the sun.misc.Unsafe instance.
    private static final Unsafe unsafe = UnsafeProvider.getUnsafe();
    private static final long offsetObject;
    private static final long offsetNext;
    private static final int THREADS = 5;

    private static class Node {
        private volatile Node next;
        public Node() {}
    }

    static {
        try {
            offsetObject = unsafe.objectFieldOffset(LinkedQueueBenchmark.class.getDeclaredField("object"));
            offsetNext = unsafe.objectFieldOffset(Node.class.getDeclaredField("next"));
        } catch (Exception ex) { throw new Error(ex); }
    }

    protected long t0, t1, t2, t3, t4, t5, t6, t7;
    // The most recently published node; every benchmark thread CASes on this field.
    private volatile Node object = new Node();

    @Threads(THREADS)
    @Benchmark
    public Node doTestCasSmart() {
        Node current, o = new Node();
        for (;;) {
            current = this.object;
            if (unsafe.compareAndSwapObject(this, offsetObject, current, o)) {
                // current.next = o; // Special line
                break;
            } else {
                LockSupport.parkNanos(1);
            }
        }
        return current;
    }
}
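- In the current variant, I get ~55 ops/us.
- But if I uncomment the "Special line", or replace it with unsafe.putOrderedObject (in either direction: current.next = o or o.next = current), performance drops to ~2 ops/us.
As far as I understand, this has something to do with the CPU caches, and perhaps it is flushing the store buffer. If I replace it with a lock-based approach, without CAS, performance is 11-20 ops/us.
I tried using LinuxPerfAsmProfiler and PrintAssembly, and in the second case I see: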
....[Hottest Regions]...............................................................................
25.92% 17.93% [0x7f1d5105fe60:0x7f1d5105fe69] in SpinPause (libjvm.so)
17.53% 20.62% [0x7f1d5119dd88:0x7f1d5119de57] in ParMarkBitMap::live_words_in_range(HeapWord*, oopDesc*) const (libjvm.so)
10.81% 6.30% [0x7f1d5129cff5:0x7f1d5129d0ed] in ParallelTaskTerminator::offer_termination(TerminatorTerminator*) (libjvm.so)
7.99% 9.86% [0x7f1d3c51d280:0x7f1d3c51d3a2] in com.jad.generated.LinkedQueueBenchmark_doTestCasSmart::doTestCasSmart_thrpt_jmhStub
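Can someone explain to me what is really happening here? Why is it so slow? Where is the store-load barrier here? Why does putOrdered not help? And how do I fix it?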
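Rule: instead of looking for "advanced" answers, first look for stupid mistakes.
SpinPause, ParMarkBitMap::live_words_in_range(HeapWord*, oopDesc*) and ParallelTaskTerminator::offer_termination(TerminatorTerminator*) come from GC threads. That very likely means most of the work the benchmark does is GC. Indeed, running with the "special line" uncommented and -prof gc yields: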
# Run complete. Total time: 00:00:43
Benchmark Mode Cnt Score Error Units
LQB.doTestCasSmart thrpt 5 5.930 ± 3.867 ops/us
LQB.doTestCasSmart:·gc.time thrpt 5 29970.000 ms
So, out of the 43 seconds of the run, you spent 30 seconds doing GC. Even plain -verbose:gc shows it:
Iteration 3: [Full GC (Ergonomics) 408188K->1542K(454656K), 0.0043022 secs]
[GC (Allocation Failure) 60422K->60174K(454656K), 0.2061024 secs]
[GC (Allocation Failure) 119054K->118830K(454656K), 0.2314572 secs]
[GC (Allocation Failure) 177710K->177430K(454656K), 0.2268396 secs]
[GC (Allocation Failure) 236310K->236054K(454656K), 0.1718049 secs]
[GC (Allocation Failure) 294934K->294566K(454656K), 0.2265855 secs]
[Full GC (Ergonomics) 294566K->147408K(466432K), 0.7139546 secs]
[GC (Allocation Failure) 206288K->205880K(466432K), 0.2065388 secs]
[GC (Allocation Failure) 264760K->264312K(466432K), 0.2314117 secs]
[GC (Allocation Failure) 323192K->323016K(466432K), 0.2183271 secs]
[Full GC (Ergonomics) 323016K->322663K(466432K), 2.8058725 secs]
A 2.8-second full GC, that's awful. About 5 seconds were spent in GC within an iteration that is bounded at 5 seconds of run time. That's awful, too.
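If you want the same kind of sanity check outside of JMH, GC time can also be read from inside the JVM. The following is only a minimal sketch, assuming the standard java.lang.management API; the class name GcTimeProbe and the allocation loop are hypothetical stand-ins, not part of your benchmark:

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

final class GcTimeProbe {
    /** Sums the accumulated collection time (ms) over all collectors of this JVM. */
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime(); // -1 if the collector does not report time
            if (t > 0) {
                total += t;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        long gcBefore = totalGcMillis();
        long wallBefore = System.nanoTime();

        // Hypothetical allocation-heavy workload standing in for the benchmark body.
        Object sink = null;
        for (int i = 0; i < 5_000_000; i++) {
            sink = new byte[64];
        }

        long wallMs = (System.nanoTime() - wallBefore) / 1_000_000;
        long gcMs = totalGcMillis() - gcBefore;
        System.out.println("wall = " + wallMs + " ms, gc = " + gcMs + " ms, sink set = " + (sink != null));
    }
}
```

If the GC figure is a large fraction of the wall-clock figure, the measurement is mostly telling you about the collector, not about the code under test.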
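Why is that? Well, you are building a linked list there. Granted, the head of the queue is unreachable, and everything from the head up to your object should be collected. But collection is not instantaneous. The longer the queue gets, the more memory it consumes and the more work the GC has to do to traverse it. This is a positive feedback loop that cripples the execution. Since the queue elements are collectable anyway, this feedback loop never ends in an OOME. Storing the initial object in a new head field would make the test eventually OOME.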
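To make the reachability argument concrete, here is a small single-threaded sketch. It is not your benchmark: the class ChainReachability and its helper methods are hypothetical, and it only mimics the old-to-new linking that the "special line" performs.

```java
final class ChainReachability {
    static final class Node {
        volatile Node next;
    }

    /** Appends count nodes after tail, linking old -> new, and returns the new tail. */
    static Node append(Node tail, int count) {
        for (int i = 0; i < count; i++) {
            Node n = new Node();
            tail.next = n; // same direction as the "special line": previous node references the new one
            tail = n;
        }
        return tail;
    }

    /** Counts the nodes reachable from start by following next links. */
    static int reachableFrom(Node start) {
        int count = 0;
        for (Node n = start; n != null; n = n.next) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        Node head = new Node();
        Node tail = append(head, 1_000);
        System.out.println("from head: " + reachableFrom(head)); // 1001
        System.out.println("from tail: " + reachableFrom(tail)); // 1
    }
}
```

Holding only the tail, as the object field does, keeps a single node reachable, so the rest of the chain is garbage that the collector still has to deal with. Holding the head keeps every node alive, which is why a head field would eventually produce an OOME.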
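So, frankly, your question has nothing to do with putOrdered, memory barriers, or queue performance. I think you need to reconsider what you are actually testing. Designing tests so that the transient memory footprint stays the same on each @Benchmark call is an art in itself.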
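One possible shape for such a test is a steady-state step, where each invocation adds one element and removes one, so the structure does not grow between calls. The sketch below is an assumption on my part about what you might want to measure. It swaps in java.util.concurrent.ConcurrentLinkedQueue instead of your Unsafe-based queue, and the class name SteadyStateQueueStep is hypothetical. In a JMH harness, step() would be the @Benchmark method body.

```java
import java.util.concurrent.ConcurrentLinkedQueue;

final class SteadyStateQueueStep {
    private final ConcurrentLinkedQueue<Object> queue = new ConcurrentLinkedQueue<>();

    /** One unit of work: enqueue one element, then dequeue one. */
    Object step() {
        queue.offer(new Object());
        return queue.poll(); // returning the result keeps the work from being optimized away
    }

    int size() {
        return queue.size();
    }

    public static void main(String[] args) {
        SteadyStateQueueStep s = new SteadyStateQueueStep();
        for (int i = 0; i < 1_000_000; i++) {
            s.step();
        }
        System.out.println("queue size after 1,000,000 steps: " + s.size()); // 0
    }
}
```

This measures an offer/poll pair, which is a different thing from the bare CAS in your test, and each call still allocates one short-lived object. The point is only that the live data set stays constant, so GC cost does not grow with the length of the run.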