Java Math.abs(int) optimizations: why is this code 6 times slower?
As you know, Math.abs(Integer.MIN_VALUE) == Integer.MIN_VALUE.
To prevent negative values, I implemented a safeAbs method in my project:
public static int safeAbs(int i) {
    i = Math.abs(i);
    return i < 0 ? 0 : i;
}
I compared its performance with the following variant:
public static int safeAbs(int i) {
    return i == Integer.MIN_VALUE ? 0 : Math.abs(i);
}
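To make the overflow case concrete, here is a minimal sketch that runs both variants over a few edge values. The class name SafeAbsDemo is only illustrative; the method names safeAbsSlow and safeAbsFast are the ones used in the benchmark in this post.

```java
public class SafeAbsDemo {
    // Variant 1: take the absolute value first, then clamp a negative result to 0.
    static int safeAbsSlow(int i) {
        i = Math.abs(i);
        return i < 0 ? 0 : i;
    }

    // Variant 2: test for MIN_VALUE up front, then call Math.abs.
    static int safeAbsFast(int i) {
        return i == Integer.MIN_VALUE ? 0 : Math.abs(i);
    }

    public static void main(String[] args) {
        // -MIN_VALUE is not representable as an int, so Math.abs returns MIN_VALUE unchanged.
        System.out.println(Math.abs(Integer.MIN_VALUE) == Integer.MIN_VALUE);

        int[] samples = {Integer.MIN_VALUE, Integer.MIN_VALUE + 1, -1, 0, 1, Integer.MAX_VALUE};
        for (int v : samples) {
            // Both variants should print the same value for every input.
            System.out.println(v + " -> " + safeAbsSlow(v) + " / " + safeAbsFast(v));
        }
    }
}
```

Both variants return the same result for every int, so any difference between them is purely a matter of how the JIT compiles them.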
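The first one is almost 6 times slower than the second (the second performs almost the same as "pure" Math.abs(int)). From my point of view there is no significant difference in the bytecode, but I guess the difference is in the JIT "assembly" code:
"slow" version: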
0x00007f0149119720: mov %eax,0xfffffffffffec000(%rsp)
0x00007f0149119727: push %rbp
0x00007f0149119728: sub $0x20,%rsp
0x00007f014911972c: test %esi,%esi
0x00007f014911972e: jl 0x7f0149119734
0x00007f0149119730: mov %esi,%eax
0x00007f0149119732: jmp 0x7f014911973c
0x00007f0149119734: neg %esi
0x00007f0149119736: test %esi,%esi
0x00007f0149119738: jl 0x7f0149119748
0x00007f014911973a: mov %esi,%eax
0x00007f014911973c: add $0x20,%rsp
0x00007f0149119740: pop %rbp
0x00007f0149119741: test %eax,0x1772e8b9(%rip) ; {poll_return}
0x00007f0149119747: retq
0x00007f0149119748: mov %esi,(%rsp)
0x00007f014911974b: mov $0xffffff65,%esi
0x00007f0149119750: nop
0x00007f0149119753: callq 0x7f01490051a0 ; OopMap{off=56}
;*ifge
; - math.FastAbs::safeAbsSlow@6 (line 16)
; {runtime_call}
0x00007f0149119758: callq 0x7f015f521d20 ; {runtime_call}
"normal" version:
# {method} {0x00007f31acf28cd8} 'safeAbsFast' '(I)I' in 'math/FastAbs'
# parm0: rsi = int
# [sp+0x30] (sp of caller)
0x00007f31b08c7360: mov %eax,0xfffffffffffec000(%rsp)
0x00007f31b08c7367: push %rbp
0x00007f31b08c7368: sub $0x20,%rsp
0x00007f31b08c736c: cmp $0x80000000,%esi
0x00007f31b08c7372: je 0x7f31b08c738e
0x00007f31b08c7374: mov %esi,%r10d
0x00007f31b08c7377: neg %r10d
0x00007f31b08c737a: test %esi,%esi
0x00007f31b08c737c: mov %esi,%eax
0x00007f31b08c737e: cmovl %r10d,%eax
0x00007f31b08c7382: add $0x20,%rsp
0x00007f31b08c7386: pop %rbp
0x00007f31b08c7387: test %eax,0x162c2c73(%rip) ; {poll_return}
0x00007f31b08c738d: retq
0x00007f31b08c738e: mov %esi,(%rsp)
0x00007f31b08c7391: mov $0xffffff65,%esi
0x00007f31b08c7396: nop
0x00007f31b08c7397: callq 0x7f31b07b11a0 ; OopMap{off=60}
;*if_icmpne
; - math.FastAbs::safeAbsFast@3 (line 17)
; {runtime_call}
0x00007f31b08c739c: callq 0x7f31c5863d20 ; {runtime_call}
Benchmark code:
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

@BenchmarkMode(Mode.AverageTime)
@Fork(value = 1, jvmArgsAppend = {"-Xms3g", "-Xmx3g", "-server"})
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Benchmark)
@Threads(1)
@Warmup(iterations = 10)
@Measurement(iterations = 10)
public class SafeAbsMicroBench {

    @State(Scope.Benchmark)
    public static class Data {
        final int len = 10_000_000;
        final int[] values = new int[len];

        @Setup(Level.Trial)
        public void setup() {
            // preparing 10 million random integers without MIN_VALUE
            for (int i = 0; i < len; i++) {
                int val;
                do {
                    val = ThreadLocalRandom.current().nextInt();
                } while (val == Integer.MIN_VALUE);
                values[i] = val;
            }
        }
    }

    @Benchmark
    public int safeAbsSlow(Data data) {
        int sum = 0;
        for (int i = 0; i < data.len; i++)
            sum += safeAbsSlow(data.values[i]);
        return sum;
    }

    @Benchmark
    public int safeAbsFast(Data data) {
        int sum = 0;
        for (int i = 0; i < data.len; i++)
            sum += safeAbsFast(data.values[i]);
        return sum;
    }

    private int safeAbsSlow(int i) {
        i = Math.abs(i);
        return i < 0 ? 0 : i;
    }

    private int safeAbsFast(int i) {
        return i == Integer.MIN_VALUE ? 0 : Math.abs(i);
    }

    public static void main(String[] args) throws RunnerException {
        final Options options = new OptionsBuilder()
                .include(SafeAbsMicroBench.class.getSimpleName())
                .build();
        new Runner(options).run();
    }
}
Results (Linux x86-64, 7820HQ; checked on Oracle JDK 8 and 11 with very similar results):
Benchmark Mode Cnt Score Error Units
SafeAbsMicroBench.safeAbsFast avgt 10 6435155.516 ± 47130.767 ns/op
SafeAbsMicroBench.safeAbsSlow avgt 10 35646411.744 ± 776173.621 ns/op
Can someone explain why the first version is so much slower than the second?
The native code generated for the safeAbsSlow and safeAbsFast methods differs.
safeAbsSlow (C2, tier 4):
0x0000023d12ec4b14: add eax,ecx
0x0000023d12ec4b16: inc ebx
0x0000023d12ec4b18: cmp ebx,989680h
0x0000023d12ec4b1e: jnl 23d12ec4b4eh ; jump if `ebx` was not less than `10_000_000`
0x0000023d12ec4b20: mov ecx,dword ptr [r9+rbx*4+10h]
0x0000023d12ec4b25: test ecx,ecx
0x0000023d12ec4b27: jnl 23d12ec4b14h ; jump if `ecx` was not less-than `0`
0x0000023d12ec4b29: neg ecx
0x0000023d12ec4b2b: test ecx,ecx
0x0000023d12ec4b2d: jnl 23d12ec4b14h ; jump if `ecx` was not less-than `0`
safeAbsFast (C2, tier 4):
0x000001d89e8a4b20: mov ecx,dword ptr [r9+rdi*4+10h]
0x000001d89e8a4b25: cmp ecx,80000000h
0x000001d89e8a4b2b: je 1d89e8a4b66h ; jump if `ecx` was equal to `0x80000000` (`Integer.MIN_VALUE`)
0x000001d89e8a4b2d: mov r11d,ecx
0x000001d89e8a4b30: neg r11d
0x000001d89e8a4b33: test ecx,ecx
0x000001d89e8a4b35: cmovl ecx,r11d
0x000001d89e8a4b39: add eax,ecx
0x000001d89e8a4b3b: inc edi
0x000001d89e8a4b3d: cmp edi,989680h
0x000001d89e8a4b43: jl 1d89e8a4b20h ; jump if `edi` was less than `10_000_000`
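As we can see from the above, safeAbsSlow has more conditional jumps than safeAbsFast.
This is in particular because the Math.abs implementation inlined into safeAbsFast has no conditional jump; it uses a conditional move (cmovl) instead: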
0x000001d89e8a4b2d: mov r11d,ecx
0x000001d89e8a4b30: neg r11d
0x000001d89e8a4b33: test ecx,ecx
0x000001d89e8a4b35: cmovl ecx,r11d
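As a source-level illustration of what "computing the absolute value without a jump" means, here is a minimal sketch using a sign mask. This is not what HotSpot emits (the listing above uses cmovl), and the class name BranchlessAbs and method name branchlessAbs are hypothetical; it only shows that the result can be produced with straight-line arithmetic.

```java
public class BranchlessAbs {
    // Hypothetical helper: absolute value with no conditional in the source.
    static int branchlessAbs(int i) {
        int mask = i >> 31;        // arithmetic shift: -1 for negative i, 0 otherwise
        return (i ^ mask) - mask;  // negative: (~i) + 1 == -i; non-negative: i unchanged
    }

    public static void main(String[] args) {
        int[] samples = {Integer.MIN_VALUE, -123456, -1, 0, 1, 123456, Integer.MAX_VALUE};
        for (int v : samples) {
            // Matches Math.abs for every input, including the MIN_VALUE overflow case.
            System.out.println(v + " -> " + branchlessAbs(v) + " (Math.abs: " + Math.abs(v) + ")");
        }
    }
}
```

Code of this shape costs the same no matter how the signs are distributed in the input, which is the property the cmovl sequence gives safeAbsFast.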
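Therefore, when the dataset has positive and negative values scattered throughout the array, the slow version incurs far more branch misses than the normal version. Below are the corresponding statistics collected with the Linux perf profiler: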
Benchmark Mode Cnt Score Error Units
safeAbsFast avgt 10 9611659.726 ± 1429082.431 ns/op
safeAbsFast:branch-misses avgt 2869.853 #/op
safeAbsFast:branches avgt 12492918.020 #/op
safeAbsFast:cycles avgt 28212203.936 #/op
safeAbsFast:instructions avgt 92352048.153 #/op
safeAbsSlow avgt 10 44524180.366 ± 6324887.086 ns/op
safeAbsSlow:branch-misses avgt 5006493.144 #/op
safeAbsSlow:branches avgt 17496069.911 #/op
safeAbsSlow:cycles avgt 126413171.674 #/op
safeAbsSlow:instructions avgt 67549877.558 #/op
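The counters above are easier to compare as ratios. The following sketch only does arithmetic on the numbers in the table (the class and method names are illustrative): it derives the branch-miss rate and the instructions per cycle (IPC) for each benchmark.

```java
public class PerfRatios {
    // Plain ratio helper, used for both miss rate and IPC.
    static double ratio(double numerator, double denominator) {
        return numerator / denominator;
    }

    public static void main(String[] args) {
        // Values copied from the perf table above (random dataset).
        double fastMisses = 2869.853, fastBranches = 12492918.020;
        double fastCycles = 28212203.936, fastInstr = 92352048.153;
        double slowMisses = 5006493.144, slowBranches = 17496069.911;
        double slowCycles = 126413171.674, slowInstr = 67549877.558;

        System.out.println(String.format("fast: miss rate %.4f%%, IPC %.2f",
                100 * ratio(fastMisses, fastBranches), ratio(fastInstr, fastCycles)));
        System.out.println(String.format("slow: miss rate %.4f%%, IPC %.2f",
                100 * ratio(slowMisses, slowBranches), ratio(slowInstr, slowCycles)));
    }
}
```

With 10 million elements per operation, about 5 million misses is roughly one misprediction for every two elements, which is what one would expect if the sign test on random data is essentially a coin flip. The slow version executes fewer instructions but needs far more cycles.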
For comparison, here are the results for a sorted dataset:
Benchmark Mode Cnt Score Error Units
safeAbsFast avgt 10 9026800.584 ± 528992.157 ns/op
safeAbsFast:branch-misses avgt 2785.463 #/op
safeAbsFast:branches avgt 12474751.905 #/op
safeAbsFast:cycles avgt 27379727.603 #/op
safeAbsFast:instructions avgt 92418075.715 #/op
safeAbsSlow avgt 10 6981828.374 ± 2375480.834 ns/op
safeAbsSlow:branch-misses avgt 2801.022 #/op
safeAbsSlow:branches avgt 17496585.992 #/op
safeAbsSlow:cycles avgt 19478382.113 #/op
safeAbsSlow:instructions avgt 67589946.278 #/op
The previously slow version becomes faster when the dataset is sorted (in this case the costly branch misses are minimized).
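The setup code for the sorted run is not shown here, so the following is only a sketch of one way to prepare such a dataset, assuming the array was simply passed through Arrays.sort after generation. The class SortedData and the helper signFlips are hypothetical; signFlips counts how often the sign changes between neighbouring elements, which is a rough proxy for how hard the sign branch is to predict.

```java
import java.util.Arrays;
import java.util.concurrent.ThreadLocalRandom;

public class SortedData {
    // Same generation scheme as the benchmark's setup(): random ints without MIN_VALUE.
    static int[] randomValues(int len) {
        int[] values = new int[len];
        for (int i = 0; i < len; i++) {
            int val;
            do {
                val = ThreadLocalRandom.current().nextInt();
            } while (val == Integer.MIN_VALUE);
            values[i] = val;
        }
        return values;
    }

    // Hypothetical helper: number of adjacent pairs where one value is negative and the other is not.
    static int signFlips(int[] values) {
        int flips = 0;
        for (int i = 1; i < values.length; i++) {
            if ((values[i] < 0) != (values[i - 1] < 0)) flips++;
        }
        return flips;
    }

    public static void main(String[] args) {
        int[] values = randomValues(1_000_000);
        System.out.println("sign flips, random order: " + signFlips(values));
        Arrays.sort(values); // all negatives first, then all non-negatives
        System.out.println("sign flips, sorted:       " + signFlips(values));
    }
}
```

After sorting there is at most one sign change in the whole array, so the branch predictor sees a long run of negatives followed by a long run of non-negatives.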
Environment:
openjdk version "12-internal" 2019-03-19
OpenJDK Runtime Environment (slowdebug build 12-internal+0-adhoc.jdk12)
OpenJDK 64-Bit Server VM (slowdebug build 12-internal+0-adhoc.jdk12, mixed mode)