Understanding loop performance in the JVM

I'm playing with JMH, and in the sample about looping they say:

You might notice the larger the repetitions count, the lower the "perceived" cost of the operation being measured. Up to the point we do each addition with 1/20 ns, well beyond what hardware can actually do. This happens because the loop is heavily unrolled/pipelined, and the operation to be measured is hoisted from the loop. Morale: don't overuse loops, rely on JMH to get the measurement right.

I tried it myself:

    @Benchmark
    @OperationsPerInvocation(1)
    public int measurewrong_1() {
        return reps(1);
    }      

    @Benchmark
    @OperationsPerInvocation(1000)
    public int measurewrong_1000() {
        return reps(1000);
    }      
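
The `reps` helper is not shown above. A minimal plain-Java sketch of it, assuming it is the same loop that is quoted later in this thread; the field values `x = 1` and `y = 2` and the class name `LoopsSketch` are assumptions made for illustration, and the JMH annotations are left out so the snippet runs on its own:

```java
// Hypothetical stand-in for the benchmark class; not the original source.
final class LoopsSketch {
    // Non-final instance fields (assumed values), read at run time.
    int x = 1;
    int y = 2;

    // Adds (x + y) to s, `reps` times, and returns the total.
    int reps(int reps) {
        int s = 0;
        for (int i = 0; i < reps; i++) {
            s += (x + y);
        }
        return s;
    }

    public static void main(String[] args) {
        LoopsSketch b = new LoopsSketch();
        System.out.println(b.reps(1));    // prints 3
        System.out.println(b.reps(1000)); // prints 3000
    }
}
```

JMH divides the measured time of one invocation by the `@OperationsPerInvocation` value, so the score for `measurewrong_1000` is the time of the whole 1000-iteration call divided by 1000.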

and got the following results:

Benchmark                      Mode  Cnt  Score    Error  Units
MyBenchmark.measurewrong_1     avgt   15  2.425 ±  0.137  ns/op
MyBenchmark.measurewrong_1000  avgt   15  0.036 ±  0.001  ns/op

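This does show that MyBenchmark.measurewrong_1000 is dramatically faster than MyBenchmark.measurewrong_1. But I can't really understand what optimization the JVM performs to get this speedup.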

What does it mean for a loop to be unrolled/pipelined?

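Loop pipelining = software pipelining.

Basically, it is a technique for making sequential loop iterations more efficient by executing some of the instructions in the loop body in parallel.

Of course, this can only be done when certain conditions are met, for example when each iteration does not depend on another.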
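
To make the independence condition concrete, here is a small sketch that is not taken from the JMH sample; the class and method names are hypothetical. In `sumChained` every addition needs the result of the previous one, while in `sumSplit` the additions into `a` and `b` do not depend on each other, which is the kind of independence that lets work be overlapped:

```java
// Hypothetical illustration of dependent vs. independent loop work.
final class IndependentIterations {
    // One accumulator: each addition depends on the previous value of sum.
    static long sumChained(int[] data) {
        long sum = 0;
        for (int i = 0; i < data.length; i++) {
            sum += data[i];
        }
        return sum;
    }

    // Two accumulators: the updates of a and b are independent of each other.
    static long sumSplit(int[] data) {
        long a = 0;
        long b = 0;
        int i = 0;
        for (; i + 1 < data.length; i += 2) {
            a += data[i];
            b += data[i + 1];
        }
        if (i < data.length) {
            a += data[i]; // leftover element when the length is odd
        }
        return a + b;
    }

    public static void main(String[] args) {
        int[] data = {1, 2, 3, 4, 5, 6, 7};
        System.out.println(sumChained(data)); // prints 28
        System.out.println(sumSplit(data));   // prints 28
    }
}
```

Both methods return the same sum; whether the second form is actually faster depends on the JIT compiler and the CPU, so treat it as an illustration of the dependency structure only.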

From insidehpc.com:

Software pipelining, which really has nothing to do with hardware pipelining, is a loop optimization technique to make statements within an iteration independent of each other. The goal is to remove dependencies so that seemingly sequential instructions may be executed in parallel.

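Loop unrolling is a technique that flattens several loop iterations into one by repeating the loop body.
For example, the loop in the given example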

    for (int i = 0; i < reps; i++) {
        s += (x + y);
    }

can be unrolled by the JIT compiler into

    for (int i = 0; i < reps - 15; i += 16) {
        s += (x + y);
        s += (x + y);
        // ... 16 times ...
        s += (x + y);
    }

The unrolled loop body can then be optimized further into

    for (int i = 0; i < reps - 15; i += 16) {
        s += 16 * (x + y);
    }

Obviously, computing 16 * (x + y) once is much faster than computing (x + y) 16 times.
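
A minimal sketch to check that the collapsed form computes the same value as the plain loop. The class and method names are hypothetical, and the second loop in `collapsed` is an assumed remainder loop for the iterations left over when `reps` is not a multiple of 16; the snippets above only show the main unrolled loop:

```java
// Hypothetical check that the hand-unrolled, collapsed loop matches the plain loop.
final class UnrollEquivalence {
    // The original loop: one addition per iteration.
    static int plain(int x, int y, int reps) {
        int s = 0;
        for (int i = 0; i < reps; i++) {
            s += (x + y);
        }
        return s;
    }

    // Main loop handles 16 iterations per step; remainder loop handles the rest.
    static int collapsed(int x, int y, int reps) {
        int s = 0;
        int i = 0;
        for (; i < reps - 15; i += 16) {
            s += 16 * (x + y);
        }
        for (; i < reps; i++) {
            s += (x + y);
        }
        return s;
    }

    public static void main(String[] args) {
        System.out.println(plain(1, 2, 1000));     // prints 3000
        System.out.println(collapsed(1, 2, 1000)); // prints 3000
    }
}
```

With `reps = 1000` the main loop runs 62 times (992 iterations' worth of work) and the remainder loop covers the last 8.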

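Loop unrolling is also what makes pipelining possible: a pipelined CPU (for example, a classic RISC design) can overlap the execution of the unrolled instructions, as long as they do not depend on each other.

So if your CPU could run 5 pipelines in parallel, your loop could be unrolled like this: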

    // pseudo code; assumes length is a multiple of pipelines
    int pipelines = 5;
    for (int i = 0; i < length; i += pipelines) {
        s += (x + y);
        s += (x + y);
        s += (x + y);
        s += (x + y);
        s += (x + y);
    }

Pipeline stage abbreviations: IF = Instruction Fetch, ID = Instruction Decode, EX = Execute, MEM = Memory Access, WB = Register Write Back

From an Oracle white paper:

... a standard compiler optimization that enables faster loop execution. Loop unrolling increases the loop body size while simultaneously decreasing the number of iterations. Loop unrolling also increases the effectiveness of other optimizations.

More information about pipelining: Classic RISC pipeline