Java 的 JIT 编译器如何使坏代码比好代码更快？

Question

背景

这是一项从数据结构和算法讲座开始的研究。很抱歉背景很长，但是理解问题还是很有必要的。

分区算法 Lomuto and Hoare 进行了基准测试，以查看它们的运行ning 时间是否符合理论预测。分析表明，他们进行了如下交换次数，根据输入数组的长度n，大约为：

洛穆托：n/2
霍尔：n/4；如果大小数在数组中均匀分布
霍尔：n/6；如果数组中大小数分布不均

交换是这些算法中最昂贵的操作，因此可以很好地估计它们在运行宁时间方面的比较。这意味着如果数组是随机的，大小数均等分布，并且 Hoare 的运行ning 时间为 20 micros，Lomuto 应该大约需要 40 micros，慢 2 倍.

微基准：

public static double averageRuntimeTests(Function<int[], Integer> algorithm, int size,
                                         int warmupIterations, int measureIterations, int repeatedExecs) {
    long start, end;
    int result = -1;
    double[] runningTimes = new double[measureIterations];

    // Warmup
    for (int i = 0; i < warmupIterations; i++) {
        int[] A = randomArray(size);
        for (int j = 0; j < repeatedExecs; j++)
            result = algorithm.apply(A);
        System.out.print(result);
    }

    // Measure
    for (int i = 0; i < measureIterations; i++) {
        int[] A = randomArray(size);
        start = System.nanoTime();
        for (int j = 0; j < repeatedExecs; j++)
            result = algorithm.apply(A);
        end = System.nanoTime();
        System.out.print(result);
        runningTimes[i] = (end - start) / (repeatedExecs * 1000.0);
    }

    return average(runningTimes);
}

如果算法在同一个输入数组上运行足够多次，使得 JIT 编译器“有足够的时间”优化代码，则微基准测试符合理论。看下面的代码，其中算法是运行 30次每个不同的数组。

>>> 案例 1

public static void main(String[] args) {
    int size = 10_000, warmupIterations = 10_000, measureIterations = 2_000, repeatedExecs = 30;

    System.out.printf("input size = %d%n", size);
    System.out.printf("#warmup iterations = %d%n", warmupIterations);
    System.out.printf("#measure iterations = %d%n", measureIterations);
    System.out.printf("#repeated executions = %d%n", repeatedExecs);

    double timeHoare = averageRuntimeTests(Partitioning::partitionHoare, size, warmupIterations,
            measureIterations, repeatedExecs);
    double timeLomuto = averageRuntimeTests(Partitioning::partitionLomuto, size, warmupIterations,
            measureIterations, repeatedExecs);

    System.out.printf("%nHoare: %f us/op%n", timeHoare);
    System.out.printf("Lomuto: %f us/op%n", timeLomuto);
}

结果：

洛穆托：7.94 us/op
霍尔：3.7685 us/op

Lomuto 比 Hoare 慢大约 2 倍，正如均匀分布阵列所预期的那样。

Lomuto 在 Hoare 之前运行时的结果：

洛穆托：13.513 us/op
霍尔：3.5865 us/op

Lomuto 比 Hoare 慢 4 倍，超出了应有的水平。出于某种原因，如果运行比 Hoare 早

，则花费的时间几乎是原来的两倍。

问题

但是，如果算法运行对每个不同的输入数组仅一次，则 JIT 编译器会出现意外行为。

>>> 案例 2

public static void main(String[] args) {
    int size = 10_000, warmupIterations = 300_000, measureIterations = 60_000, repeatedExecs = 1;

    System.out.printf("input size = %d%n", size);
    System.out.printf("#warmup iterations = %d%n", warmupIterations);
    System.out.printf("#measure iterations = %d%n", measureIterations);
    System.out.printf("#repeated executions = %d%n", repeatedExecs);

    double timeHoare = averageRuntimeTests(Partitioning::partitionHoare, size, warmupIterations,
            measureIterations, repeatedExecs);
    double timeLomuto = averageRuntimeTests(Partitioning::partitionLomuto, size, warmupIterations,
            measureIterations, repeatedExecs);

    System.out.printf("%nHoare: %f us/op%n", timeHoare);
    System.out.printf("Lomuto: %f us/op%n", timeLomuto);
}

结果（是否在 Lomuto 之前 Hoare 运行s）：

洛穆托：26.676133 us/op
霍尔：31.8233 us/op

洛穆托比霍尔还要快，真是令人震惊！ JIT 编译器在这里做什么？

我一直在说 JIT 编译器是罪魁祸首，因为如果我完全禁用它并且运行在仅解释器模式下（使用 -Djava.compiler=NONE 标志），基准测试又是预期的。运行算法一次每个不同的数组...

>>>案例3

结果（是否在 Lomuto 之前 Hoare 运行s）：

洛穆托：597.76 us/op
霍尔：254.0455 us/op

如您所见，Lomuto 再次像预期的那样慢了大约 2 倍。

有人可以解释一下案例 2 中 JIT 编译器的情况吗？看起来 JIT 编译器只是部分优化了代码。但是，为什么 Lomuto 和 Hoare 一样快？霍尔不应该更快吗？

请注意：

我知道 Java 中有 JMH 库可以可靠地运行微基准测试。我只是想了解 JIT 编译器的底层机制。
micros 是 microseconds 的 shorthand 并且与 us.

Answer 1

How does Java's JIT compiler run bad code faster than good code?

JIT 编译器不运行编码。它编译代码。
您所看到的不是错误代码的（有效）证据运行比好的代码更快。

I know that there is the JMH library to reliably run microbenchmarks in Java. I am just trying to understand the underlying mechanics of the JIT compiler.

JIT 编译器的作用可能并不重要。可能是当它这样做时导致了问题。

JIT 编译器使用 CPU 时间进行优化。如果您以天真的方式实施微基准测试，那么部分或全部 JIT 编译器时间将包含在基准测试的一次（或多次）迭代中。这将使基准迭代（或迭代）看起来比更早或更晚的迭代花费更长的时间。这扭曲了明显的执行时间。

考虑一下，您的 Java 基准运行明显处于三个或更多 ¹ 速度：

它开始运行很慢，因为 JVM 正在解释字节码和收集统计信息。
一旦收集到足够的统计信息，JVM（显然）运行实际上缓慢，而 JIT 编译器正在编译和优化字节码。
JIT 编译器完成其工作后，JVM 开始执行已编译的本机代码，并且代码运行速度更快。

如果您使用 jmh（或类似的），它应该补偿 JIT 编译和其他异常的影响。如果您需要信息，请阅读 How do I write a correct micro-benchmark in Java? ... 这有助于解释常见的陷阱。

^{1 - 我指的是使用系统时钟测量前后获得的表观速度。}

Java 的 JIT 编译器如何使坏代码比好代码更快？

How does Java's JIT compiler make bad code faster than good code?

java

algorithm

performance

jit

microbenchmark

背景

>>> 案例 1

问题

>>> 案例 2

>>>案例3

请注意：