JMH

Question

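Please read the latest edit of this question.

Problem: I need to write a correct benchmark that compares some work run on different thread pool implementations (including ones from external libraries) with different execution methods, against other work run on other thread pool implementations, and against work run without any threading.

For example, I have 24 tasks to complete and 10000 random strings in the benchmark state: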

@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Fork(1)
@BenchmarkMode(Mode.AverageTime)
@Warmup(iterations = 3)
@Measurement(iterations = 3)
@State(Scope.Benchmark)
public class ThreadPoolSamples {

    @Param({"24"})
    int amountOfTasks;

    private static final int tts = Runtime.getRuntime().availableProcessors() * 2;
    private String[] strs = new String[10000];

    @Setup
    public void setup() {
        for (int i = 0; i < strs.length; i++) {
            strs[i] = String.valueOf(Math.random());
        }
    }
}

There are also two states, written as inner classes, that represent the work (string concatenation) and the ExecutorService setup and shutdown:

@State(Scope.Thread)
public static class Work {
    public String doWork(String[] strs) {
        StringBuilder conc = new StringBuilder();
        for (String str : strs) {
            conc.append(str);
        }
        return conc.toString();
    }
}

@State(Scope.Benchmark)
public static class ExecutorServiceState {
    ExecutorService service;

    @Setup(Level.Iteration)
    public void setupMethod() {
        service = Executors.newFixedThreadPool(tts);
    }

    @TearDown(Level.Iteration)
    public void downMethod() {
        service.shutdownNow();
        service = null;
    }
}

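The stricter question is: how do I write a correct benchmark that measures the average time of doWork(), first without any threading, second using the .execute() method, and third using the .submit() method and getting the results from the futures later? The implementation I tried to write: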

@Benchmark
public void noThreading(Work w, Blackhole bh) {
    for (int i = 0; i < amountOfTasks; i++) {
        bh.consume(w.doWork(strs));
    }
}

@Benchmark
public void executorService(ExecutorServiceState e, Work w, Blackhole bh) {
    for (int i = 0; i < amountOfTasks; i++) {
        e.service.execute(() -> bh.consume(w.doWork(strs)));
    }
}

@Benchmark
public void noThreadingResult(Work w, Blackhole bh) {
    String[] strss = new String[amountOfTasks];
    for (int i = 0; i < amountOfTasks; i++) {
        strss[i] = w.doWork(strs);
    }
    bh.consume(strss);
}

@Benchmark
public void executorServiceResult(ExecutorServiceState e, Work w, Blackhole bh)
        throws ExecutionException, InterruptedException {
    Future[] strss = new Future[amountOfTasks];
    for (int i = 0; i < amountOfTasks; i++) {
        strss[i] = e.service.submit(() -> {return w.doWork(strs);});
    }
    for (Future future : strss) {
        bh.consume(future.get());
    }
}

After benchmarking this implementation on my PC (2 cores, 4 threads) I got:

Benchmark                                (amountOfTasks)  Mode  Cnt         Score         Error  Units
ThreadPoolSamples.executorService                     24  avgt    3    255102,966 ± 4460279,056  ns/op
ThreadPoolSamples.executorServiceResult               24  avgt    3  19790020,180 ± 7676762,394  ns/op
ThreadPoolSamples.noThreading                         24  avgt    3  18881360,497 ±  340778,773  ns/op
ThreadPoolSamples.noThreadingResult                   24  avgt    3  19283976,445 ±  471788,642  ns/op

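noThreading and executorService may be correct (but I am still unsure), while noThreadingResult and executorServiceResult do not look correct at all.

Edit:

I found some new details, but I think the result is still incorrect. As user17280749 said in an answer, the thread pool was not waiting for the submitted tasks to complete, but that was not the only problem: javac also somehow optimizes the doWork() method in the Work class (probably the JVM can predict the result of that operation). So for simplicity I used Thread.sleep() as the "work" and gave amountOfTasks two new params, "1" and "128", to demonstrate that with 1 task the thread pools are slower than noThreading, and that with 24 and 128 tasks they are approximately four times faster than noThreading. Also, to keep the measurement correct, I start and shut down the thread pool inside the benchmark: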

package io.denery;

import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.infra.Blackhole;

import java.util.concurrent.*;

@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Fork(1)
@BenchmarkMode(Mode.AverageTime)
@Warmup(iterations = 3)
@Measurement(iterations = 3)
@State(Scope.Benchmark)
public class ThreadPoolSamples {

    @Param({"1", "24", "128"})
    int amountOfTasks;

    private static final int tts = Runtime.getRuntime().availableProcessors() * 2;

    @State(Scope.Thread)
    public static class Work {
        public void doWork() {
            try {
                Thread.sleep(1);
            } catch (InterruptedException e) {
                e.printStackTrace();
            }
        }
    }

    @Benchmark
    public void noThreading(Work w) {
        for (int i = 0; i < amountOfTasks; i++) {
            w.doWork();
        }
    }

    @Benchmark
    public void fixedThreadPool(Work w) throws ExecutionException, InterruptedException {
        ExecutorService service = Executors.newFixedThreadPool(tts);
        Future[] futures = new Future[amountOfTasks];
        for (int i = 0; i < amountOfTasks; i++) {
            futures[i] = service.submit(w::doWork);
        }
        for (Future future : futures) {
            future.get();
        }
        service.shutdown();
    }

    @Benchmark
    public void cachedThreadPool(Work w) throws ExecutionException, InterruptedException {
        ExecutorService service = Executors.newCachedThreadPool();
        Future[] futures = new Future[amountOfTasks];
        for (int i = 0; i < amountOfTasks; i++) {
            futures[i] = service.submit(() -> {
                w.doWork();
            });
        }
        for (Future future : futures) {
            future.get();
        }
        service.shutdown();
    }
}

The results of this benchmark are:

Benchmark                           (amountOfTasks)  Mode  Cnt          Score         Error  Units
ThreadPoolSamples.cachedThreadPool                1  avgt    3    1169075,866 ±   47607,783  ns/op
ThreadPoolSamples.cachedThreadPool               24  avgt    3    5208437,498 ± 4516260,543  ns/op
ThreadPoolSamples.cachedThreadPool              128  avgt    3   13112351,066 ± 1905089,389  ns/op
ThreadPoolSamples.fixedThreadPool                 1  avgt    3    1166087,665 ±   61193,085  ns/op
ThreadPoolSamples.fixedThreadPool                24  avgt    3    4721503,799 ±  313206,519  ns/op
ThreadPoolSamples.fixedThreadPool               128  avgt    3   18337097,997 ± 5781847,191  ns/op
ThreadPoolSamples.noThreading                     1  avgt    3    1066035,522 ±   83736,346  ns/op
ThreadPoolSamples.noThreading                    24  avgt    3   25525744,055 ±   45422,015  ns/op
ThreadPoolSamples.noThreading                   128  avgt    3  136126357,514 ±  200461,808  ns/op

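We see that the error is not very large, and that with 1 task the thread pools are slower than noThreading. But if you compare 25525744,055 with 4721503,799, the speedup is 5.406, which is somehow faster than the expected ~4, and if you compare 136126357,514 with 18337097,997, the speedup is 7.4. This fake speedup grows with amountOfTasks, so I think the benchmark is still incorrect. I plan to look at this with PrintAssembly to find out whether there are any JVM optimizations.

Edit:

As user17294549 mentioned in an answer, I used Thread.sleep() as an imitation of real work, and that is not correct because: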

for real work: only 2 tasks can run simultaneously on a 2-core system
for Thread.sleep(): any number of tasks can run simultaneously on a 2-core system

I remembered that JMH has a Blackhole.consumeCPU(long tokens) method that "burns cycles" to imitate work; there is a JMH example and docs for it. So I changed the work to:

@State(Scope.Thread)
public static class Work {
    public void doWork() {
        Blackhole.consumeCPU(4096);
    }
}

Benchmark results for this change:

Benchmark                           (amountOfTasks)  Mode  Cnt         Score          Error  Units
ThreadPoolSamples.cachedThreadPool                1  avgt    3    301187,897 ±    95819,153  ns/op
ThreadPoolSamples.cachedThreadPool               24  avgt    3   2421815,991 ±   545978,808  ns/op
ThreadPoolSamples.cachedThreadPool              128  avgt    3   6648647,025 ±    30442,510  ns/op
ThreadPoolSamples.cachedThreadPool             2048  avgt    3  60229404,756 ± 21537786,512  ns/op
ThreadPoolSamples.fixedThreadPool                 1  avgt    3    293364,540 ±    10709,841  ns/op
ThreadPoolSamples.fixedThreadPool                24  avgt    3   1459852,773 ±   160912,520  ns/op
ThreadPoolSamples.fixedThreadPool               128  avgt    3   2846790,222 ±    78929,182  ns/op
ThreadPoolSamples.fixedThreadPool              2048  avgt    3  25102603,592 ±  1825740,124  ns/op
ThreadPoolSamples.noThreading                     1  avgt    3     10071,049 ±      407,519  ns/op
ThreadPoolSamples.noThreading                    24  avgt    3    241561,416 ±    15326,274  ns/op
ThreadPoolSamples.noThreading                   128  avgt    3   1300241,347 ±   148051,168  ns/op
ThreadPoolSamples.noThreading                  2048  avgt    3  20683253,408 ±  1433365,542  ns/op

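We see that fixedThreadPool is somehow slower than the example without threading, and that the larger amountOfTasks is, the smaller the difference between the fixedThreadPool and noThreading examples. What is happening there? I saw the same phenomenon with string concatenation at the beginning of this question, but I did not report it. (By the way, thanks to everyone who read this novel and tried to answer the question, you really helped me.)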

Answer 1

See the answers to this question to learn how to write benchmarks in Java.

... executorService maybe correct (but i am still unsure) ...

Benchmark                              (amountOfTasks)  Mode  Cnt         Score         Error  Units
ThreadPoolSamples.executorService                     24  avgt    3    255102,966 ± 4460279,056  ns/op

It doesn't look like a correct result: the error, 4460279,056, is 17 times greater than the base value, 255102,966.

You also have a bug:

@Benchmark
public void executorService(ExecutorServiceState e, Work w, Blackhole bh) {
    for (int i = 0; i < amountOfTasks; i++) {
         e.service.execute(() -> bh.consume(w.doWork(strs)));
    }
}

You submit tasks to the ExecutorService, but you don't wait for them to complete.
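One way to make the execute() variant wait is a CountDownLatch. The sketch below is not taken from the question: the class name AwaitExecute and the helper runAndAwait are hypothetical, and it uses only java.util.concurrent so that it runs without JMH. Inside a @Benchmark method the same pattern would apply, with the Blackhole call placed in the task body.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

class AwaitExecute {
    /**
     * Runs work the given number of times via execute() and blocks until
     * every run has finished or the timeout expires.
     * Returns how many runs completed normally.
     */
    static int runAndAwait(ExecutorService pool, int tasks, Runnable work)
            throws InterruptedException {
        CountDownLatch done = new CountDownLatch(tasks);
        AtomicInteger completed = new AtomicInteger();
        for (int i = 0; i < tasks; i++) {
            pool.execute(() -> {
                try {
                    work.run();
                    completed.incrementAndGet();
                } finally {
                    done.countDown(); // count down even if work throws
                }
            });
        }
        // Without this await the caller returns while tasks are still queued,
        // so only the cost of enqueueing would be measured.
        done.await(1, TimeUnit.MINUTES);
        return completed.get();
    }

    public static void main(String[] args) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            System.out.println(runAndAwait(pool, 24, () -> Math.sqrt(42.0)));
        } finally {
            pool.shutdown();
        }
    }
}
```

The latch is created per call, so each invocation waits only for its own tasks.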

Answer 2

Look at this code:

    @TearDown(Level.Iteration)
    public void downMethod() {
        service.shutdownNow();
        service = null;
    }

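You don't wait for the threads to stop. Read the docs for details.
So some of your benchmarks might run in parallel with another 128 threads spawned by cachedThreadPool in a previous benchmark.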
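A minimal sketch of a teardown that does wait, assuming a plain ExecutorService. The class PoolTeardown and the method stopAndAwait are illustrative names, not part of the question's code. shutdownNow() only attempts to stop running tasks and returns at once, while awaitTermination() blocks until the pool has terminated or the timeout elapses.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class PoolTeardown {
    /**
     * Stops the pool and waits for it to terminate.
     * Returns true if the pool terminated within the timeout.
     */
    static boolean stopAndAwait(ExecutorService pool, long timeout, TimeUnit unit)
            throws InterruptedException {
        pool.shutdownNow();                          // interrupts workers, returns immediately
        return pool.awaitTermination(timeout, unit); // blocks until terminated or timed out
    }

    public static void main(String[] args) throws InterruptedException {
        ExecutorService pool = Executors.newCachedThreadPool();
        for (int i = 0; i < 8; i++) {
            pool.execute(() -> {
                try {
                    Thread.sleep(50);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // exit promptly on shutdownNow()
                }
            });
        }
        System.out.println("terminated: " + stopAndAwait(pool, 10, TimeUnit.SECONDS));
    }
}
```

Calling such a helper from the @TearDown method would keep leftover worker threads from overlapping with the next iteration.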

so for simplicity I used Thread.sleep() as "work"

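Are you sure?
There is a big difference between real work and Thread.sleep():

for real work: only 2 tasks can run simultaneously on a 2-core system
for Thread.sleep(): any number of tasks can run simultaneously on a 2-core system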
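This difference may explain the speedups above 4 reported in the question. A sleeping task occupies a pool thread rather than a core, so an idealized model counts "rounds" of pool-sized batches. The sketch below is only a rough model under stated assumptions: the names SleepModel, rounds, and idealSleepNanos are hypothetical, and the pool size of 8 assumes that availableProcessors() returned 4 on the asker's 2-core, 4-thread PC, which the question does not confirm.

```java
class SleepModel {
    /** Number of sequential rounds needed to run the tasks on poolSize threads. */
    static int rounds(int tasks, int poolSize) {
        return (tasks + poolSize - 1) / poolSize; // ceiling division
    }

    /** Idealized wall time when each sleeping task occupies a pool thread, not a core. */
    static double idealSleepNanos(int tasks, int poolSize, double oneTaskNanos) {
        return rounds(tasks, poolSize) * oneTaskNanos;
    }

    public static void main(String[] args) {
        int poolSize = 8;               // assumption: availableProcessors() == 4, times 2
        double oneTask = 1_066_035.522; // noThreading score for 1 task, from the question
        for (int tasks : new int[] {24, 128}) {
            System.out.printf("%d tasks: %d rounds, ideal %.0f ns%n",
                    tasks, rounds(tasks, poolSize), idealSleepNanos(tasks, poolSize, oneTask));
        }
    }
}
```

Under these assumptions, 24 tasks need 3 rounds (about 3.2 ms) and 128 tasks need 16 rounds (about 17.1 ms), which is in the same range as the measured fixedThreadPool scores of roughly 4.7 ms and 18.3 ms. With sleeping tasks the speedup is bounded by the pool size rather than by the core count.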

Answer 3

Here is what I got on my machine (maybe this can help you understand what the problem is):

This is the benchmark (I modified it a little bit):

package io.denery;

import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.infra.Blackhole;
import org.openjdk.jmh.Main;
import java.util.concurrent.*;

@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Fork(1)
@Threads(1)
@BenchmarkMode(Mode.AverageTime)
@Warmup(iterations = 5)
@Measurement(iterations = 5)
@State(Scope.Benchmark)
public class ThreadPoolSamples {
  @Param({"1", "24", "128"})
  int amountOfTasks;
  private static final int tts = Runtime.getRuntime().availableProcessors() * 2;

  private static void doWork() {
    Blackhole.consumeCPU(4096);
  }

  public static void main(String[] args) throws Exception {
    Main.main(args);
  }

  @Benchmark
  public void noThreading() {
    for (int i = 0; i < amountOfTasks; i++) {
      doWork();
    }
  }

  @Benchmark
  public void fixedThreadPool(Blackhole bh) throws Exception {
    runInThreadPool(amountOfTasks, bh, Executors.newFixedThreadPool(tts));
  }

  @Benchmark
  public void cachedThreadPool(Blackhole bh) throws Exception {
    runInThreadPool(amountOfTasks, bh, Executors.newCachedThreadPool());
  }

  private static void runInThreadPool(int amountOfTasks, Blackhole bh, ExecutorService threadPool)
      throws Exception {
    Future<?>[] futures = new Future[amountOfTasks];
    for (int i = 0; i < amountOfTasks; i++) {
      futures[i] = threadPool.submit(ThreadPoolSamples::doWork);
    }
    for (Future<?> future : futures) {
      bh.consume(future.get());
    }

    threadPool.shutdownNow();
    threadPool.awaitTermination(5, TimeUnit.MINUTES);
  }
}

Specs and versions:

JMH version: 1.33  
VM version: JDK 17.0.1, OpenJDK 64-Bit Server
Linux 5.14.14
CPU: Intel(R) Core(TM) i5-2320 CPU @ 3.00GHz, 4 Cores, No Hyper-Threading

Results:

Benchmark                           (amountOfTasks)  Mode  Cnt        Score        Error  Units
ThreadPoolSamples.cachedThreadPool                1  avgt    5    92968.252 ±   2853.687  ns/op
ThreadPoolSamples.cachedThreadPool               24  avgt    5   547558.977 ±  88937.441  ns/op
ThreadPoolSamples.cachedThreadPool              128  avgt    5  1502909.128 ±  40698.141  ns/op
ThreadPoolSamples.fixedThreadPool                 1  avgt    5    97945.026 ±    435.458  ns/op
ThreadPoolSamples.fixedThreadPool                24  avgt    5   643453.028 ± 135859.966  ns/op
ThreadPoolSamples.fixedThreadPool               128  avgt    5   998425.118 ± 126463.792  ns/op
ThreadPoolSamples.noThreading                     1  avgt    5    10165.462 ±     78.008  ns/op
ThreadPoolSamples.noThreading                    24  avgt    5   245942.867 ±  10594.808  ns/op
ThreadPoolSamples.noThreading                   128  avgt    5  1302173.090 ±   5482.655  ns/op

Answer 4

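I solved this problem myself with the help of the other answerers. In the last edit (and in all the other edits) the issue was in my gradle configuration: I was running this benchmark on all of my system threads. I use this gradle plugin to run JMH, and in my gradle build script I had set threads = 4 before making all of my benchmarks. So you saw these strange benchmark results because JMH was trying to run the benchmark on all available threads while the thread pool was also doing its work on all available threads. I removed this configuration, set the @State(Scope.Thread) and @Threads(1) annotations on the benchmark class, and slightly edited the runInThreadPool() method to: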

public static void runInThreadPool(int amountOfTasks, Blackhole bh, ExecutorService threadPool)
            throws InterruptedException, ExecutionException {
        Future<?>[] futures = new Future[amountOfTasks];
        for (int i = 0; i < amountOfTasks; i++) {
            futures[i] = threadPool.submit(PrioritySchedulerSamples::doWork, (ThreadFactory) runnable -> {
                Thread thread = new Thread(runnable);
                thread.setPriority(10);
                return thread;
            });
        }
        for (Future<?> future : futures) {
            bh.consume(future.get());
        }

        threadPool.shutdownNow();
        threadPool.awaitTermination(10, TimeUnit.SECONDS);
    }

So every thread in this thread pool runs with the highest priority. Benchmark results with all of these changes:
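Note that ExecutorService.submit(Runnable, T) treats its second argument as the value that the returned Future yields, so a ThreadFactory passed there is not used to create the pool's threads. If the goal is worker threads with a given priority, one option is to hand the factory to the pool at construction time. A hedged sketch follows; PriorityPool, newFixedPool, and observedPriority are hypothetical names, not the answer's code.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.TimeUnit;

class PriorityPool {
    /** Builds a fixed pool whose worker threads are created with the given priority. */
    static ExecutorService newFixedPool(int threads, int priority) {
        ThreadFactory factory = runnable -> {
            Thread thread = new Thread(runnable);
            thread.setPriority(priority);
            return thread;
        };
        return Executors.newFixedThreadPool(threads, factory);
    }

    /** Returns the priority seen by a task running inside the pool. */
    static int observedPriority(ExecutorService pool) throws Exception {
        Callable<Integer> task = () -> Thread.currentThread().getPriority();
        return pool.submit(task).get();
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = newFixedPool(2, Thread.MAX_PRIORITY);
        try {
            System.out.println(observedPriority(pool));
        } finally {
            pool.shutdownNow();
            pool.awaitTermination(10, TimeUnit.SECONDS);
        }
    }
}
```

Whether a higher Java priority actually changes scheduling can depend on the operating system and JVM settings, so it is worth measuring with and without it.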

Benchmark                                 (amountOfTasks)  Mode  Cnt         Score         Error  Units
PrioritySchedulerSamples.fixedThreadPool             2048  avgt    3   8021054,516 ± 2874987,327  ns/op
PrioritySchedulerSamples.noThreading                 2048  avgt    3  17583295,617 ± 5499026,016  ns/op

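These results seem to be correct. (Especially for my system.)

I also made a list of common problems in microbenchmarking thread pools and basically all concurrent Java components:

Make sure your microbenchmark executes in one thread: use the @Threads(1) and @State(Scope.Thread) annotations. (For example, use the htop command to find out how many and which threads consume the most CPU.)
Make sure the task is executed completely within your microbenchmark, and wait for all threads to finish it. (Maybe your microbenchmark doesn't wait for the tasks to complete?)
Don't use Thread.sleep() to imitate real work; JMH provides the Blackhole.consumeCPU(long tokens) method, which you can freely use as an imitation of some work.
Make sure you understand the component you are benchmarking. (It's obvious, but before this post I didn't know Java thread pools very well.)
Make sure you know the compiler optimization effects described in these JMH samples, and generally know JMH very well.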

JMH - How to correctly benchmark Thread Pools?

java

benchmarking

multithreading

microbenchmark