为什么 std::all_of() 的编码版本与调用 std::all_of() 的基准测试不同?

Why does coded version of std::all_of() benchmark differently than call to std::all_of()?

Andrei Alexandrescu 在 2015 code::dive 会议“编写快速代码”演讲中受到启发:https://www.youtube.com/watch?v=vrfYLlR8X8k

我采用了 llvm std::all_of 代码并针对直接调用 std::all_of() 对其进行了基准测试。我使用了 google 基准,以及 0 - 65535 个元素的范围(都满足谓词):

#include <benchmark/benchmark.h>

#include <algorithm>

// LLVM std::all_of copied from https://github.com/llvm-mirror/libcxx/blob/master/include/algorithm. Benchmarking this against a direct call to std::all_of().
template <class _InputIterator, class _Predicate>
inline bool
all_of(_InputIterator __first, _InputIterator __last, _Predicate __pred)
{
  for (; __first != __last; ++__first)
    if (!__pred(*__first))
      return false;
  return true;
}

static bool equals_42(int i)
{
    return i == 42;
}

// Benchmark for ::all_of (above)
static void BM_all_of(benchmark::State &state)
{
  std::vector<int> v(state.range(0), 42);
  for (auto _ : state)
    benchmark::DoNotOptimize(::all_of(v.begin(), v.end(), equals_42));
}
// Register the function as a benchmark
BENCHMARK(BM_all_of)->Range(0, 65535);

// Benchmark for std::all_of
static void BM_std_all_of(benchmark::State &state)
{
  std::vector<int> v(state.range(0), 42);
  for (auto _ : state)
    // Note had to use DoNotOptimize because clang++ will optimize std::all_of away otherwise. (I could recreate this with above code by making all_of static or adding __attribute__((internal_linkage))).
    // Note #2: For g++ the effect from DoNotOptimize was marginal, but I left it in to compare apples to apples.
    benchmark::DoNotOptimize(std::all_of(v.begin(), v.end(), equals_42));
}
BENCHMARK(BM_std_all_of)->Range(0, 65535);

BENCHMARK_MAIN();

使用 clang++ 进行基准测试:

$ clang++ -Ofast benchmark/bm_algorithm.cpp -isystem ../../benchmark/include -L ../../benchmark/build/src -lbenchmark -lpthread -o bm; ./bm
2021-07-07T15:55:08-07:00
Running ./bm
Run on (16 X 3194.05 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x8)
  L1 Instruction 64 KiB (x8)
  L2 Unified 512 KiB (x8)
  L3 Unified 8192 KiB (x2)
Load Average: 0.16, 0.15, 0.12
--------------------------------------------------------------
Benchmark                    Time             CPU   Iterations
--------------------------------------------------------------
BM_all_of/0              0.248 ns        0.248 ns   1000000000
BM_all_of/1              0.510 ns        0.510 ns   1000000000
BM_all_of/8               2.76 ns         2.76 ns    250915933
BM_all_of/64              28.1 ns         28.1 ns     25130057
BM_all_of/512              136 ns          136 ns      5115542
BM_all_of/4096            1084 ns         1084 ns       678951
BM_all_of/32768           8257 ns         8257 ns        84044
BM_all_of/65535          16655 ns        16655 ns        41223
BM_std_all_of/0          0.520 ns        0.520 ns   1000000000
BM_std_all_of/1          0.516 ns        0.516 ns   1000000000
BM_std_all_of/8           2.37 ns         2.37 ns    290607282
BM_std_all_of/64          13.2 ns         13.2 ns     48996825
BM_std_all_of/512          104 ns          104 ns      6693344
BM_std_all_of/4096         779 ns          779 ns       899729
BM_std_all_of/32768       6205 ns         6205 ns       112868
BM_std_all_of/65535      12387 ns        12387 ns        56428

使用 g++ 进行基准测试:

$ g++ -Ofast benchmark/bm_algorithm.cpp -isystem ../../benchmark/include -L ../../ben
chmark/build_gcc/src -lbenchmark -lpthread -o bm; ./bm
2021-07-07T16:05:38-07:00
Running ./bm
Run on (16 X 3194.05 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x8)
  L1 Instruction 64 KiB (x8)
  L2 Unified 512 KiB (x8)
  L3 Unified 8192 KiB (x2)
Load Average: 0.00, 0.02, 0.06
--------------------------------------------------------------
Benchmark                    Time             CPU   Iterations
--------------------------------------------------------------
BM_all_of/0              0.749 ns        0.749 ns    927807313
BM_all_of/1               1.25 ns         1.25 ns    562992870
BM_all_of/8               4.09 ns         4.09 ns    172825424
BM_all_of/64              28.9 ns         28.9 ns     24147840
BM_all_of/512              139 ns          139 ns      5133907
BM_all_of/4096            1074 ns         1074 ns       677743
BM_all_of/32768           8457 ns         8457 ns        84469
BM_all_of/65535          16650 ns        16650 ns        36781
BM_std_all_of/0           2.32 ns         2.32 ns    304514515
BM_std_all_of/1           3.67 ns         3.67 ns    196181853
BM_std_all_of/8           11.0 ns         11.0 ns     63325378
BM_std_all_of/64          75.1 ns         75.1 ns      9310900
BM_std_all_of/512          550 ns          550 ns      1266670
BM_std_all_of/4096        4351 ns         4351 ns       159969
BM_std_all_of/32768      34867 ns        34867 ns        20133
BM_std_all_of/65535      69588 ns        69589 ns        10025

我希望代码在每个编译器上 运行 具有相同的速度。相反:

有人知道为什么这些执行时间可能不同吗?我是对这类事情进行基准测试的新手。

正如上面的评论所说,您正在将 all_of 的 libstdc++ 版本与您复制到代码中的 libc++ 版本进行比较,而 libstdc++ 版本更快。

至于为什么更快:std::all_of calls std::find_if_not, which calls __find_if_not. This function negates the predicate and calls __find_if,将迭代器类别作为附加参数传递给它。

__find_if 有两个重载:一个用于输入迭代器,类似于上面显示的实现。 The other 用于随机访问迭代器。此重载可以计算范围的长度,这允许它展开循环并在每次迭代中执行四次谓词调用。这就是它可以在 std::vector.

等随机访问容器上更快工作的原因