Benchmarking a pure C++ function

Question

How do I prevent GCC/Clang from inlining and optimizing away multiple invocations of a pure function?

I am trying to benchmark code of this form:

int __attribute__ ((noinline)) my_loop(int const* array, int len) {
  // Use array to compute result.
}

My benchmark code looks something like this:

int main() {
  const int number = 2048;
  // My own aligned_malloc implementation.
  int* input = (int*)aligned_malloc(sizeof(int) * number, 32);
  // Fill the array with some random numbers.
  make_random(input, number);
  const int num_runs = 10000000;
  for (int i = 0; i < num_runs; i++) {
    const int result = my_loop(input, number); // Call pure function.
  }
  // Since the program exits I don't free input.
}

As expected, Clang seems to be able to turn this into a no-op at -O2 (perhaps even at -O1).

A few things I tried in order to actually benchmark my implementation:

Accumulate the intermediate results in an integer and print the result at the end:

const int num_runs = 10000000;
uint64_t total = 0;
for (int i = 0; i < num_runs; i++) {
  total += my_loop(input, number); // Call pure function.
}
// Cast so the argument type matches the %llu conversion on every platform.
printf("Total is %llu\n", (unsigned long long)total);

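Sadly, this doesn't seem to work. Clang, at least, is smart enough to realize that this is a pure function, and it transforms the benchmark into something like this: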

int result = my_loop(input, number); // Called only once.
uint64_t total = num_runs * result;
printf("Total is %llu\n", (unsigned long long)total);

Set an atomic variable using release semantics at the end of every loop iteration:

const int num_runs = 10000000;
std::atomic<uint64_t> result_atomic(0);
for (int i = 0; i < num_runs; i++) {
  int result = my_loop(input, number); // Call pure function.
  // Tried std::memory_order_release too.
  result_atomic.store(result, std::memory_order_seq_cst);
}
// Cast so the argument type matches the %llu conversion on every platform.
printf("Result is %llu\n", (unsigned long long)result_atomic.load());

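My hope was that, since atomics introduce a happens-before relationship, Clang would be forced to execute my code. Sadly, it still performed the optimization above and set the value of the atomic to num_runs * result in one shot, instead of running num_runs iterations of the function.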

Set a volatile int at the end of every loop iteration, along with summing the total:

const int num_runs = 10000000;
uint64_t total = 0;
volatile int trigger = 0;
for (int i = 0; i < num_runs; i++) {
  total += my_loop(input, number); // Call pure function.
  trigger = 1;
}
// If I take this printf out, Clang optimizes the code away again.
printf("Total is %llu\n", (unsigned long long)total);

This seems to do the trick, and my benchmarks seem to work. It is not ideal for a number of reasons:

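As I understand the C++11 memory model, volatile stores do not establish a happens-before relationship, so I can't be sure that some compiler won't decide to perform the same num_runs * result_of_1_run optimization.
This approach also seems undesirable because I now pay the overhead (however tiny) of setting a volatile int on every iteration of my loop.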
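
To make the concern above concrete, here is a minimal sketch of one variation: instead of writing a volatile flag after the call, the input pointer is read through a volatile object on every iteration. Because the compiler has to assume each volatile read may yield a different pointer, it cannot treat the calls as identical and collapse them into one. This is not from the original post: the body of my_loop is a stand-in (a plain sum), run_with_volatile_input and opaque_input are hypothetical names, and the noinline attribute is still GCC/Clang-specific.

```cpp
#include <cstdint>

// Stand-in for the real kernel (assumption): sums the array.
__attribute__((noinline)) int my_loop(int const* array, int len) {
    int sum = 0;
    for (int i = 0; i < len; ++i) sum += array[i];
    return sum;
}

// Hypothetical helper: runs my_loop num_runs times and returns the total.
std::uint64_t run_with_volatile_input(int const* input, int len, int num_runs) {
    // The pointer object itself is volatile, not the ints it points to.
    int const* volatile opaque_input = input;
    std::uint64_t total = 0;
    for (int i = 0; i < num_runs; ++i) {
        // Volatile read: the compiler may not assume it returns the same
        // pointer as on the previous iteration.
        int const* p = opaque_input;
        total += my_loop(p, len);
    }
    // The caller should still use the total (for example, print it),
    // otherwise the whole computation is dead code.
    return total;
}
```

The per-iteration cost is one load of a pointer, which is comparable to the volatile store in the attempt above, so this addresses the first concern rather than the second.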

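Is there a canonical way of preventing Clang/GCC from optimizing this result away, maybe with a pragma or something? Bonus points if the ideal method works across compilers.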

Answer 1

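You can insert instructions directly into the generated assembly. I sometimes use a macro to split up the assembly, e.g. to separate loads from calculations and branching.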

#define GCC_SPLIT_BLOCK(str)  __asm__( "//\n\t// " str "\n\t//\n" );

Then, in the source, insert

GCC_SPLIT_BLOCK("Keep this please")

before and after your function.
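
A related inline-assembly idiom, sketched here as an assumption rather than as part of this answer, is an empty asm statement used purely as an optimization barrier. It emits no instructions, but a "memory" clobber tells GCC and Clang that memory may have changed, so the call cannot be hoisted out of the loop, and passing the result as an asm input keeps it from being discarded. The names escape_pointer, use_value and run_benchmark are hypothetical, and the body of my_loop is a stand-in; Google Benchmark provides similar helpers as benchmark::DoNotOptimize and benchmark::ClobberMemory.

```cpp
#include <cstdint>

// Tell the compiler that unknown code has seen pointer p and may have
// written to any memory. Emits no instructions (GCC/Clang extended asm).
inline void escape_pointer(void const* p) {
    asm volatile("" : : "r"(p) : "memory");
}

// Tell the compiler that unknown code reads v, so v must be computed.
inline void use_value(int v) {
    asm volatile("" : : "r"(v));
}

// Stand-in for the real kernel (assumption): sums the array.
__attribute__((noinline)) int my_loop(int const* array, int len) {
    int sum = 0;
    for (int i = 0; i < len; ++i) sum += array[i];
    return sum;
}

// Hypothetical harness: calls my_loop num_runs times.
std::uint64_t run_benchmark(int const* input, int len, int num_runs) {
    std::uint64_t total = 0;
    for (int i = 0; i < num_runs; ++i) {
        escape_pointer(input);             // array contents may have changed
        int result = my_loop(input, len);  // must be recomputed each time
        use_value(result);                 // result counts as observed
        total += result;
    }
    return total;
}
```

Unlike the volatile flag in the question, these barriers add no loads or stores to the loop, though the memory clobber can still force the compiler to reload values it had cached in registers.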
