模索引对性能有很大的影响

Question

我有一个简单的代码，可以对数组中的元素和 returns 它们求和：

// Called with jump == 0
int performance(int jump, int *array, int size) {
  int currentIndex = 0;
  int total = 0;
  // For i in 1...500_000_000
  for (int i = 0; i < 500000000; i++) {
    currentIndex = (currentIndex + jump) % size;
    total += array[currentIndex];
  }
  return total;
}

我注意到一个奇怪的行为：% size 的存在对性能有非常大的影响（慢 10 倍左右）即使 jump 是 0 所以它不断访问相同的数组元素 (0)。只需删除 % size 即可大大提高性能。

我原以为这只是造成这种差异的模计算，但现在说我用 total += array[currentIndex] % size; 替换了我的求和线（因此也计算了模）性能差异几乎不明显。

我正在 arm64 机器上用 -O3 和 clang 编译这个。

可能是什么原因造成的？

Answer 1

听起来正常 sdiv+msub 延迟约为 10 倍 add 延迟。

即使内联编译时间常数 size 不是 2 的幂，它仍然是 multiplicative inverse 和 msub（乘减）得到余数，所以至少有两个乘法和一个移位的深度链。

由于数组也是有符号的 int，因此对于具有恒定大小（即使是正数）的带符号余数，可能在关键路径上有额外的几条指令。例如-4 % 3 必须在 C 中生成 -1。

见

How many CPU cycles are needed for each assembly instruction?
What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand?

say I replace my sum line with total += array[currentIndex] % size; (thus also computing a modulo)

其余部分不是循环携带的依赖链的一部分。 (https://fgiesen.wordpress.com/2018/03/05/a-whirlwind-introduction-to-dataflow-graphs/)

多个余数计算可以并行进行，因为下一个 array[idx] 加载地址仅取决于 += jump 加法指令。

如果您不在吞吐量限制上遇到瓶颈，那么这些剩余结果可能会以 1/clock 吞吐量准备就绪，OoO exec 在迭代之间重叠 dep 链。唯一的延迟瓶颈是循环 counter/index 和 total += ...，它们都是整数 add，具有 1 个周期延迟。

所以说真的，瓶颈可能会出现在（整个循环体的）吞吐量上，而不是那些延迟瓶颈，除非你在可以完成很多工作的非常广泛的 CPU 上进行测试每个周期。（令人惊讶的是，你根本没有因为引入 % 而减速。如果你不使用结果，除非 total 在内联后得到优化。）

模索引对性能有很大的影响

Indexing with modulo has a huge performance hit

c

performance

assembly

arm64