Vectorization of loop with accumulation in variable

I have the following loop, which I am compiling with icc:

for (int i = 0; i < arrays_size; ++i) {
    total = total + C[i];
}

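The vectorization report says that this loop was vectorized, but I don't understand how that is possible, since there is an obvious read-after-write dependency on total.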

The report output is as follows:

LOOP BEGIN at loops.cpp(46,5)
      remark #15388: vectorization support: reference C has aligned access   [ loops.cpp(47,7) ]
      remark #15305: vectorization support: vector length 4
      remark #15399: vectorization support: unroll factor set to 8
      remark #15309: vectorization support: normalized vectorization overhead 0.475
      remark #15300: LOOP WAS VECTORIZED
      remark #15448: unmasked aligned unit stride loads: 1
      remark #15475: --- begin vector loop cost summary ---
      remark #15476: scalar loop cost: 5
      remark #15477: vector loop cost: 1.250
      remark #15478: estimated potential speedup: 3.990
      remark #15488: --- end vector loop cost summary ---
      remark #25015: Estimate of max trip count of loop=31250
   LOOP END

Can someone explain what this means and how this loop can be vectorized?

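Depending on the types of total and C[i], you can exploit the associativity and commutativity of addition and first sum 4 or 8 (or more) sub-totals. The dependency on total is a reduction, and a compiler can vectorize it by keeping several partial sums in parallel and combining them after the loop: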

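int subtotal[4] = {0, 0, 0, 0};
int i = 0;
// main loop: four independent running sums, one per vector lane
for (; i + 4 <= arrays_size; i += 4) {
    for (int k = 0; k < 4; ++k)
        subtotal[k] += C[i + k];
}
// handle the remaining (arrays_size % 4) elements of C, if any
for (; i < arrays_size; ++i)
    subtotal[0] += C[i];
// sum up the sub-totals:
total = (subtotal[0] + subtotal[2]) + (subtotal[1] + subtotal[3]);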

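This is valid for any integer type. Floating-point addition is not strictly associative, so reordering it can change how the result is rounded, but by default ICC assumes that floating-point addition is associative as well (gcc and clang need some subset of -ffast-math for that).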
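To make the integer versus floating-point distinction concrete, here is a minimal sketch that compares plain left-to-right summation with the four sub-total scheme. The names sum_sequential, sum_four_lanes and compare_orders are hypothetical helpers made up for this illustration, and the float part assumes strict IEEE single-precision arithmetic (no -ffast-math, and not ICC's default floating-point model).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Plain left-to-right accumulation, as in the loop from the question.
template <typename T>
T sum_sequential(const std::vector<T>& c) {
    T total = 0;
    for (std::size_t i = 0; i < c.size(); ++i)
        total = total + c[i];
    return total;
}

// Four independent sub-totals, mimicking a vector register with 4 lanes.
template <typename T>
T sum_four_lanes(const std::vector<T>& c) {
    T subtotal[4] = {0, 0, 0, 0};
    std::size_t i = 0;
    for (; i + 4 <= c.size(); i += 4)        // full blocks of 4
        for (std::size_t k = 0; k < 4; ++k)
            subtotal[k] += c[i + k];
    for (; i < c.size(); ++i)                // leftover elements
        subtotal[0] += c[i];
    return (subtotal[0] + subtotal[2]) + (subtotal[1] + subtotal[3]);
}

// Hypothetical self-check (not part of the original answer); call it from main().
inline void compare_orders() {
    // Integers: any summation order gives the same total.
    std::vector<int> ints(1003);             // deliberately not a multiple of 4
    for (std::size_t i = 0; i < ints.size(); ++i)
        ints[i] = static_cast<int>(i % 17) - 8;
    assert(sum_sequential(ints) == sum_four_lanes(ints));

    // Floats: 2^24 followed by seven ones. Added one at a time, each 1.0f is
    // rounded away; with four lanes, most ones are paired up before they
    // meet the large value, so they survive.
    std::vector<float> floats = {16777216.0f, 1.0f, 1.0f, 1.0f,
                                 1.0f, 1.0f, 1.0f, 1.0f};
    assert(sum_sequential(floats) != sum_four_lanes(floats));
}
```

Under those assumptions the sequential float sum should come out as 16777216 and the four-lane sum as 16777222, while the exact total is 16777223. Neither order is exact, and the two disagree, which is why gcc and clang do not reorder a floating-point reduction unless they are told they may.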