关联性为我们提供了可并行性。但是交换性给出了什么？

Question

Alexander Stepanov 在他的一篇 brilliant lectures at A9（顺便强烈推荐）中指出，关联属性为我们提供了可并行性 – 一个非常有用的以及当今编译器、CPU 和程序员自己可以利用的重要特性：

// expressions in parentheses can be done in parallel
// because matrix multiplication is associative
Matrix X = (A * B) * (C * D);

但是 可交换性 属性给了我们什么呢？重新排序？乱序执行？

Answer 1

一些架构，x86 是一个典型的例子，有指令其中一个来源也是目的地。如果操作后还需要destination的原始值，则需要额外的指令将其复制到另一个寄存器。

交换操作让您（或编译器）可以选择将哪个操作数替换为结果。例如，compiling (with gcc 5.3 -O3 for x86-64 Linux calling convention):

// FP: a,b,c in xmm0,1,2.  return value goes in xmm0
// Intel syntax ASM is  op  dest, src
// sd means Scalar Double (as opposed to packed vector, or to single-precision)
double comm(double a, double b, double c) { return (c+a) * (c+b); }
    addsd   xmm0, xmm2
    addsd   xmm1, xmm2
    mulsd   xmm0, xmm1
    ret
double hard(double a, double b, double c) { return (c-a) * (c-b); }
    movapd  xmm3, xmm2    ; reg-reg copy: move Aligned Packed Double
    subsd   xmm2, xmm1
    subsd   xmm3, xmm0
    movapd  xmm0, xmm3
    mulsd   xmm0, xmm2
    ret
double easy(double a, double b, double c) { return (a-c) * (b-c); }
    subsd   xmm0, xmm2
    subsd   xmm1, xmm2
    mulsd   xmm0, xmm1
    ret

x86 还允许使用内存操作数作为源，因此您可以将负载折叠到 ALU 操作中，如 addsd xmm0, [my_constant]。（将 ALU 操作与内存目标一起使用很糟糕：它必须执行读取-修改-写入。）交换操作为执行此操作提供了更多范围。

x86 的 avx extension (in Sandybridge, Jan 2011) added non-destructive versions of every existing instruction that used vector registers (same opcodes but with a multi-byte VEX prefix replacing all the previous prefixes and escape bytes). Other instruction-set extensions (like BMI/BMI2) also use the VEX coding scheme to introduce 3-operand non-destructive integer instructions, like PEXT r32a, r32b, r/m32: Parallel extract of bits from r32b using mask in r/m32. Result is written to r32a.

AVX 还将向量扩展到 256b 并添加了一些新指令。不幸的是，它远非无处不在，甚至 Skylake Pentium/Celeron CPU 也不支持它。运送假定 AVX 支持的二进制文件是安全的还需要很长时间。 :(

将 -march=native 添加到上面的 godbolt link 中的编译选项，可以看到 AVX 让编译器即使对于 hard() 也只使用 3 条指令。（godbolt 在 Haswell 服务器上运行，因此包括 AVX2 和 BMI2）：

double hard(double a, double b, double c) { return (c-a) * (c-b); }
        vsubsd  xmm0, xmm2, xmm0
        vsubsd  xmm1, xmm2, xmm1
        vmulsd  xmm0, xmm0, xmm1
        ret

Answer 2

这是一个更抽象的答案，不太强调指令级并行性，而更多地强调线程级并行性。

并行中的一个常见 objective 是减少信息。一个简单的例子是两个数组的点积

for(int i=0; i<N; i++) sum += x[i]*[y];

如果操作是关联的，那么我们可以让每个线程计算部分和。那么最后的和就是每个部分和的和。

如果操作是可交换的，则最终的总和可以按任何顺序完成。否则部分和必须按顺序求和。

一个问题是我们不能让多个线程同时写入最终总和，否则会产生竞争条件。因此，当一个线程写入最终总和时，其他线程必须等待。因此，按任何顺序求和会更有效，因为通常很难让每个线程按顺序完成。

让我们选择一个例子。假设有两个线程，因此有两个部分和。

如果操作是可交换的，我们可以得到这种情况

thread2 finishes its partial sum
sum += thread2's partial sum
thread2 finishes writing to sum   
thread1 finishes its partial sum
sum += thread1's partial sum

但是，如果操作不通勤，我们将不得不这样做

thread2 finishes its partial sum
thread2 waits for thread1 to write to sum
thread1 finishes its partial sum
sum += thread1's partial sum
thread2 waits for thread1 to finish writing to sum    
thread1 finishes writing to sum   
sum += thread2's partial sum

这里是 OpenMP 的点积示例

#pragma omp parallel for reduction(+: sum)
for(int i=0; i<N; i++) sum += x[i]*[y];

reduction 子句假定操作（在本例中为 +）是可交换的。大多数人认为这是理所当然的。

如果操作不是可交换的，我们将不得不做这样的事情

float sum = 0;
#pragma omp parallel
{
    float sum_partial = 0 
    #pragma omp for schedule(static) nowait
    for(int i=0; i<N; i++) sum_partial += x[i]*[y];
    #pragma omp for schedule(static) ordered
    for(int i=0; i<omp_get_num_threads(); i++) {
        #pragma omp ordered
        sum += sum_partial;
    }
}

nowait 子句告诉 OpenMP 不要等待每个部分求和完成。 ordered 子句告诉 OpenMP 只按照线程数递增的顺序写入 sum。

此方法以线性方式计算最终总和。但是，它可以在 log2(omp_get_num_threads()) 步内完成。

例如，如果我们有四个线程，我们可以通过三个连续步骤进行缩减

并行计算四个部分和：s1, s2, s3, s4
并行计算：s5 = s1 + s2 线程 1 和 s6 = s3 + s4 线程 2
用线程 1

s5 + s6

这是使用 reduction 子句的一个优势，因为它是一个黑盒子，可以减少 log2(omp_get_num_threads()) 个步骤。 OpenMP 4.0 允许定义自定义缩减。但尽管如此，它仍然假定操作是可交换的。所以它不适合例如。链矩阵乘法。当操作不通勤时，我不知道使用 OpenMP 减少 log2(omp_get_num_threads()) 步骤的简单方法。

关联性为我们提供了可并行性。但是交换性给出了什么？

Associativity gives us parallelizability. But what does commutativity give?

math

parallel-processing

cpu

cpu-architecture

compiler-optimization