Internal parallelization by CPU

Question

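I've been playing around with the Xorshift* random number generators, and I came across this exploration of their properties. Quoting from that site (emphasis mine):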

How can a xorshift64* generator be slower than a xorshift1024* generator?

Dependencies. The three xor/shifts of a xorshift64* generator must be executed sequentially, as each one is dependent on the result of the previous one. In a xorshift1024* generator two of the xor/shifts are completely independent and can be parallelized internally by the CPU. I also suspect that the larger state space makes it possible for the CPU to perform more aggressively speculative execution (indeed, a xorshift128* generator is slower than a xorshift1024* generator).

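What is meant by "parallelized internally by the CPU" in this statement? I took it to mean that the CPU would use vector instructions to perform the two xor/shifts simultaneously, but I could not see any evidence of this in the compiler's assembly output. Is this a deep CPU pipelining thing? Or should I be able to see what is going on in the generated assembler?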

Answer 1

Yes, it's an instruction-level parallelism thing.

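Basically, such a CPU has more execution hardware available than any single instruction needs, so it "spreads out" a bunch of instructions over the available resources and then merges the results back, so that to the programmer it still looks as if everything happened sequentially.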

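What you can see, if you're good at it, is two adjacent instructions that both do work but have no dependency on each other. For instance, they might operate only on non-overlapping sets of registers. In such cases you can guess that they might be executed in parallel, leading to a high instructions-per-cycle value for that particular bit of code.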
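As a minimal sketch of what "no dependency" means in practice, compare a sum that funnels every addition through one accumulator with a sum that keeps two accumulators. The function names are hypothetical and chosen only for illustration. Whether a particular compiler and CPU actually overlap the two chains is an assumption, not something this code proves; it only shows the shape of the dependencies.

```c
#include <stddef.h>
#include <stdint.h>

/* One accumulator: every addition needs the result of the previous one,
   so the additions form a single dependency chain. */
uint64_t sum_one_chain(const uint64_t *v, size_t n) {
    uint64_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += v[i];
    return acc;
}

/* Two accumulators: additions into `even` never read `odd` and vice versa,
   so a superscalar core is free to overlap the two chains. */
uint64_t sum_two_chains(const uint64_t *v, size_t n) {
    uint64_t even = 0, odd = 0;
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        even += v[i];      /* touches only `even` */
        odd  += v[i + 1];  /* touches only `odd`  */
    }
    if (i < n)             /* leftover element when n is odd */
        even += v[i];
    return even + odd;     /* the two chains are joined once, at the end */
}
```

Both functions return the same value for the same input, since unsigned addition wraps and can be reordered freely. Note that an optimizing compiler may unroll or vectorize either loop on its own, so the assembly output may not mirror the source one-to-one.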

To make it more concrete, let's look at the two pieces of code you're talking about (also: a learning opportunity for me).

Here's the core of xorshift64*:

x ^= x >> 12; // a
x ^= x << 25; // b
x ^= x >> 27; // c
return x * 2685821657736338717LL;

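In fact, that is all the code in the function (x is a uint64_t). It's clear that every line reads the state and modifies it, so each statement depends on the one just before it. By contrast, here's xorshift1024*: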

uint64_t s0 = s[ p ];
uint64_t s1 = s[ p = ( p + 1 ) & 15 ];
s1 ^= s1 << 31; // a
s1 ^= s1 >> 11; // b
s0 ^= s0 >> 30; // c
return ( s[ p ] = s0 ^ s1 ) * 1181783497276652981LL;

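Here, the global state is in the uint64_t s[16] and p variables. Given that, it is perhaps not crystal clear, but at least somewhat hinted at, that the line with the // c comment does not share any state with the line before it. So it does a shift and an XOR (i.e. "work") that is independent of the similar work being done just before it. A superscalar processor might therefore be able to run those two lines more or less in parallel.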
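To make that independence easier to see, here is a hedged sketch that wraps the xorshift1024* step in a function and then rewrites it with the two chains separated. The `xs1024_state` struct and both function names are assumptions made for this illustration; the quoted code keeps `s` and `p` as globals. The second function is only a rearrangement for readability and computes the same values as the first.

```c
#include <stdint.h>

/* Hypothetical state holder; the quoted code uses globals s[16] and p. */
typedef struct {
    uint64_t s[16];
    int p;
} xs1024_state;

/* The step as quoted, only moved into a function. */
uint64_t xs1024_next(xs1024_state *st) {
    uint64_t s0 = st->s[st->p];
    uint64_t s1 = st->s[st->p = (st->p + 1) & 15];
    s1 ^= s1 << 31; /* a */
    s1 ^= s1 >> 11; /* b */
    s0 ^= s0 >> 30; /* c */
    return (st->s[st->p] = s0 ^ s1) * 1181783497276652981LL;
}

/* Same computation with the two dependency chains written apart. */
uint64_t xs1024_next_split(xs1024_state *st) {
    uint64_t s0 = st->s[st->p];
    st->p = (st->p + 1) & 15;
    uint64_t s1 = st->s[st->p];

    /* Chain 1 (steps a, b): reads and writes only values derived from s1. */
    uint64_t t1 = s1 ^ (s1 << 31);
    t1 ^= t1 >> 11;

    /* Chain 2 (step c): reads only s0, so it does not wait on chain 1. */
    uint64_t t0 = s0 ^ (s0 >> 30);

    /* The chains meet only here, in the final XOR. */
    uint64_t out = t0 ^ t1;
    st->s[st->p] = out;
    return out * 1181783497276652981LL;
}
```

In xorshift64*, by comparison, there is only one variable, so no such split is possible: steps a, b and c each consume the result of the previous step.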
