依赖链分析

Question

来自 Agner Fog's "Optimizing Assembly" guide，第 12.7 节：循环示例。讨论示例代码的段落之一：

[...] Analysis for Pentium M: ... 13 uops at 3 per clock = one iteration per 4.33c retirement time.

There is a dependency chain in the loop. The latencies are: 2 for memory read, 5 for multiplication, 3 for subtraction, and 3 for memory write, which totals 13 clock cycles. This is three times as much as the retirement time but it is not a loop-carried dependence because the results from each iteration are saved to memory and not reused in the next iteration. The out-of-order execution mechanism and pipelining makes it possible that each calculation can start before the preceding calculation is finished. The only loop-carried dependency chain is add eax,16 which has a latency of only 1.

## Example 12.6b.  DAXPY algorithm, 32-bit mode
[...]   ; not shown: initialize some regs before the loop
L1:
    movapd xmm1, [esi+eax]   ; X[i], X[i+1]
    mulpd  xmm1, xmm2        ; X[i] * DA, X[i+1] * DA
    movapd xmm0, [edi+eax]   ; Y[i], Y[i+1]
    subpd  xmm0, xmm1        ; Y[i]-X[i]*DA, Y[i+1]-X[i+1]*DA
    movapd [edi+eax], xmm0   ; Store result
    add eax, 16              ; Add size of two elements to index
    cmp eax, ecx             ; Compare with n*8
    jl L1                    ; Loop back

我不明白为什么依赖链没有增加整个吞吐量。我知道只有找到最严重的瓶颈才是重要的。在考虑依赖链之前确定的最严重瓶颈是融合域 uop 吞吐量，每次迭代 4.33 个周期。我不明白为什么依赖链不是比这更大的瓶颈。

我看到作者说和乱序执行和流水线有关，但是我看不到。不过，我的意思是，只有乘法会导致延迟 5 个周期，因此只有这个值大于 4 个周期。
我也无法理解为什么作者不关心这里的依赖关系： add eax, 16 -> cmp eax, ecx -> jl L1 毕竟加法必须在cmp之前执行，cmp必须在jl之前执行。

PS：后面的段落将 Pentium M 的最大瓶颈确定为解码，将其限制为每 6c 迭代一次，因为 128b 向量运算每个解码为两个微指令。请参阅 Agner Fog 的指南以了解其余分析以及 Core2、FMA4 Bulldozer 和 Sandybridge 的分析 + 调整。

Answer 1

mul 不是 循环携带 依赖链的一部分，因此可以有来自多个迭代的 mulpd insns 在一次。单个指令的延迟根本不是这里的问题，它是依赖性 chain。每次迭代都有一个 separate 13c 加载、mulpd、subpd、存储的依赖链。乱序执行允许来自多个迭代的微指令同时运行。
每次迭代中的cmp/jl依赖于那次迭代的add，但下一次迭代中的add不依赖' t 取决于 cmp。推测执行和分支预测意味着控制依赖（条件分支和间接 jumps/calls）是数据依赖链的而非部分。这就是为什么来自一个迭代的指令可以在前一迭代的 jl 退出之前开始运行。

相比之下，cmov 是数据依赖而不是控制依赖，因此无分支循环往往具有循环携带的依赖链。如果分支预测良好，这往往比分支慢。

每个循环迭代都有一个单独的cmp/jl依赖链，就像FP依赖链一样。

I cannot understand why dependency chain doesn't increase a whole throughput.

我不知道这句话是什么意思。我想我能够弄清楚你所有其他混淆的单词和措辞。（例如 "chain dependency" 而不是 "dependency chain"。）看看我对你的问题的编辑；其中一些也可能有助于您的理解。

依赖链分析

Dependency chain analysis

performance

x86

assembly

cpu-architecture

micro-optimization