Is mov r64, m64 one cycle or two cycle latency?

Question

I'm on IvyBridge, and I wrote the following simple program to measure the latency of mov:

section .bss
align   64
buf:    resb    64

section .text
global _start
_start:
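    mov rcx,    1000000000      ; iteration count
    xor rax,    rax             ; rax = 0, so the first load reads buf+0
loop:
    mov rax,    [buf+rax]       ; buf is zero-filled (.bss), so rax stays 0 and each load depends on the previous one

    dec rcx
    jne loop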

    xor rdi,    rdi
    mov rax,    60
    syscall

perf shows the result:

 5,181,691,439      cycles

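So every iteration has a latency of about 5 cycles (5,181,691,439 cycles over 1,000,000,000 iterations). From several online resources I found that the L1 cache latency is 4 cycles. Therefore the latency of mov itself should be 1.

However, Agner's instruction table shows that mov r64, m64 has a latency of 2 cycles on IvyBridge. I don't know where else to look up this latency.

Is my measurement program above wrong? Why does it show the mov latency as 1 rather than 2?

(I got the same result using the L2 cache: if buf+rax misses in L1 and hits in L2, a similar measurement shows that mov rax, [buf+rax] has a latency of 12 cycles. IvyBridge has an 11-cycle L2 cache latency, so the mov latency is still 1 cycle.)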

Answer 1

Therefore the latency of mov itself should be 1.

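No, the mov is the load. There isn't also an ALU mov operation that the data has to go through.

Agner Fog's instruction tables don't contain load-use latency (like you're measuring). It's in his microarch PDF, in the tables in the "Cache and memory access" section for each uarch. E.g. SnB/IvB (Section 9.13) has a "Level 1 data" row with "32 kB, 8 way, 64 B line size, latency 4, per core".

This 4-cycle latency is the load-use latency for a chain of dependent instructions like mov rax, [rax]. You're measuring 5 cycles because you're using an addressing mode other than [reg + 0..2047]. With small displacements, the load unit speculates that using the base register directly as the input to the TLB lookup will give the same result as using the adder result. So your addressing mode, [disp32 + rax], uses the normal path, waiting one more cycle for the adder result before the load port starts the TLB lookup.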
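As a rough illustration of the two kinds of dependency chain (my own sketch, not code from the question or from Agner's docs), here is a C version. The names chase_pointer and chase_index are hypothetical. The assumption is that an optimizing x86-64 compiler usually turns the first loop into something like mov rax, [rax] and the second into an indexed load, but that is up to the compiler, so check the generated asm before trusting any timing.

```c
#include <assert.h>
#include <stddef.h>

/* Dependent load chain through a pointer: p = *p.
   Assumption: compilers typically emit a plain [reg] load here. */
void *chase_pointer(void *start, long iters)
{
    assert(start != NULL);
    void *p = start;
    for (long i = 0; i < iters; i++)
        p = *(void *volatile *)p;   /* volatile forces a real load each iteration */
    return p;
}

/* Dependent load chain through an index: n = buf[n].
   This mirrors the [buf + rax] style of addressing in the question. */
size_t chase_index(const volatile size_t *buf, size_t n, long iters)
{
    assert(buf != NULL);
    for (long i = 0; i < iters; i++)
        n = buf[n];                 /* the next index comes from the loaded value */
    return n;
}
```

With a cell that points to itself (or a zero-filled buffer for the indexed version), every load hits the same L1d line, so running the binary under perf stat and dividing cycles by iters approximates the load-use latency of whichever addressing mode the compiler picked.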

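For most operations between different domains (like integer registers and XMM registers), you can only really measure a round trip like movd xmm0, eax / mov eax, xmm0, and it's hard to pick that apart and figure out the latency of each instruction separately¹.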

For loads, you can chain to another load to measure cache load-use latency, rather than a chain of store/reload.
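To chain loads over more than one location (for example, to make the chain miss in L1 and hit in L2, like the second experiment in the question), one common approach is to link the slots of a buffer into a single random cycle, so each load's address comes from the previous load. This is my addition, not something from the answer: build_ring, ring_length, and lcg_next are hypothetical names, the shuffle is Sattolo's algorithm, and the LCG constants are the ones from Knuth's MMIX generator.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Small LCG so the sketch does not depend on rand(). */
static uint64_t lcg_next(uint64_t *state)
{
    *state = *state * 6364136223846793005ULL + 1442695040888963407ULL;
    return *state >> 33;
}

/* Fill next[0..n-1] so that following i -> next[i] visits every slot
   exactly once before returning to the start (Sattolo's algorithm). */
void build_ring(size_t *next, size_t n, uint64_t seed)
{
    assert(next != NULL && n > 0);
    for (size_t i = 0; i < n; i++)
        next[i] = i;
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = (size_t)(lcg_next(&seed) % i);   /* 0 <= j < i */
        size_t tmp = next[i];
        next[i] = next[j];
        next[j] = tmp;
    }
}

/* Walk the ring once and count the steps back to the start. */
size_t ring_length(const size_t *next, size_t start)
{
    size_t len = 0, i = start;
    do {
        i = next[i];
        len++;
    } while (i != start);
    return len;
}
```

For a real latency test you would also want each slot on its own cache line and a working set sized for the cache level you care about; this sketch only shows how to build the dependency chain.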

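Agner for some reason decided to only look at store-forwarding latency for his tables, and made a totally arbitrary choice of how to split up the store-forwarding latency between the store and the reload.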

(from the "definition of terms" sheet of his instruction table spreadsheet, way at the left after the Introduction)

It is not possible to measure the latency of a memory read or write instruction with software methods. It is only possible to measure the combined latency of a memory write followed by a memory read from the same address. What is measured here is not actually the cache access time, because in most cases the microprocessor is smart enough to make a "store forwarding" directly from the write unit to the read unit rather than waiting for the data to go to the cache and back again. The latency of this store forwarding process is arbitrarily divided into a write latency and a read latency in the tables. But in fact, the only value that makes sense to performance optimization is the sum of the write time and the read time.

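This is obviously incorrect: L1d load-use latency is a thing, for pointer-chasing through levels of indirection. You could argue that it's just variable because some loads can miss in cache, but if you're going to pick something to put in your table, you might as well pick L1d load-use latency. And then compute store latency numbers such that store + load latency = store-forwarding latency, like now. Intel Atom would then have store latency = -2, because it has 3c L1d load-use latency but 1c store-forwarding, according to Agner's uarch guide.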
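The arithmetic behind that negative number, as a tiny sketch (store_latency_cycles is a hypothetical helper name, and the split itself is just the bookkeeping proposed above, not anything a CPU actually does):

```c
#include <assert.h>

/* Pin the load part to the L1d load-use latency and give the store
   whatever is left of the measured store-forwarding latency. */
int store_latency_cycles(int store_forward_cycles, int l1d_load_use_cycles)
{
    assert(store_forward_cycles >= 0 && l1d_load_use_cycles >= 0);
    return store_forward_cycles - l1d_load_use_cycles;
}
```

With the Atom numbers quoted above (1c store-forwarding, 3c load-use), store_latency_cycles(1, 3) gives -2. A negative entry looks odd in a table, but the sum of the two entries still equals the store-forwarding latency.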

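This is less easy for loads into XMM or YMM registers, for example, but still possible once you work out the latency of movq rax, xmm0. It's harder for x87 registers, because there's no way to get data directly from st0 to eax/rax through the ALU instead of a store/reload. But perhaps you could do something with an FP compare like fucomi that sets integer FLAGS directly (on CPUs that have it: P6 and later).

Still, it would have been much better for at least the integer load latency to reflect pointer-chasing latency. IDK if anyone has offered to update Agner's tables for him, or if he would accept such an update. It would require fresh testing on most uarches to make sure you had the right load-use latency for different register sets, though.

Footnote 1: For example, http://instlatx64.atw.hu doesn't try, and just says "diff. reg. set" in the latency column, with useful data only in the throughput column. But they have lines for the MOVD r64, xmm + MOVD xmm, r64 round trip, in this case 2 cycles total on IvB, so we can be pretty confident they're only 1c each way. Not zero one way. :P

But for loads into integer registers, they do show the 4-cycle load-use latency on IvB for MOV r32, [m32], because apparently they test with a [reg + 0..2047] addressing mode.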

https://uops.info/ is quite good, but gives pretty loose bounds on latency: IIRC, they construct a loop with a round trip (e.g. store and reload, or xmm->integer and integer->xmm), and then give an upper bound on latency assuming that every other step was only 1 cycle.

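Other sources of cache latency info:

https://www.7-cpu.com/ has good details for lots of other uarches, even many non-x86 ones like ARM, MIPS, PowerPC, and IA-64.

The pages also have other details such as cache and TLB sizes, TLB timing, branch-miss experiment results, and memory bandwidth. The cache latency details look like this: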

(from their Skylake page)

L1 Data Cache Latency = 4 cycles for simple access via pointer

L1 Data Cache Latency = 5 cycles for access with complex address calculation (size_t n, *p; n = p[n]).

L2 Cache Latency = 12 cycles

L3 Cache Latency = 42 cycles (core 0) (i7-6700 Skylake 4.0 GHz)

L3 Cache Latency = 38 cycles (i7-7700K 4 GHz, Kaby Lake)

RAM Latency = 42 cycles + 51 ns (i7-6700 Skylake)
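The RAM line mixes units (core cycles plus nanoseconds), so comparing it with the cache lines needs a clock frequency. A small sketch of the conversion, with hypothetical helper names; the 4.0 GHz in the note below is taken from the L3 line for the same i7-6700, and the RAM line itself doesn't state a clock, so treat that as an assumption:

```c
#include <assert.h>

/* Total latency in nanoseconds: the cycle part scales with the core clock. */
double ram_latency_ns(double core_cycles, double extra_ns, double core_ghz)
{
    assert(core_ghz > 0.0);
    return core_cycles / core_ghz + extra_ns;
}

/* Total latency in core cycles: the nanosecond part scales with the core clock. */
double ram_latency_cycles(double core_cycles, double extra_ns, double core_ghz)
{
    assert(core_ghz > 0.0);
    return core_cycles + extra_ns * core_ghz;
}
```

At an assumed 4.0 GHz, "42 cycles + 51 ns" works out to 42 / 4.0 + 51 = 61.5 ns, or 42 + 51 * 4.0 = 246 core cycles.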
