How do store operations work in memory performance?

Question

I am using the textbook Computer Systems: A Programmer's Perspective by Randal E. Bryant and David R. O'Hallaron [3rd ed.] (2016, Pearson), and there is a section I don't quite understand.

C code:

void write_read(long *src, long *dst, long n)
{
 long cnt = n;
 long val = 0;

 while (cnt) {
  *dst = val;
  val = (*src)+1;
  cnt--;
 }
}

Inner loop of write_read:

# src in %rdi, dst in %rsi, val in %rax
 .L3:
    movq %rax, (%rsi)  # Write val to dst
    movq (%rdi), %rax  # t = *src
    addq $1, %rax      # val = t+1
    subq $1, %rdx      # cnt--
    jne .L3            # If != 0, goto loop

Given this code, the textbook gives a diagram (Figure 5.35 in the book) to describe the program flow.

Here is the explanation given, for those who don't have access to the textbook:

Figure 5.35 shows a data-flow representation of this loop code. The instruction movq %rax,(%rsi) is translated into two operations: The s_addr instruction computes the address for the store operation, creates an entry in the store buffer, and sets the address field for that entry. The s_data operation sets the data field for the entry. As we will see, the fact that these two computations are performed independently can be important to program performance. This motivates the separate functional units for these operations in the reference machine.

In addition to the data dependencies between the operations caused by the writing and reading of registers, the arcs on the right of the operators denote a set of implicit dependencies for these operations. In particular, the address computation of the s_addr operation must clearly precede the s_data operation.

In addition, the load operation generated by decoding the instruction movq (%rdi), %rax must check the addresses of any pending store operations, creating a data dependency between it and the s_addr operation. The figure shows a dashed arc between the s_data and load operations. This dependency is conditional: if the two addresses match, the load operation must wait until the s_data has deposited its result into the store buffer, but if the two addresses differ, the two operations can proceed independently.

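a) What I'm not clear about is why, for the line movq %rax,(%rsi), the load has to be done after s_data is called. I assume that when s_data is called, the value of %rax is stored at the location pointed to by the address in %rsi? Does this mean that a load call is needed after every s_data?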

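b) It isn't really shown in the figure, but from what I understand of the book's explanation, the line movq (%rdi), %rax needs its own set of s_addr and s_data? So is it accurate to say that every movq call needs s_addr and s_data calls, followed by a check for matching addresses, before load is called?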

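I'm quite confused by these parts. If someone could explain how the s_addr and s_data calls work together with load, and when these functions are needed, I would greatly appreciate it. Thanks!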

Answer 1

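The operations inside the blue box are micro-operations (also called uops or micro-instructions) issued by the pipeline decoder. They are part of the program being executed. The movq (%rdi), %rax instruction is decoded into a load uop. A uop is the unit of execution in the pipeline. Uops are not called; they are executed.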

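According to the hypothetical processor design discussed in the book, a simple store instruction such as movq %rax, (%rsi) is decoded into two uops called s_addr and s_data. This also happens in real x86 processors. One reason a macro-instruction may be decoded into multiple uops is that the uop format cannot hold all the information given in the instruction, for example when the instruction has too many operands or represents a complex task. Another reason is to increase instruction-level parallelism. The address of a store and the data of a store may become available in different cycles. If the address is available but the data is not, the s_addr uop can be dispatched to the load-store unit so that downstream load uops can compare their addresses against the store's address earlier, without having to wait for the store's data. The process of determining whether a later load depends on an earlier store is called memory disambiguation. If the load movq (%rdi), %rax does not overlap with the earlier store movq %rax, (%rsi), it can be executed right away, regardless of whether the value in %rax is ready.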
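To make the two cases concrete at the C level, here is a minimal sketch that runs write_read once with distinct src and dst and once with src and dst pointing to the same location. The helper names run_distinct and run_same and the initial array values are assumptions made for illustration; they are not part of the question's code.

```c
/* write_read exactly as given in the question. */
void write_read(long *src, long *dst, long n)
{
    long cnt = n;
    long val = 0;

    while (cnt) {
        *dst = val;
        val = (*src) + 1;
        cnt--;
    }
}

/* Hypothetical helper: src and dst are different addresses, so each
 * load of *src is independent of the preceding store to *dst.
 * Returns the final value of a[1]. */
long run_distinct(void)
{
    long a[2] = {-10, 17};          /* illustrative values */
    write_read(&a[0], &a[1], 3);
    return a[1];
}

/* Hypothetical helper: src and dst are the same address, so each
 * load of *src must observe the value written by the store just
 * before it. Returns the final value of a[0]. */
long run_same(void)
{
    long a[2] = {-10, 17};          /* illustrative values */
    write_read(&a[0], &a[0], 3);
    return a[0];
}
```

In run_distinct the loaded value never changes, so the store data quickly settles at -9. In run_same every load reads what the previous store wrote, which is the case where the load has to wait for s_data; the location counts up 0, 1, 2 over the three iterations.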

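When the s_data uop is executed, the value in %rax is stored in the data field of the store buffer entry that was allocated for the store. The value is written to the target memory location only after all earlier instructions have completed execution, in order to maintain program order.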
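The interaction between s_addr, s_data, and load can be sketched as a toy software model of a store buffer. This is only an illustration of the book's simplified description, not how any real processor is implemented. The names s_addr, s_data, and load follow the book; store_buffer, sb_entry, load_status, commit_all, and SB_CAPACITY are hypothetical names introduced here. The sketch also assumes that every pending store's address is already known and that addresses match only when they are exactly equal.

```c
#include <stdbool.h>
#include <stddef.h>

#define SB_CAPACITY 8   /* arbitrary size chosen for this sketch */

/* One pending store: address and data are filled in separately. */
typedef struct {
    long *addr;         /* set by s_addr */
    long data;          /* set by s_data */
    bool data_valid;    /* false until s_data has executed */
} sb_entry;

typedef struct {
    sb_entry entries[SB_CAPACITY];
    size_t count;       /* entries[0] is the oldest pending store */
} store_buffer;

typedef enum { LOAD_FROM_MEMORY, LOAD_FORWARDED, LOAD_MUST_WAIT } load_status;

/* s_addr: create an entry and record the store address.
 * Returns the entry index, or -1 if the buffer is full. */
int s_addr(store_buffer *sb, long *addr)
{
    if (sb->count == SB_CAPACITY)
        return -1;
    sb->entries[sb->count].addr = addr;
    sb->entries[sb->count].data_valid = false;
    return (int)sb->count++;
}

/* s_data: fill in the data field of an existing entry. */
void s_data(store_buffer *sb, int idx, long value)
{
    sb->entries[idx].data = value;
    sb->entries[idx].data_valid = true;
}

/* load: compare the load address with pending stores, youngest first.
 * A match without data means the load must wait for s_data; a match
 * with data is served from the buffer; no match reads memory. */
load_status load(const store_buffer *sb, const long *addr, long *out)
{
    for (size_t i = sb->count; i-- > 0; ) {
        const sb_entry *e = &sb->entries[i];
        if (e->addr != addr)
            continue;
        if (!e->data_valid)
            return LOAD_MUST_WAIT;
        *out = e->data;
        return LOAD_FORWARDED;
    }
    *out = *addr;
    return LOAD_FROM_MEMORY;
}

/* commit_all: write pending stores to memory, oldest first.
 * Precondition for this sketch: every entry already has its data. */
void commit_all(store_buffer *sb)
{
    for (size_t i = 0; i < sb->count; i++)
        *sb->entries[i].addr = sb->entries[i].data;
    sb->count = 0;
}
```

In this model, a load to a different address returns LOAD_FROM_MEMORY as soon as s_addr has run, even if s_data has not. A load to the same address returns LOAD_MUST_WAIT until s_data runs, which corresponds to the conditional dashed arc described in the book's explanation.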

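The book says "the address computation of the s_addr operation must clearly precede the s_data operation", probably because, according to the book, the s_addr uop must first create an entry in the store buffer before data can be stored in it. This may be fine for the hypothetical design, but it is an unnecessary dependency, because allocation can be done before execution. The book doesn't discuss resource allocation and reclamation anyway.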

A simple load instruction is decoded into a single load uop. There is no reason to split a load into multiple uops.

optimization

x86

assembly

cpu-architecture