C++ How is release-and-acquire achieved on x86 only using MOV?

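This question is a follow-up to, and clarification of, this one:

That states that the MOV assembly instruction is sufficient to implement acquire-release semantics on x86: we don't need LOCK, fences, xchg, etc. However, I am struggling to understand how this works.

Intel's documentation, Volume 3A, Chapter 8, states: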

https://software.intel.com/sites/default/files/managed/7c/f1/253668-sdm-vol-3a.pdf

In a single-processor (core) system....

  • Reads are not reordered with other reads.
  • Writes are not reordered with older reads.
  • Writes to memory are not reordered with other writes, with the following exceptions:

But this is for a single core. The multi-core section does not seem to mention how loads are enforced:

In a multiple-processor system, the following ordering principles apply:

  • Individual processors use the same ordering principles as in a single-processor system.
  • Writes by a single processor are observed in the same order by all processors.
  • Writes from an individual processor are NOT ordered with respect to the writes from other processors.
  • Memory ordering obeys causality (memory ordering respects transitive visibility).
  • Any two stores are seen in a consistent order by processors other than those performing the stores
  • Locked instructions have a total order.

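So how can MOV alone provide acquire-release?

To refresh the semantics of acquire and release (quoting cppreference rather than the standard, because it's what I have on hand; the standard is more... verbose here):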

memory_order_acquire: A load operation with this memory order performs the acquire operation on the affected memory location: no reads or writes in the current thread can be reordered before this load. All writes in other threads that release the same atomic variable are visible in the current thread

memory_order_release: A store operation with this memory order performs the release operation: no reads or writes in the current thread can be reordered after this store. All writes in the current thread are visible in other threads that acquire the same atomic variable

This gives us four things to guarantee:

  • acquire ordering: "no reads or writes in the current thread can be reordered before this load"
  • release ordering: "no reads or writes in the current thread can be reordered after this store"
  • acquire-release synchronization:
    • "all writes in other threads that release the same atomic variable are visible in the current thread"
    • "all writes in the current thread are visible in other threads that acquire the same atomic variable"
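The four guarantees can be seen together in a small message-passing sketch. The names `payload`, `ready`, `producer`, `consumer` and `run_message_passing` are hypothetical and chosen only for illustration. On x86, mainstream compilers typically compile both the release store and the acquire load below to plain MOV instructions, which is the situation the question asks about.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical names for illustration only.
int payload = 0;                 // ordinary, non-atomic data
std::atomic<bool> ready{false};  // flag that publishes the data

void producer() {
    payload = 42;                                  // plain write
    // Release store: the write to payload cannot be reordered after it.
    ready.store(true, std::memory_order_release);
}

int consumer() {
    // Acquire load: the read of payload cannot be reordered before it.
    while (!ready.load(std::memory_order_acquire)) {
        // spin until the producer has published
    }
    // The acquire load read the value written by the release store,
    // so the earlier write to payload is visible here.
    return payload;
}

int run_message_passing() {
    payload = 0;
    ready.store(false, std::memory_order_relaxed);
    int seen = -1;
    std::thread t1(producer);
    std::thread t2([&seen] { seen = consumer(); });
    t1.join();
    t2.join();
    assert(seen == 42);  // guaranteed by acquire-release synchronization
    return seen;
}
```

Calling `run_message_passing()` from `main` exercises the pattern once. The assertion holds on any conforming implementation, not only on x86; what is x86-specific is that no extra fence instruction is needed to get it.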

Reviewing the guarantees:

  • Reads are not reordered with other reads.
  • Writes are not reordered with older reads.
  • Writes to memory are not reordered with other writes [..]
  • Individual processors use the same ordering principles as in a single-processor system.

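This is sufficient to satisfy the ordering guarantees.

For acquire ordering, consider that a read of the atomic has occurred: for that thread, clearly any later read or write migrating before it would violate the first or the second bullet point, respectively.

For release ordering, consider that a write of the atomic has occurred: for that thread, clearly any earlier read or write migrating after it would violate the second or the third bullet point, respectively.

The only thing remaining is to ensure that if a thread reads a released store, it will see all the other writes the writer thread had produced up to that point. This is where the other multi-processor guarantee is needed.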


  • Writes by a single processor are observed in the same order by all processors.

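This is sufficient to satisfy acquire-release synchronization.

We've already established that when the release write occurs, all other writes prior to it will also have occurred. This bullet point then ensures that if another thread reads the released write, it will read all the writes the writer produced up to that point. (If it did not, it would be observing that single processor's writes in a different order than the processor performed them, violating the bullet point.)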

but this is for a single core. The multi-core section does not seem to mention how loads are enforced:

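The first bullet point in that section is key: Individual processors use the same ordering principles as in a single-processor system. The implicit part of that statement is "... when loading/storing from cache-coherent shared memory". That is, multi-processor systems don't introduce new ways of reordering; they just mean the possible observers now include code on other cores instead of only DMA / IO devices.

The model for reordering of accesses to shared memory is the single-core model, i.e. program order + a store buffer = basically acq_rel. It is actually slightly stronger than acq_rel, which is fine.

The only reordering that happens is local, within each CPU core. Once a store becomes globally visible, it becomes visible to all other cores at the same time, and it was not visible to any core before that. (Except to the core doing the store, via store forwarding.) That's why local barriers alone are sufficient to recover sequential consistency on top of an SC + store-buffer model. (For x86, mo_seq_cst just needs an mfence after SC stores, to drain the store buffer before any further loads can execute. mfence and locked instructions (which are also full barriers) don't have to bother other cores; they just make this one wait.)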
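The one reordering a store buffer permits, StoreLoad, is what the classic "store buffer" litmus test probes. The sketch below is a hypothetical harness (the names `store_buffer_litmus`, `check_seq_cst_forbids_both_zero` and `count_both_zero_acq_rel` are made up for illustration). With release stores and acquire loads, which x86 compilers typically emit as plain MOVs, both threads may read 0. With seq_cst, where compilers typically emit xchg or MOV followed by mfence for the store, that outcome is forbidden by the C++ memory model.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <utility>

// Hypothetical litmus harness: each thread stores to one variable,
// then loads the other. Returns the two loaded values.
template <std::memory_order StoreOrder, std::memory_order LoadOrder>
std::pair<int, int> store_buffer_litmus() {
    std::atomic<int> x{0}, y{0};
    int r1 = -1, r2 = -1;
    std::thread a([&] {
        x.store(1, StoreOrder);
        r1 = y.load(LoadOrder);  // may run while the store sits in the store buffer
    });
    std::thread b([&] {
        y.store(1, StoreOrder);
        r2 = x.load(LoadOrder);
    });
    a.join();
    b.join();
    return {r1, r2};
}

// With seq_cst, r1 == 0 && r2 == 0 is impossible: some store is first
// in the single total order, so the other thread's load must see it.
void check_seq_cst_forbids_both_zero(int iterations) {
    for (int i = 0; i < iterations; ++i) {
        auto r = store_buffer_litmus<std::memory_order_seq_cst,
                                     std::memory_order_seq_cst>();
        assert(!(r.first == 0 && r.second == 0));
    }
}

// With release/acquire the (0, 0) outcome is allowed. This counts how
// often it was actually observed, which may well be zero.
int count_both_zero_acq_rel(int iterations) {
    int count = 0;
    for (int i = 0; i < iterations; ++i) {
        auto r = store_buffer_litmus<std::memory_order_release,
                                     std::memory_order_acquire>();
        if (r.first == 0 && r.second == 0) ++count;
    }
    return count;
}
```

Because thread start-up costs far more than the race window, this simple harness will rarely, if ever, catch the (0, 0) outcome in the acquire/release variant; a serious litmus tool would synchronize the two threads' start. The point of the sketch is the asymmetry: the seq_cst check can be asserted, the acquire/release one can only be counted.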

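One key point to understand is that there is a coherent shared view of memory (through coherent caches) that all processors share. The very top of Chapter 8 of Intel's SDM defines some of this background: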

These multiprocessing mechanisms have the following characteristics:

  • To maintain system memory coherency — When two or more processors are attempting simultaneously to access the same address in system memory, some communication mechanism or memory access protocol must be available to promote data coherency and, in some instances, to allow one processor to temporarily lock a memory location.
  • To maintain cache consistency — When one processor accesses data cached on another processor, it must not receive incorrect data. If it modifies data, all other processors that access that data must receive the modified data.
  • To allow predictable ordering of writes to memory — In some circumstances, it is important that memory writes be observed externally in precisely the same order as programmed.
  • [...]

The caching mechanism and cache consistency of Intel 64 and IA-32 processors are discussed in Chapter 11.

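(CPUs use some variant of MESI; Intel in practice uses MESIF, AMD in practice uses MOESI.)

The same chapter also includes some litmus tests that help illustrate / define the memory model. The parts you quoted aren't a strictly formal definition of the memory model. But section 8.2.3.2, Neither Loads Nor Stores Are Reordered with Like Operations, shows that loads aren't reordered with loads. Another section also shows that LoadStore reordering is forbidden. Acq_rel is basically blocking all reordering except StoreLoad, and that's what x86 does. (https://preshing.com/20120913/acquire-and-release-semantics/ and https://preshing.com/20120930/weak-vs-strong-memory-models/)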

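Related:

  • - asking why acq_rel doesn't need barriers, but coming at it from a different angle (wondering how data ever becomes visible to other cores).
  • How do memory_order_seq_cst and memory_order_acq_rel differ? (seq_cst requires flushing the store buffer.)
  • program-order + store buffer isn't exactly the same as acq_rel, especially once you consider a load that only partially overlaps a recent store.
  • x86-TSO: A Rigorous and Usable Programmer’s Model for x86 Multiprocessors - a formal memory model for x86.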

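Other ISAs

In general, most weaker hardware memory models also allow only local reordering, so barriers are still local to a CPU core, just making (some part of) that core wait until some condition is met. (For example, x86 mfence blocks later loads and stores from executing until the store buffer drains. Other ISAs also benefit from light-weight barriers, for efficiency, for things that x86 enforces between every memory operation, e.g. blocking LoadLoad and LoadStore reordering. https://preshing.com/20120930/weak-vs-strong-memory-models/ )

A few ISAs (only PowerPC these days) allow stores to become visible to some other cores before they become visible to all, which permits IRIW reordering. Note that mo_acq_rel in C++ allows IRIW reordering; only seq_cst forbids it. Most HW memory models are slightly stronger than ISO C++ and make it impossible, so all cores agree on the global order of stores.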
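IRIW (independent reads of independent writes) can be sketched as a four-thread litmus test. The function names `iriw_readers_disagree` and `check_seq_cst_forbids_iriw` are hypothetical. The version below uses seq_cst throughout, for which ISO C++ forbids the two readers from disagreeing about the order of the two stores; weakening the operations to release/acquire would make that disagreement permitted by the language, even though x86 hardware would still not produce it.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical IRIW litmus test: two independent writers, two readers
// that read the two variables in opposite orders.
// Returns true if the readers saw the two stores in opposite orders.
bool iriw_readers_disagree() {
    std::atomic<int> x{0}, y{0};
    int r1 = -1, r2 = -1, r3 = -1, r4 = -1;

    std::thread w1([&] { x.store(1, std::memory_order_seq_cst); });
    std::thread w2([&] { y.store(1, std::memory_order_seq_cst); });
    std::thread rd1([&] {
        r1 = x.load(std::memory_order_seq_cst);
        r2 = y.load(std::memory_order_seq_cst);
    });
    std::thread rd2([&] {
        r3 = y.load(std::memory_order_seq_cst);
        r4 = x.load(std::memory_order_seq_cst);
    });
    w1.join();
    w2.join();
    rd1.join();
    rd2.join();

    // rd1 saw x before y, while rd2 saw y before x.
    return r1 == 1 && r2 == 0 && r3 == 1 && r4 == 0;
}

// seq_cst operations form a single total order, so the disagreeing
// outcome must never be observed.
void check_seq_cst_forbids_iriw(int iterations) {
    for (int i = 0; i < iterations; ++i) {
        assert(!iriw_readers_disagree());
    }
}
```

As with any litmus test, a passing run proves nothing about weaker orderings; it only confirms that the seq_cst guarantee was not violated in the runs performed.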