Does lock xchg have the same behavior as mfence?

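What I'm wondering is whether lock xchg behaves similarly to mfence from the perspective of one thread accessing a memory location that is being mutated (let's just say at random) by other threads. Does it guarantee that I get the most up-to-date value, including for the memory read/write instructions that follow it?

The reason for my confusion is: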

8.2.2 “Reads or writes cannot be reordered with I/O instructions, locked instructions, or serializing instructions.”

-Intel 64 Developers Manual Vol. 3

Does this apply across threads?

The documentation for mfence states:

Performs a serializing operation on all load-from-memory and store-to-memory instructions that were issued prior the MFENCE instruction. This serializing operation guarantees that every load and store instruction that precedes in program order the MFENCE instruction is globally visible before any load or store instruction that follows the MFENCE instruction is globally visible. The MFENCE instruction is ordered with respect to all load and store instructions, other MFENCE instructions, any SFENCE and LFENCE instructions, and any serializing instructions (such as the CPUID instruction).

-Intel 64 Developers Manual Vol 3A

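That sounds like a stronger guarantee. It sounds as if mfence is almost flushing the write buffer, or at least reaching out to the write buffer and other cores to ensure that my future loads/stores are up to date.

When benchmarked, both instructions take on the order of ~100 cycles to complete, so I can't see much of a difference either way.

Primarily I am just confused. I see instructions based around lock used in mutexes, but these contain no memory fences. Then I see lock-free programming that uses memory fences, but no locks. I understand that AMD64 has a very strong memory model, but stale values can persist in cache. If lock doesn't have the same behavior as mfence, then how do mutexes help you see the most recent value?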

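I believe your question is the same as asking whether mfence has the same barrier semantics as the lock-prefixed instructions on x86, or whether it provides fewer [1] or additional guarantees in some cases.

My current best answer is that Intel's intent, and what the ISA documentation guarantees, is that mfence and locked instructions provide the same fencing semantics, but that due to implementation oversights, mfence actually provides stronger fencing semantics on recent hardware (since at least Haswell). In particular, mfence can fence a subsequent non-temporal load from a WC-type memory region, while locked instructions do not.

We know this because Intel tells us so in processor errata such as HSD162 (Haswell) and SKL155 (Skylake), which state that locked instructions don't fence a subsequent non-temporal read from WC memory: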

MOVNTDQA From WC Memory May Pass Earlier Locked Instructions

Problem: An execution of (V)MOVNTDQA (streaming load instruction) that loads from WC (write combining) memory may appear to pass an earlier locked instruction that accesses a different cache line.

Implication: Software that expects a lock to fence subsequent (V)MOVNTDQA instructions may not operate properly.

Workaround: None identified. Software that relies on a locked instruction to fence subsequent executions of (V)MOVNTDQA should insert an MFENCE instruction between the locked instruction and subsequent (V)MOVNTDQA instruction.
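The workaround quoted above can be sketched with compiler intrinsics. This is only a minimal illustration, not code from Intel: the helper name `locked_then_stream_load` and its parameters are made up for the example, a GCC- or Clang-style compiler is assumed, and the streaming load is compiled in only when SSE4.1 is enabled (otherwise the sketch falls back to an ordinary load so that it still builds). On ordinary write-back memory MOVNTDQA behaves like a normal load, so the erratum itself would only be observable on WC memory, which this sketch does not set up.

```cpp
#include <atomic>
#include <cstdint>
#include <cstring>
#if defined(__SSE4_1__)
#include <smmintrin.h>  // _mm_stream_load_si128 (MOVNTDQA); also provides _mm_mfence
#endif

// Hypothetical helper: update a lock word with a locked instruction, then read
// 16 bytes from `src` (must be 16-byte aligned; WC memory in the erratum scenario).
std::uint32_t locked_then_stream_load(std::atomic<int>& lock_word, void* src) {
    // On x86 this is typically XCHG with a memory operand, which is implicitly locked.
    lock_word.exchange(1);

    alignas(16) std::uint32_t lanes[4];
#if defined(__SSE4_1__)
    // Erratum workaround: MFENCE between the locked instruction and MOVNTDQA.
    _mm_mfence();
    __m128i v = _mm_stream_load_si128(static_cast<__m128i*>(src));
    _mm_store_si128(reinterpret_cast<__m128i*>(lanes), v);
#else
    // Fallback when SSE4.1 is not enabled: portable full fence and a plain copy.
    std::atomic_thread_fence(std::memory_order_seq_cst);
    std::memcpy(lanes, src, sizeof lanes);
#endif
    return lanes[0];  // first 32-bit lane of the loaded data
}
```

Without the `_mm_mfence()` call, the affected parts may let the streaming load appear to pass the locked exchange, which is exactly what the errata describe.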

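From this, we can determine that (1) Intel probably intended locked instructions to fence NT loads from WC-type memory, or else this wouldn't be an erratum [0.5], and (2) locked instructions don't actually do that, and Intel wasn't able to or chose not to fix this with a microcode update, recommending mfence instead.

In Skylake, mfence actually lost its additional fencing capability with respect to NT loads, as per SKL079: MOVNTDQA From WC Memory May Pass Earlier MFENCE Instructions. This one has pretty much the same text as the lock-instruction erratum, but applies to mfence. However, the status of this erratum is "It is possible for the BIOS to contain a workaround for this erratum.", which is generally Intel-speak for "a microcode update addresses this".

This sequence of errata can perhaps be explained by timing: the Haswell erratum only appeared in early 2016, years after the release of that processor, so we can assume the issue came to Intel's attention some moderate amount of time before that. At that point Skylake was almost certainly already out in the wild, apparently with a less conservative mfence implementation that also didn't fence NT loads on WC-type memory regions. Fixing the way locked instructions work all the way back to Haswell was probably either impossible or too expensive given their wide use, but some way to fence NT loads was needed. mfence apparently already did the job on Haswell, and Skylake would be fixed so that mfence worked there too.

This doesn't really explain, however, why SKL079 (the mfence one) appeared in January 2016, nearly two years before SKL155 (the locked one) appeared in late 2017, or why the latter appeared so long after the identical Haswell erratum.

One might speculate on what Intel will do in the future. Since they weren't able or willing to change the lock instruction for Haswell through Skylake, representing hundreds of millions (billions?) of deployed chips, they'll never be able to guarantee that locked instructions fence NT loads, so they might consider making this the documented, architected behavior in the future. Or they might update the locked instructions so that they do fence such reads, but as a practical matter you probably can't rely on this for a decade or more, until chips with the current non-fencing behavior are almost out of circulation.

Similar to Haswell, according to BV116 and BJ138, NT loads may pass earlier locked instructions on Sandy Bridge and Ivy Bridge, respectively. It's possible that earlier microarchitectures also suffer from this issue. This "bug" does not seem to exist in Broadwell or in microarchitectures after Skylake.

Peter Cordes has written a bit about the Skylake mfence change at the end of this answer.

The remaining part of this answer is my original answer, written before I knew about the errata, and is left mostly for historical interest.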

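Old Answer

My informed guess at the answer is that mfence provides additional barrier functionality: between accesses using weakly-ordered instructions (e.g., NT stores) and perhaps between accesses to weakly-ordered regions (e.g., WC-type memory).

That said, this is just an informed guess, and you'll find the details of my investigation below.

Details

Documentation

It isn't exactly clear to what extent the memory consistency effects of mfence differ from those provided by lock-prefixed instructions (including xchg with a memory operand, which is implicitly locked).

I think it is safe to say that solely with respect to write-back memory regions, and not involving any non-temporal accesses, mfence provides the same ordering semantics as a lock-prefixed operation.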
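To make the write-back case concrete, here is a small store-buffering litmus test in C++. It is only a sketch: the names `run_trial` and `count_both_zero` are invented for this example, and the mapping to instructions is an assumption about typical x86 compilers (a sequentially consistent `exchange` usually becomes `xchg`, while `std::atomic_thread_fence(std::memory_order_seq_cst)` becomes `mfence` or a dummy locked instruction, depending on the compiler and version). With either barrier flavor, the outcome where both threads read 0 should not occur.

```cpp
#include <atomic>
#include <functional>
#include <thread>

std::atomic<int> x{0}, y{0};

// One trial of the store-buffering test. Each thread stores 1 to its own
// variable, executes a full barrier, then loads the other thread's variable.
// Returns true if both loads observed 0 (the outcome a full barrier forbids).
bool run_trial(bool use_locked_store) {
    x.store(0);
    y.store(0);
    int r1 = -1, r2 = -1;
    auto body = [use_locked_store](std::atomic<int>& mine,
                                   std::atomic<int>& other, int& out) {
        if (use_locked_store) {
            // Store and barrier in one locked operation.
            mine.exchange(1, std::memory_order_seq_cst);
        } else {
            // Plain store followed by a separate full fence.
            mine.store(1, std::memory_order_relaxed);
            std::atomic_thread_fence(std::memory_order_seq_cst);
        }
        out = other.load();  // seq_cst load; an ordinary MOV on x86
    };
    std::thread t1(body, std::ref(x), std::ref(y), std::ref(r1));
    std::thread t2(body, std::ref(y), std::ref(x), std::ref(r2));
    t1.join();
    t2.join();
    return r1 == 0 && r2 == 0;
}

// Runs `trials` trials and counts how often both threads read 0.
int count_both_zero(bool use_locked_store, int trials) {
    int hits = 0;
    for (int i = 0; i < trials; ++i) {
        hits += run_trial(use_locked_store) ? 1 : 0;
    }
    return hits;
}
```

If the barrier is removed (a relaxed store followed directly by a load), x86 allows the store to sit in the store buffer while the load executes, so both threads can read 0. That reordering is the one both mfence and a locked instruction prevent on WB memory.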

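What is open for debate is whether mfence differs at all from lock-prefixed instructions in scenarios outside the above, in particular when accesses involve regions other than WB regions or when non-temporal (streaming) operations are involved.

For example, you can find some suggestions (such as here or here) that mfence implies strong barrier semantics when WC-type operations (e.g., NT stores) are involved.

For example, quoting Dr. McCalpin in this thread (emphasis added):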

The fence instruction is only needed to be absolutely sure that all of the non-temporal stores are visible before a subsequent "ordinary" store. The most obvious case where this matters is in a parallel code, where the "barrier" at the end of a parallel region may include an "ordinary" store. Without a fence, the processor might still have modified data in the Write-Combining buffers, but pass through the barrier and allow other processors to read "stale" copies of the write-combined data. This scenario might also apply to a single thread that is migrated by the OS from one core to another core (not sure about this case).

I can't remember the detailed reasoning (not enough coffee yet this morning), but the instruction you want to use after the non-temporal stores is an MFENCE. According to Section 8.2.5 of Volume 3 of the SWDM, the MFENCE is the only fence instruction that prevents both subsequent loads and subsequent stores from being executed ahead of the completion of the fence. I am surprised that this is not mentioned in Section 11.3.1, which tells you how important it is to manually ensure coherence when using write-combining, but does not tell you how to do it!
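The pattern described in this quote can be sketched roughly as follows. The function name `publish` and its parameters are hypothetical, chosen only for illustration, and the intrinsics are used only when the compiler reports SSE2 support (otherwise the sketch degrades to ordinary stores so that it still builds). It shows non-temporal stores, then the fence, then the "ordinary" store that other threads would observe.

```cpp
#include <atomic>
#if defined(__SSE2__)
#include <emmintrin.h>  // _mm_stream_si32 (MOVNTI), _mm_mfence
#endif

// Hypothetical producer: fill `dst` using non-temporal stores, then publish
// the buffer by setting `ready` with an ordinary store.
void publish(int* dst, int n, int value, std::atomic<bool>& ready) {
    for (int i = 0; i < n; ++i) {
#if defined(__SSE2__)
        _mm_stream_si32(dst + i, value);  // weakly-ordered NT store
#else
        dst[i] = value;                   // fallback: ordinary store
#endif
    }
#if defined(__SSE2__)
    // The fence suggested in the quote: make the NT stores globally visible
    // before the flag store that follows.
    _mm_mfence();
#endif
    ready.store(true, std::memory_order_release);  // the "ordinary" store
}
```

A consumer would wait for `ready` to become true before reading `dst`. The manual text quoted further down names SFENCE (`_mm_sfence`) as the other fence intended for ordering weakly-ordered stores.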

Let us take a look at the referenced section 8.2.5 of the Intel SDM:

Strengthening or Weakening the Memory-Ordering Model

The Intel 64 and IA-32 architectures provide several mechanisms for strengthening or weakening the memory- ordering model to handle special programming situations. These mechanisms include:

• The I/O instructions, locking instructions, the LOCK prefix, and serializing instructions force stronger ordering on the processor.

• The SFENCE instruction (introduced to the IA-32 architecture in the Pentium III processor) and the LFENCE and MFENCE instructions (introduced in the Pentium 4 processor) provide memory-ordering and serialization capabilities for specific types of memory operations.

These mechanisms can be used as follows:

Memory mapped devices and other I/O devices on the bus are often sensitive to the order of writes to their I/O buffers. I/O instructions can be used to (the IN and OUT instructions) impose strong write ordering on such accesses as follows. Prior to executing an I/O instruction, the processor waits for all previous instructions in the program to complete and for all buffered writes to drain to memory. Only instruction fetch and page tables walks can pass I/O instructions. Execution of subsequent instructions do not begin until the processor determines that the I/O instruction has been completed.

Synchronization mechanisms in multiple-processor systems may depend upon a strong memory-ordering model. Here, a program can use a locking instruction such as the XCHG instruction or the LOCK prefix to ensure that a read-modify-write operation on memory is carried out atomically. Locking operations typically operate like I/O operations in that they wait for all previous instructions to complete and for all buffered writes to drain to memory (see Section 8.1.2, “Bus Locking”).

Program synchronization can also be carried out with serializing instructions (see Section 8.3). These instructions are typically used at critical procedure or task boundaries to force completion of all previous instructions before a jump to a new section of code or a context switch occurs. Like the I/O and locking instructions, the processor waits until all previous instructions have been completed and all buffered writes have been drained to memory before executing the serializing instruction.

The SFENCE, LFENCE, and MFENCE instructions provide a performance-efficient way of ensuring load and store memory ordering between routines that produce weakly-ordered results and routines that consume that data. The functions of these instructions are as follows:

• SFENCE — Serializes all store (write) operations that occurred prior to the SFENCE instruction in the program instruction stream, but does not affect load operations.

• LFENCE — Serializes all load (read) operations that occurred prior to the LFENCE instruction in the program instruction stream, but does not affect store operations.

• MFENCE — Serializes all store and load operations that occurred prior to the MFENCE instruction in the program instruction stream.

Note that the SFENCE, LFENCE, and MFENCE instructions provide a more efficient method of controlling memory ordering than the CPUID instruction.

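Contrary to Dr. McCalpin's interpretation [2], I see this section as somewhat ambiguous as to whether mfence does something extra. The three paragraphs referring to I/O, locked instructions and serializing instructions do imply that they provide a full barrier between memory operations before and after the operation. They make no exception for weakly-ordered memory, and in the case of the I/O instructions one would also assume that they need to work consistently with weakly-ordered memory regions, since such regions are often used for I/O.

Then there is the paragraph on the FENCE instructions, which explicitly mentions weak memory regions: "The SFENCE, LFENCE, and MFENCE instructions provide a performance-efficient way of ensuring load and store memory ordering between routines that produce weakly-ordered results and routines that consume that data."

Do we read between the lines and take this to mean that these are the only instructions that accomplish this, and that the previously mentioned techniques (including locked instructions) don't help for weak memory regions? We can find some support for this idea by noting that the fence instructions were introduced [3] at the same time as the weakly-ordered non-temporal store instructions, and in text like that found in 11.6.13 Cacheability Hint Instructions, which deals specifically with weakly-ordered instructions: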

The degree to which a consumer of data knows that the data is weakly ordered can vary for these cases. As a result, the SFENCE or MFENCE instruction should be used to ensure ordering between routines that produce weakly-ordered data and routines that consume the data. SFENCE and MFENCE provide a performance-efficient way to ensure ordering by guaranteeing that every store instruction that precedes SFENCE/MFENCE in program order is globally visible before a store instruction that follows the fence.

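Again, the fence instructions are specifically mentioned here as being appropriate for fencing weakly-ordered instructions.

We also find support for the idea that locked instructions might not provide a barrier between weakly-ordered accesses in the last sentence of the section already quoted above: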

Note that the SFENCE, LFENCE, and MFENCE instructions provide a more efficient method of controlling memory ordering than the CPUID instruction.

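This basically implies that the FENCE instructions replace functionality previously offered by the serializing cpuid in terms of memory ordering. However, if lock-prefixed instructions provided the same barrier capability as cpuid, they would likely have been the previously suggested approach, since they are in general much faster than cpuid, which often takes 200 or more cycles. The implication is that there were scenarios (probably weakly-ordered scenarios) that lock-prefixed instructions didn't handle and where cpuid was being used, and that mfence is now suggested as a replacement there, implying stronger barrier semantics than lock-prefixed instructions.

However, we could interpret some of the above differently: note that in the context of the fence instructions, it is often mentioned that they are a performance-efficient way to ensure ordering. So it could be that these instructions are not intended to provide additional barriers, but simply more efficient ones.

Indeed, sfence at a few cycles is much faster than serializing instructions like cpuid or lock-prefixed instructions, which generally take 20 cycles or more. On the other hand, mfence isn't generally faster than locked instructions [4], at least on modern hardware. Still, it could have been faster when it was introduced, or on some future design, or perhaps it was expected to be faster but that didn't pan out.

So I can't make a definite assessment based on these sections of the manual: I think you can reasonably argue that they could be interpreted either way.

We can further look at the documentation for various non-temporal store instructions in the Intel ISA guide. For example, in the documentation for the non-temporal store movnti you find the following quote: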

Because the WC protocol uses a weakly-ordered memory consistency model, a fencing operation implemented with the SFENCE or MFENCE instruction should be used in conjunction with MOVNTI instructions if multiple processors might use different memory types to read/write the destination memory locations.

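The part about "if multiple processors might use different memory types to read/write the destination memory locations" is a bit confusing to me. I would rather expect it to say something like "to enforce ordering in the globally visible write order between instructions using weakly ordered hints". Indeed, the actual memory type (e.g., as defined by the MTRRs) probably doesn't even come into play here: the ordering issues can arise solely in WB memory when using weakly-ordered instructions.

Performance

The mfence instruction is reported to take 33 cycles (back-to-back latency) on modern CPUs according to Agner Fog's instruction timings, but a more complex locked instruction like lock cmpxchg is reported to take only 18 cycles.

If mfence provided barrier semantics no stronger than lock cmpxchg, the latter would be doing strictly more work, and there would be no apparent reason for mfence to take significantly longer. Of course, you could argue that lock cmpxchg is simply more important than mfence and hence gets more optimization. That argument is weakened by the fact that all of the locked instructions are considerably faster than mfence, even the infrequently used ones. Also, you would imagine that if there were a single barrier implementation shared by all the lock instructions, mfence would simply use the same one, as that would be the simplest and easiest to validate.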

So the slower performance of mfence is, in my opinion, significant evidence that mfence is doing something extra.
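For readers who want to compare the two on their own hardware, a rough timing harness might look like the sketch below. It is not how the figures above were measured: the function names `ns_per_exchange` and `ns_per_fence` are made up for this example, it reports average wall-clock nanoseconds per back-to-back operation rather than cycles, and it uses `_mm_mfence` only when the compiler reports SSE2 support (falling back to a portable full fence otherwise).

```cpp
#include <atomic>
#include <chrono>
#if defined(__SSE2__)
#include <emmintrin.h>  // _mm_mfence
#endif

std::atomic<long> cell{0};

// Issue one full fence: MFENCE where available, a portable fence otherwise.
inline void full_fence() {
#if defined(__SSE2__)
    _mm_mfence();
#else
    std::atomic_thread_fence(std::memory_order_seq_cst);
#endif
}

// Average nanoseconds per back-to-back atomic exchange
// (typically an implicitly locked XCHG on x86).
double ns_per_exchange(int iters) {
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < iters; ++i) {
        cell.exchange(i);
    }
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::nano>(t1 - t0).count() / iters;
}

// Average nanoseconds per back-to-back full fence.
double ns_per_fence(int iters) {
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < iters; ++i) {
        full_fence();
    }
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::nano>(t1 - t0).count() / iters;
}
```

Calling both functions with a large iteration count and comparing the results gives a first impression. Converting to cycles requires knowing the actual core clock, and results will vary with the microarchitecture and any microcode updates.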

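[0.5] This isn't a water-tight argument. Some things that are apparently "by design" rather than bugs may appear in errata, such as the popcnt false dependency on the destination register, so some errata can be considered a form of documentation that updates expectations rather than always implying a hardware bug.

[1] Evidently, lock-prefixed instructions also perform an atomic operation, which is not possible to achieve with mfence alone, so the lock-prefixed instructions definitely have additional functionality. Therefore, for mfence to be useful, we would expect it either to have additional barrier semantics in some scenarios or to perform better.

[2] It is also entirely possible that he was reading a different version of the manual in which the prose was different.

[3] sfence in SSE, lfence and mfence in SSE2.

[4] And often it is slower: Agner lists it at 33 cycles of latency on recent hardware, while locked instructions are usually about 20 cycles.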