Making sense of Memory Barriers

I am trying to understand memory barriers at a level useful for Java lock-free programmers. That level, I feel, lies somewhere between learning about volatiles and learning about the workings of the Store/Load buffers from an x86 manual.

I spent some time reading a bunch of blogs/cookbooks and came up with the summary below. Would someone knowledgeable please look it over and tell me whether I have missed anything or listed anything incorrectly?

LFENCE

Name             : LFENCE/Load Barrier/Acquire Fence
Barriers         : LoadLoad + LoadStore
Details          : Given sequence {Load1, LFENCE, Load2, Store1}, the
                   barrier ensures that Load1 can't be moved south and
                   Load2 and Store1 can't be moved north of the
                   barrier. 
                   Note that Load2 and Store1 can still be reordered.

Buffer Effect    : Causes the contents of the LoadBuffer 
                   (pending loads) to be processed for that CPU. This
                   makes program state exposed from other CPUs visible
                   to this CPU before Load2 and Store1 are executed.

Cost on x86      : Either very cheap or a no-op.
Java instructions: Reading a volatile variable, Unsafe.loadFence()
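
As a concrete illustration of that acquire pattern, here is a minimal Java sketch; the class and field names are mine, not part of the original summary:

// Hypothetical example (not from the original post): acquire via a volatile read
class AcquireExample {
    int data;                  // plain field, written by a producer thread before ready = true
    volatile boolean ready;    // the volatile read below acts as the acquire fence

    Integer tryConsume() {
        if (ready) {           // volatile load: LoadLoad + LoadStore ordering
            return data;       // cannot be hoisted above the volatile read, so it
        }                      // sees the value stored before ready was set
        return null;
    }
}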

SFENCE

Name             : SFENCE/Store Barrier/Release Fence
Barriers         : StoreStore + LoadStore
Details          : Given sequence {Load1, Store1, SFENCE, Store2, Load2},
                   the barrier ensures that Load1 and Store1 can't be
                   moved south and Store2 can't be moved north of the 
                   barrier.
                   Note that Load1 and Store1 can still be reordered AND 
                   Load2 can be moved north of the barrier.
Buffer Effect    : Causes the contents of the StoreBuffer to be flushed
                   to cache for the CPU on which it is issued.
                   This will make program state visible to other CPUs
                   before Store2 and Load2 are executed.
Cost on x86      : Either very cheap or a no-op.
Java instructions: lazySet(), Unsafe.storeFence(), Unsafe.putOrdered*()
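
For illustration, a minimal Java sketch of the release pattern this row associates with lazySet(); the class and field names are invented for the example:

// Hypothetical example (not from the original post): release via lazySet()
import java.util.concurrent.atomic.AtomicBoolean;

class ReleaseExample {
    int data;                                        // plain field holding the payload
    final AtomicBoolean ready = new AtomicBoolean(); // flag published with an ordered store

    void publish(int value) {
        data = value;          // plain store
        ready.lazySet(true);   // ordered ("release") store: the write to data above
                               // cannot be reordered after this flag store
    }
}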

MFENCE

Name             : MFENCE/Full Barrier/Fence
Barriers         : StoreLoad
Details          : Combines the effects of the other three barriers.
                   Given sequence {Load1, Store1, MFENCE, Store2, Load2},
                   the barrier ensures that Load1 and Store1 can't be
                   moved south and Store2 and Load2 can't be moved north
                   of the barrier.
                   Note that Load1 and Store1 can still be reordered AND
                   Store2 and Load2 can still be reordered.
Buffer Effect    : Causes the contents of the LoadBuffer (pending loads)
                   to be processed for that CPU.
                   AND
                   Causes the contents of the StoreBuffer to be flushed
                   to cache for the CPU on which it is issued.
Cost on x86      : The most expensive kind.
Java instructions: Writing to a volatile, Unsafe.fullFence(), Locks
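
To make the StoreLoad case concrete, here is a minimal Java sketch of the Dekker-style pattern where only a full barrier helps; the class and field names are invented. Because both flags are volatile, the JVM inserts the full barrier after each store, so the two methods, run concurrently, cannot both return false:

// Hypothetical example (not from the original post): why StoreLoad ordering matters
class FullBarrierExample {
    volatile boolean flagA, flagB;

    boolean thread1() {
        flagA = true;      // volatile store; on x86 the JIT typically follows it
                           // with a locked instruction or mfence (StoreLoad barrier)
        return flagB;      // this load cannot be moved before the store above
    }

    boolean thread2() {    // mirror image, run concurrently with thread1()
        flagB = true;
        return flagA;
    }
}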

Finally, if both SFENCE and MFENCE drain the StoreBuffer (invalidating cache lines and waiting for acknowledgements from other CPUs), why is one a no-op and the other a very expensive operation?

Thanks

(Cross-posted to Google's Mechanical Sympathy forum)

You're using Java, so all that really matters is the Java memory model. Compile-time (including JIT) optimizations will reorder your memory accesses within the limits of the Java memory model, not the stronger x86 memory model that the JVM happens to be JIT-compiling for. (See my answer to )

Still, learning about x86 can give your understanding a solid foundation. Just don't fall into the trap of thinking that Java on x86 works the same way as assembly on x86. (Or that the whole world is x86. Many other architectures are weakly ordered, like the Java memory model.)


x86 LFENCE and SFENCE are no-ops as far as memory ordering is concerned, unless you use movnt weakly-ordered cache-bypassing stores. Normal loads are implicitly acquire-loads, and normal stores are implicitly release-stores.


There is an error in your table: according to Intel's instruction set reference manual, SFENCE is "not ordered with respect to load instructions". It is only a StoreStore barrier, not also a LoadStore barrier.

(That link is an HTML conversion of Intel's PDF. See the tag wiki for a link to the official version.)

lfence is a LoadLoad and a LoadStore barrier, so your table is correct there.

But CPUs don't really "buffer" loads ahead of time. They perform them, and start using the results for out-of-order execution as soon as the results are available. (The instructions that use a load result have usually already been decoded and issued before the load result is ready, even on an L1 cache hit.) This is the fundamental difference between loads and stores.


SFENCE is cheap because it doesn't actually have to drain the store buffer. That is one way to implement it, which keeps the hardware simple at the cost of performance.

MFENCE is expensive because it's the only barrier that prevents StoreLoad reordering. See Jeff Preshing's Memory Reordering Caught in the Act for an explanation, along with a test program that demonstrates StoreLoad reordering on real hardware.
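
For a Java flavour of the same experiment, here is a hedged sketch (not Preshing's program; all names are mine): two threads each do a plain store and then a plain load of the other variable. An iteration where both loads return 0 means a store was reordered with the later load, by the JIT and/or by the CPU's store buffer; it may take many iterations to show up, and making x and y volatile removes the outcome.

// Hypothetical test sketch (not Preshing's program): tries to catch StoreLoad reordering
import java.util.concurrent.CyclicBarrier;

public class StoreLoadReorderingTest {
    static int x, y, r1, r2;   // plain fields: no ordering guarantees

    public static void main(String[] args) throws Exception {
        CyclicBarrier start = new CyclicBarrier(3);   // two workers + main
        for (long i = 1; ; i++) {
            x = 0; y = 0;
            Thread t1 = new Thread(() -> { await(start); x = 1; r1 = y; });
            Thread t2 = new Thread(() -> { await(start); y = 1; r2 = x; });
            t1.start(); t2.start();
            await(start);                 // release both workers at roughly the same time
            t1.join(); t2.join();         // join gives happens-before, so main sees r1/r2
            if (r1 == 0 && r2 == 0) {
                System.out.println("StoreLoad reordering observed on iteration " + i);
                return;
            }
        }
    }

    static void await(CyclicBarrier barrier) {
        try {
            barrier.await();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}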

Jeff Preshing's blog posts are gold for understanding lock-free programming and memory-ordering semantics. I usually link to his blog in my SO answers to memory-ordering questions. You can probably use search to find those answers if you're interested in reading more of what I've written (mostly C++ / asm, not Java).


Fun fact: any atomic read-modify-write operation on x86 is also a full memory barrier. The lock prefix implied on xchg [mem], reg makes it a full barrier too. Before mfence existed, lock add [esp], 0 was a common idiom for a memory barrier that is otherwise a no-op. (Stack memory is almost always hot in L1, and not shared.)

So on x86, incrementing an atomic counter has the same performance no matter which memory-ordering semantics you request (e.g. C++11 memory_order_relaxed vs. memory_order_seq_cst (sequential consistency)). Still, use whichever memory-ordering semantics are appropriate, because other architectures can do atomic operations without a full memory barrier. Forcing the compiler/JVM to use a memory barrier when you don't need one is a waste.
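
As a rough Java counterpart of that point, here is a sketch assuming the Java 9+ VarHandle API; the class, fields, and method names are invented. On x86 both increments end up as the same lock-prefixed instruction, while on weakly ordered architectures the release-only form can avoid a full barrier.

// Hypothetical example (not from the original post): same increment, different requested ordering
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.util.concurrent.atomic.AtomicLong;

class Counters {
    private final AtomicLong seqCstCounter = new AtomicLong();

    private long weakCounter;                 // updated only through the VarHandle below
    private static final VarHandle WEAK;
    static {
        try {
            WEAK = MethodHandles.lookup()
                    .findVarHandle(Counters.class, "weakCounter", long.class);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    void incrementSeqCst() {
        seqCstCounter.incrementAndGet();      // full, sequentially consistent ordering
    }

    void incrementRelease() {
        WEAK.getAndAddRelease(this, 1L);      // still atomic, but only release ordering requested
    }
}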