x86-64 usage of LFENCE

Question

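I'm trying to understand the right way to use fences when measuring time with RDTSC/RDTSCP. Several questions on SO related to this have already been answered in detail, and I have gone through a few of them. I have also read this really helpful article on the same topic: http://www.intel.com/content/dam/www/public/us/en/documents/white-papers/ia-32-ia-64-benchmark-code-execution-paper.pdf

However, another online blog has an example that uses LFENCE instead of CPUID on x86. I was wondering how LFENCE prevents earlier stores from polluting the RDTSC measurements. For example: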

<Instr A>
LFENCE/CPUID
RDTSC
<Code to be benchmarked>
LFENCE/CPUID
RDTSC

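In the above case, LFENCE makes sure that all earlier loads complete before it (the SDM says: "LFENCE instructions cannot pass earlier reads"). But what about earlier stores (say, Instr A is a store)? I understand why CPUID works, because it is a serializing instruction, but LFENCE is not.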
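For concreteness, the pattern above could be written in C roughly as follows. This is only a sketch, assuming GCC or Clang on x86-64; the helper names fence_cpuid, fence_lfence, measure and busy_loop, and the variable sink, are made up for illustration and do not come from the blog or the Intel paper.

```c
#include <stdint.h>
#include <x86intrin.h>  /* __rdtsc, _mm_lfence (GCC/Clang) */

/* Hypothetical helper: CPUID is a serializing instruction. */
static inline void fence_cpuid(void) {
    unsigned int a, b, c, d;
    __asm__ __volatile__("cpuid"
                         : "=a"(a), "=b"(b), "=c"(c), "=d"(d)
                         : "a"(0), "c"(0)
                         : "memory");
    (void)a; (void)b; (void)c; (void)d;
}

/* Hypothetical helper: LFENCE waits until prior instructions
   have completed locally. */
static inline void fence_lfence(void) {
    _mm_lfence();
}

/* Stands in for <Instr A> (a store) and for the benchmarked code. */
static volatile uint64_t sink;

/* Placeholder for <Code to be benchmarked>. */
static void busy_loop(void) {
    for (int i = 0; i < 1000; i++)
        sink += (uint64_t)i;
}

/* fence / RDTSC / code / fence / RDTSC, as in the listing above. */
static uint64_t measure(void (*fence)(void), void (*code)(void)) {
    sink = 1;                    /* <Instr A>: an earlier store */
    fence();                     /* LFENCE or CPUID */
    uint64_t start = __rdtsc();
    code();                      /* <Code to be benchmarked> */
    fence();                     /* LFENCE or CPUID */
    uint64_t end = __rdtsc();
    return end - start;
}
```

Calling measure(fence_lfence, busy_loop) and measure(fence_cpuid, busy_loop) gives the two variants being compared. The compiler may still move ordinary code around the intrinsics, so the generated assembly should be checked.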

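One explanation I found is in the Intel SDM, Vol. 3A, Section 8.3, in the following footnote:

LFENCE does provide some guarantees on instruction ordering. It does not execute until all prior instructions have completed locally, and no later instruction begins execution until LFENCE completes.

So LFENCE essentially behaves like MFENCE. In that case, why do we need two separate instructions, LFENCE and MFENCE?

I am probably missing something.

Thanks in advance.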

Answer 1

As you have correctly observed, this is a matter of serialization. As for your question

why do we need two separate instructions LFENCE and MFENCE?

it is answered in the Intel SDM, in section "5.6.4 - SSE2 Cacheability Control and Ordering Instructions":

LFENCE Serializes load operations
MFENCE Serializes load and store operations

So LFENCE is probably used because MFENCE is not necessary for RDTSC.

Answer 2

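The key is the adverb "locally" in the quoted sentence "It does not execute until all prior instructions have completed locally".

I was unable to find a clear definition of "complete locally" anywhere in the Intel manual set; my speculative interpretation follows.

To be complete locally, an instruction must have its outputs computed and available to the other instructions further down its dependency chain. Furthermore, any side effect of that instruction must be visible inside the core.

To be complete globally, an instruction must have its side effects visible to other system components (such as other CPUs).

If we do not qualify which kind of "completeness" we are talking about, it generally means either that the distinction does not matter or that it is implicit in the context.

For a lot of instructions, being complete locally and being complete globally are the same thing.
For a load, for example, some data must be fetched from memory or from the caches in order to be complete locally. This is the same as being complete globally, since we cannot mark the load as complete without first reading from the memory hierarchy.

For a store, however, the situation is different.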

Intel processors have a store buffer to handle writes to memory. From Chapter 11.10 of Volume 3 of the manual:

Intel 64 and IA-32 processors temporarily store each write (store) to memory in a store buffer. The store buffer improves processor performance by allowing the processor to continue executing instructions without having to wait until a write to memory and/or to a cache is complete. It also allows writes to be delayed for more efficient use of memory-access bus cycles.

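So a store can complete locally by being placed in the store buffer; from the core's perspective, it is as if the write had gone all the way to memory.
A load on the same core that issued the store can, under specific circumstances, even read that value back (this is called store forwarding).

To complete globally, however, a store needs to be drained from the store buffer.

Finally, it must be added that the store buffer is drained by serializing instructions: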

The contents of the store buffer are always drained to memory in the following situations:
• (P6 and more recent processor families only) When a serializing instruction is executed.
• (Pentium III, and more recent processor families only) When using an SFENCE instruction to order stores.
• (Pentium 4 and more recent processor families only) When using an MFENCE instruction to order stores.

With the introduction out of the way, let's see what lfence, mfence and sfence do:

LFENCE does not execute until all prior instructions have completed locally, and no later instruction begins execution until LFENCE completes.

MFENCE performs a serializing operation on all load-from-memory and store-to-memory instructions that were issued prior the MFENCE instruction. MFENCE does not serialize the instruction stream.

SFENCE performs a serializing operation on all store-to-memory instructions that were issued prior the SFENCE instruction.

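So lfence is a weaker form of serialization that does not drain the store buffer. Since it effectively serializes instructions locally, all loads before it must complete before it completes.

sfence serializes stores only: it basically does not allow the processor to execute any more stores until the sfence has retired. It also drains the store buffer.

mfence is not a simple combination of the two, because it is not serializing in the classical sense: it is an sfence that also prevents later loads from being executed.

It may be worth noting that sfence was introduced first and the other two came later, to achieve finer-grained control over memory ordering.

Finally, I am used to enclosing an rdtsc instruction between two lfence instructions, to make sure that no reordering is possible either "backward" or "forward".
I am confident that this technique is sound.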
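As a rough illustration of that last habit, here is a minimal C sketch, assuming GCC or Clang on x86-64. The name fenced_rdtsc is made up for this example, and the comments restate the interpretation given above rather than a guarantee taken from the manual.

```c
#include <stdint.h>
#include <x86intrin.h>  /* __rdtsc, _mm_lfence (GCC/Clang) */

/* Hypothetical helper: an RDTSC enclosed between two LFENCEs. */
static inline uint64_t fenced_rdtsc(void) {
    _mm_lfence();            /* prior instructions complete locally first */
    uint64_t t = __rdtsc();  /* read the time-stamp counter */
    _mm_lfence();            /* later instructions do not begin until this completes */
    return t;
}
```

A measurement would then take one fenced_rdtsc() reading before the code under test and one after it, and subtract the two. Note that, as discussed above, earlier stores may still be sitting in the store buffer when the counter is read.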
