Is mfence for rdtsc necessary on x86_64 platform?

Question

unsigned int lo = 0;
unsigned int hi = 0;
__asm__ __volatile__ (
    "mfence;rdtsc" : "=a"(lo), "=d"(hi) : : "memory"
);

Is the mfence in the code above necessary?

Based on my test, no CPU reordering was found.

The test code fragment is below.

inline uint64_t clock_cycles() {
    unsigned int lo = 0;
    unsigned int hi = 0;
    __asm__ __volatile__ (
        "rdtsc" : "=a"(lo), "=d"(hi)
    );
    return ((uint64_t)hi << 32) | lo;
}

uint64_t t1 = clock_cycles();
uint64_t t2 = clock_cycles();
assert(t2 > t1);

Answer 1

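mfence is there to force serialization in the CPU before rdtsc.

Usually you will find cpuid there (which is also a serializing instruction).

Quoting the Intel manuals on using rdtsc makes this clearer: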

Starting with the Intel Pentium processor, most Intel CPUs support out-of-order execution of the code. The purpose is to optimize the penalties due to the different instruction latencies. Unfortunately this feature does not guarantee that the temporal sequence of the single compiled C instructions will respect the sequence of the instruction themselves as written in the source C file. When we call the RDTSC instruction, we pretend that that instruction will be executed exactly at the beginning and at the end of code being measured (i.e., we don’t want to measure compiled code executed outside of the RDTSC calls or executed in between the calls themselves). The solution is to call a serializing instruction before calling the RDTSC one. A serializing instruction is an instruction that forces the CPU to complete every preceding instruction of the C code before continuing the program execution. By doing so we guarantee that only the code that is under measurement will be executed in between the RDTSC calls and that no part of that code will be executed outside the calls.

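TL;DR version - without a serializing instruction before rdtsc, you have no idea when that instruction started to execute, so the measurements may be incorrect.

HINT - use rdtscp when possible.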
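As a rough illustration of the two suggestions above, here is a minimal sketch in C with GCC/Clang inline assembly, assuming an x86-64 target. The helper names tsc_start, tsc_stop and tsc_elapsed are hypothetical and not from the original post. The start side serializes with cpuid before rdtsc; the stop side uses rdtscp, which waits for earlier instructions to execute. Following rdtscp with lfence is my own choice, made so that later instructions do not begin before the counter is read.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: serialize with CPUID, then read the TSC.
   CPUID overwrites EAX, EBX, ECX and EDX, so it has to run before RDTSC. */
static inline uint64_t tsc_start(void) {
    uint32_t lo, hi;
    __asm__ __volatile__ (
        "cpuid\n\t"              /* leaf 0; used only for its serializing effect */
        "rdtsc"                  /* EDX:EAX = time-stamp counter */
        : "=a"(lo), "=d"(hi)
        : "a"(0)
        : "rbx", "rcx", "memory" /* "memory" is a compiler barrier only */
    );
    return ((uint64_t)hi << 32) | lo;
}

/* Hypothetical helper: RDTSCP waits until earlier instructions have executed;
   the LFENCE afterwards (an assumption of this sketch) keeps later
   instructions from starting before the read. */
static inline uint64_t tsc_stop(void) {
    uint32_t lo, hi, aux;
    __asm__ __volatile__ (
        "rdtscp\n\t"             /* EDX:EAX = counter, ECX = IA32_TSC_AUX */
        "lfence"
        : "=a"(lo), "=d"(hi), "=c"(aux)
        :
        : "memory"
    );
    (void)aux;                   /* processor ID value, unused here */
    return ((uint64_t)hi << 32) | lo;
}

/* Assumes both reads were taken on the same core (or on synchronized TSCs). */
static inline uint64_t tsc_elapsed(uint64_t start, uint64_t stop) {
    assert(stop >= start);
    return stop - start;
}
```

The measured region would sit between a call to tsc_start() and a call to tsc_stop(); the result is in TSC ticks and includes the overhead of the serializing instructions themselves.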

Based on my test, cpu reorder is not found.

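There is still no guarantee that it won't happen - that's why the original code had "memory" there: it declares a possible memory clobber, which prevents the compiler from reordering it.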

Answer 2

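What you need to perform a sensible measurement with rdtsc is a serializing instruction.

As is well known, a lot of people use cpuid before rdtsc.
rdtsc needs to be serialized from above and from below (read: all instructions before it must have retired, and it must itself retire before the code under test starts).

Unfortunately the second condition is often neglected, because cpuid is a very bad choice for this task (it clobbers the output of rdtsc).
When looking for alternatives, people think that instructions with "fence" in their names will do, but this is also untrue. Straight from Intel: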

MFENCE does not serialize the instruction stream.

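An instruction that is almost serializing, and that will do in any measurement where previous stores don't need to complete, is lfence.

Simply put, lfence makes sure that no new instruction starts before every prior instruction has completed locally.
It also doesn't drain the store buffer like mfence does, and it doesn't clobber the registers like cpuid does.

So lfence / rdtsc / lfence is a better crafted sequence of instructions than mfence / rdtsc, where mfence is pretty much useless unless you explicitly want the previous stores to be completed before the test begins/ends (but not before rdtsc is executed!).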
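The lfence / rdtsc / lfence sequence could be wrapped roughly as follows. This is only a sketch in C, assuming GCC or Clang on x86-64; fenced_rdtsc, ticks_for and busy_loop are hypothetical names introduced for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: lfence / rdtsc / lfence as one asm statement. */
static inline uint64_t fenced_rdtsc(void) {
    uint32_t lo, hi;
    __asm__ __volatile__ (
        "lfence\n\t"   /* earlier instructions complete locally first */
        "rdtsc\n\t"    /* EDX:EAX = time-stamp counter */
        "lfence"       /* later instructions do not start before the read */
        : "=a"(lo), "=d"(hi)
        :
        : "memory"     /* stops the compiler (not the CPU) from reordering */
    );
    return ((uint64_t)hi << 32) | lo;
}

/* Hypothetical usage: TSC ticks spent in work(arg), fence overhead included. */
static uint64_t ticks_for(void (*work)(void *), void *arg) {
    assert(work != NULL);
    uint64_t start = fenced_rdtsc();
    work(arg);
    uint64_t end = fenced_rdtsc();
    return end - start;
}

/* Example workload: bump a counter 1000 times through a volatile pointer
   so the compiler cannot remove the loop. */
static void busy_loop(void *arg) {
    volatile unsigned *counter = arg;
    for (unsigned i = 0; i < 1000; i++) {
        (*counter)++;
    }
}
```

A call such as ticks_for(busy_loop, &counter) returns a tick count, not core clock cycles, and as noted above it does not wait for earlier stores to drain from the store buffer.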

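If your test to detect reordering is assert(t2 > t1), then I believe you are testing nothing.
Leaving aside the return and the call, which may or may not prevent the CPU from seeing the second rdtsc in time to reorder it, it is unlikely (though possible!) that the CPU will reorder two rdtsc instructions, even when one immediately follows the other.

Imagine we have an rdtsc2 that is exactly like rdtsc but writes ecx:ebx¹.

Executing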

rdtsc
rdtsc2

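it is highly likely that ecx:ebx > edx:eax, because the CPU has no reason to execute rdtsc2 before rdtsc.
Reordering doesn't mean random ordering; it means looking for other instructions to run when the current one cannot be executed.
But rdtsc doesn't depend on any previous instruction, so it is unlikely to be delayed when the OoO core encounters it.
However, peculiar internal micro-architectural details may invalidate my argument, hence the word "likely" in my previous statement.

¹ We don't need this altered instruction: register renaming would take care of it, but in case you are not familiar with that, this will help.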
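To make the point about assert(t2 > t1) concrete, here is a small sketch in C (GCC/Clang, x86-64 assumed) that repeats the back-to-back read many times and reports the smallest difference seen. The names plain_rdtsc and min_back_to_back_delta are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Unfenced read, same shape as clock_cycles() in the question. */
static inline uint64_t plain_rdtsc(void) {
    uint32_t lo, hi;
    __asm__ __volatile__ ("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

/* Hypothetical helper: smallest signed difference t2 - t1 over n
   back-to-back pairs. A result <= 0 would mean at least one pair
   came out non-increasing. */
static int64_t min_back_to_back_delta(unsigned n) {
    assert(n > 0);
    int64_t min_delta = INT64_MAX;
    for (unsigned i = 0; i < n; i++) {
        uint64_t t1 = plain_rdtsc();
        uint64_t t2 = plain_rdtsc();
        /* Unsigned subtraction wraps; the cast reinterprets it as signed
           (two's complement on GCC and Clang). */
        int64_t delta = (int64_t)(t2 - t1);
        if (delta < min_delta) {
            min_delta = delta;
        }
    }
    return min_delta;
}
```

A positive minimum only says that the two reads were not observed out of order in that run. It says nothing about whether rdtsc can be reordered against the surrounding code being measured, which is the case the fences are there for.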

c++

linux

timestamp

x86-64

rdtsc