Understanding the membarrier function in Linux

Question

Consider the example from the Linux man page for membarrier: https://man7.org/linux/man-pages/man2/membarrier.2.html

       #include <stdlib.h>

       static volatile int a, b;

       static void
       fast_path(int *read_b)
       {
           a = 1;
           asm volatile ("mfence" : : : "memory");
           *read_b = b;
       }

       static void
       slow_path(int *read_a)
       {
           b = 1;
           asm volatile ("mfence" : : : "memory");
           *read_a = a;
       }

       int
       main(int argc, char **argv)
       {
           int read_a, read_b;

           /*
            * Real applications would call fast_path() and slow_path()
            * from different threads. Call those from main() to keep
            * this example short.
            */

           slow_path(&read_a);
           fast_path(&read_b);

           /*
            * read_b == 0 implies read_a == 1 and
            * read_a == 0 implies read_b == 1.
            */

           if (read_b == 0 && read_a == 0)
               abort();

           exit(EXIT_SUCCESS);
       }

The code above, converted to use membarrier(), becomes:

       #define _GNU_SOURCE
       #include <stdlib.h>
       #include <stdio.h>
       #include <unistd.h>
       #include <sys/syscall.h>
       #include <linux/membarrier.h>

       static volatile int a, b;

       static int
       membarrier(int cmd, unsigned int flags, int cpu_id)
       {
           return syscall(__NR_membarrier, cmd, flags, cpu_id);
       }

       static int
       init_membarrier(void)
       {
           int ret;

           /* Check that membarrier() is supported. */

           ret = membarrier(MEMBARRIER_CMD_QUERY, 0, 0);
           if (ret < 0) {
               perror("membarrier");
               return -1;
           }

           if (!(ret & MEMBARRIER_CMD_GLOBAL)) {
               fprintf(stderr,
                   "membarrier does not support MEMBARRIER_CMD_GLOBAL\n");
               return -1;
           }

           return 0;
       }

       static void
       fast_path(int *read_b)
       {
           a = 1;
           asm volatile ("" : : : "memory");
           *read_b = b;
       }

       static void
       slow_path(int *read_a)
       {
           b = 1;
           membarrier(MEMBARRIER_CMD_GLOBAL, 0, 0);
           *read_a = a;
       }

       int
       main(int argc, char **argv)
       {
           int read_a, read_b;

           if (init_membarrier())
               exit(EXIT_FAILURE);

           /*
            * Real applications would call fast_path() and slow_path()
            * from different threads. Call those from main() to keep
            * this example short.
            */

           slow_path(&read_a);
           fast_path(&read_b);

           /*
            * read_b == 0 implies read_a == 1 and
            * read_a == 0 implies read_b == 1.
            */

           if (read_b == 0 && read_a == 0)
               abort();

           exit(EXIT_SUCCESS);
       }

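This description of membarrier is taken from the Linux man page. I am still confused about how membarrier adds overhead to the slow side and removes overhead from the fast side, and how that results in an overall performance increase as long as the slow side is infrequent enough that the overhead of the membarrier() calls does not outweigh the performance gain on the fast side.

Could you please describe it in more detail?

Thanks!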

Answer 1

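This pair of write-then-read-the-other-variable sequences is the litmus test from https://preshing.com/20120515/memory-reordering-caught-in-the-act/, a demo of StoreLoad reordering (the only kind x86 allows, given its memory model of program order plus a store buffer with store forwarding).

With only a local MFENCE in one of them, you can still get reordering: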

   FAST                      using just mfence, not membarrier
a = 1 exec
read_b = b;  // 0
                             b = 1;
                             mfence   (force b=1 to be visible before reading a)
                             read_a = a;   // 0
a = 1 visible (global vis. delayed by store buffer)
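
The timeline above can be probed with a small litmus test. This is only a sketch, assuming a POSIX threads environment and C11 atomics; the names count_both_zero, start_gate and end_gate are made up for illustration and are not from the man page. The FAST thread has only a compiler barrier, the SLOW thread has a full local fence, and the function counts how often both sides read 0. Whether that outcome shows up in a given run depends on the hardware and on timing, so no particular count should be expected.

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

static atomic_int a, b;
static int read_a, read_b;               /* written by workers, read after end_gate */
static pthread_barrier_t start_gate, end_gate;
static int rounds;

/* FAST side: store a, then load b, separated only by a compiler barrier. */
static void *fast_thread(void *arg)
{
    (void)arg;
    for (int i = 0; i < rounds; i++) {
        pthread_barrier_wait(&start_gate);
        atomic_store_explicit(&a, 1, memory_order_relaxed);
        atomic_signal_fence(memory_order_seq_cst);   /* no hardware fence */
        read_b = atomic_load_explicit(&b, memory_order_relaxed);
        pthread_barrier_wait(&end_gate);
    }
    return NULL;
}

/* SLOW side: store b, full local fence, then load a. */
static void *slow_thread(void *arg)
{
    (void)arg;
    for (int i = 0; i < rounds; i++) {
        pthread_barrier_wait(&start_gate);
        atomic_store_explicit(&b, 1, memory_order_relaxed);
        atomic_thread_fence(memory_order_seq_cst);   /* local full barrier */
        read_a = atomic_load_explicit(&a, memory_order_relaxed);
        pthread_barrier_wait(&end_gate);
    }
    return NULL;
}

/* Runs the litmus test and returns how many rounds saw read_a == 0 && read_b == 0. */
long count_both_zero(int iterations)
{
    pthread_t fast, slow;
    long hits = 0;

    rounds = iterations;
    pthread_barrier_init(&start_gate, NULL, 3);   /* two workers + this thread */
    pthread_barrier_init(&end_gate, NULL, 3);
    pthread_create(&fast, NULL, fast_thread, NULL);
    pthread_create(&slow, NULL, slow_thread, NULL);

    for (int i = 0; i < iterations; i++) {
        atomic_store(&a, 0);                      /* reset before releasing workers */
        atomic_store(&b, 0);
        pthread_barrier_wait(&start_gate);
        pthread_barrier_wait(&end_gate);          /* both results are now published */
        if (read_a == 0 && read_b == 0)
            hits++;
    }

    pthread_join(fast, NULL);
    pthread_join(slow, NULL);
    pthread_barrier_destroy(&start_gate);
    pthread_barrier_destroy(&end_gate);
    return hits;
}
```

Calling count_both_zero(100000) from main() and printing the result is enough to try it; a non-zero count means the (0, 0) outcome from the timeline was observed.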

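But consider what would happen if an mfence on every core had to be part of every possible order, somewhere between the slow path's store and its reload.

This ordering would no longer be possible. If read_b = b has already read 0, then a = 1 is already pending ¹ (if it is not visible already). It cannot stay private until after read_a = a, because membarrier() makes sure a full barrier runs on every core, and SLOW waits for that to happen (for membarrier to return) before reading a.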

And there is no way to get 0,0 by having SLOW execute first: it runs membarrier itself, so its store is definitely visible to other threads before it reads a.
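
On the performance side of the question, the trade can be written down as simple arithmetic. The sketch below is only a cost model under stated assumptions: the compiler barrier left in the fast path is treated as free, every cost is an average in nanoseconds, and the function names and any numbers you feed in are hypothetical, not measurements.

```c
/*
 * Net time saved (in ns) by moving the barrier cost to the slow side.
 * Assumption: before the change both paths pay fence_ns per call;
 * after it, the fast path pays nothing and the slow path pays
 * membarrier_ns per call instead of fence_ns.
 */
double net_saving_ns(double fast_calls, double slow_calls,
                     double fence_ns, double membarrier_ns)
{
    double saved = fast_calls * fence_ns;                    /* fences removed */
    double added = slow_calls * (membarrier_ns - fence_ns);  /* extra slow-side cost */
    return saved - added;
}

/*
 * Number of fast-path calls per slow-path call at which the two
 * designs cost the same; above this ratio membarrier() wins.
 */
double break_even_ratio(double fence_ns, double membarrier_ns)
{
    return (membarrier_ns - fence_ns) / fence_ns;
}
```

For example, with a hypothetical 20 ns fence and a hypothetical 2000 ns membarrier() call, the fast path has to run more than 99 times per slow-path call before the conversion pays off. The real costs vary a lot by machine and by membarrier command, so measure before relying on a ratio like this.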

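Footnote 1: Waiting to execute, or waiting in the store buffer to commit to L1d cache. The asm("":::"memory") ensures this, but it is actually redundant, because volatile by itself guarantees that the accesses happen in the asm in program order. And we basically need volatile for other reasons when hand-rolling atomics instead of using C11 _Atomic. (But generally don't do that unless you are actually writing kernel code. Use atomic_store_explicit(&a, 1, memory_order_release);.)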
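
As a rough illustration of the footnote's advice, here is the man-page example reworked with C11 _Atomic instead of volatile. It is a sketch, not a vetted implementation: it uses relaxed accesses with atomic_signal_fence as the compiler-only barrier in the fast path, and the out-parameter plus status return on slow_path is a choice made here for illustration. The C11 memory model knows nothing about membarrier(), so the pairing relies on how compilers and the kernel behave in practice rather than on a guarantee from the language standard.

```c
#define _GNU_SOURCE
#include <stdatomic.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/membarrier.h>

static atomic_int a, b;

/* Thin wrapper around the raw system call, as in the man page. */
static int membarrier(int cmd, unsigned int flags, int cpu_id)
{
    return (int)syscall(__NR_membarrier, cmd, flags, cpu_id);
}

/* Returns 1 if MEMBARRIER_CMD_GLOBAL is usable on this kernel, else 0. */
static int membarrier_global_supported(void)
{
    int mask = membarrier(MEMBARRIER_CMD_QUERY, 0, 0);
    return mask >= 0 && (mask & MEMBARRIER_CMD_GLOBAL) != 0;
}

/* Fast path: no hardware fence, only a compiler barrier. */
static int fast_path(void)
{
    atomic_store_explicit(&a, 1, memory_order_relaxed);
    atomic_signal_fence(memory_order_seq_cst);   /* blocks compiler reordering only */
    return atomic_load_explicit(&b, memory_order_relaxed);
}

/*
 * Slow path: store b, ask the kernel to run a full barrier on every
 * core, then read a. Returns 0 on success, or -1 if membarrier()
 * failed, in which case *read_a is left untouched because the ordering
 * guarantee does not hold.
 */
static int slow_path(int *read_a)
{
    atomic_store_explicit(&b, 1, memory_order_relaxed);
    if (membarrier(MEMBARRIER_CMD_GLOBAL, 0, 0) != 0)
        return -1;
    *read_a = atomic_load_explicit(&a, memory_order_relaxed);
    return 0;
}
```

A caller would check membarrier_global_supported() once at startup, then run fast_path() on the hot threads and slow_path() on the rare one. On a kernel without MEMBARRIER_CMD_GLOBAL, both sides would have to go back to a real fence such as atomic_thread_fence(memory_order_seq_cst).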

Note that it is actually the store buffer that creates StoreLoad reordering (the only kind x86 allows). In fact, a store buffer also lets x86 execute stores out-of-order and then make them globally visible in program order (!).

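Also note that in-order CPUs can do memory accesses out of order. They start instructions (including loads) in order, but can let them complete out of order, e.g. by scoreboarding loads to allow hit-under-miss.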

c

linux

x86

memory-barriers