Is a memory barrier or atomic operation required in a busy-wait loop?

Question

Consider the following spin_lock() implementation, originally from this answer:

void spin_lock(volatile bool* lock) {
    for (;;) {
        // inserts an acquire memory barrier and a compiler barrier
        if (!__atomic_test_and_set(lock, __ATOMIC_ACQUIRE))
            return;

        while (*lock)  // no barriers; is it OK?
            cpu_relax();
    }
}

What I already know:

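volatile prevents the compiler from optimizing away the re-read of *lock on each iteration of the while loop;
volatile inserts neither memory nor compiler barriers;
such an implementation actually works in GCC for x86 (e.g. in the Linux kernel) and some other architectures;
at least one memory barrier and one compiler barrier are required in a spin_lock() implementation for a generic architecture; this example inserts them in __atomic_test_and_set().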

Questions:

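1. Is volatile enough here, or are there any architectures or compilers where a memory barrier, a compiler barrier, or an atomic operation is required in the while loop?

1.1 According to the C++ standard?

1.2 In practice, for known architectures and compilers, specifically for GCC and the platforms it supports?
2. Is this implementation safe on all architectures supported by GCC and Linux? (It is at least inefficient on some architectures, right?)
3. Is the while loop safe according to C++11 and its memory model?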

There are several related questions, but I could not construct an explicit and unambiguous answer from them:

Q: Memory barrier in a single thread

In principle: Yes, if program execution moves from one core to the next, it might not see all writes that occurred on the previous core.
Q: memory barrier and cache flush

On pretty much all modern architectures, caches (like the L1 and L2 caches) are ensured coherent by hardware. There is no need to flush any cache to make memory visible to other CPUs.
Q: Is my spin lock implementation correct and optimal?
Q: Do spin locks always require a memory barrier? Is spinning on a memory barrier expensive?
Q: Do you expect that future CPU generations are not cache coherent?

Answer 1

From the Wikipedia page on memory barriers:

... Other architectures, such as the Itanium, provide separate "acquire" and "release" memory barriers which address the visibility of read-after-write operations from the point of view of a reader (sink) or writer (source) respectively.

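To me this implies that Itanium requires a suitable fence to make reads/writes visible to other processors, but this may in fact only be for the purposes of ordering. The question, I think, really boils down to: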

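Is there an architecture where a processor might never update its local cache unless instructed to do so? I don't know the answer, but if you posed the question in this form then someone else might. On such an architecture your code could go into an infinite loop in which the read of *lock always sees the same value.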

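In terms of general C++ legality, the one atomic test-and-set in your example isn't enough. It implements only a single fence, which lets you see the initial state of *lock when entering the while loop but not when it changes. Reading a variable that another thread changes without synchronization is undefined behavior, so the answer to your question (1.1/3) is no.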

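On the other hand, in practice, the answer to (1.2/2) is yes (given GCC's volatile semantics), so long as the architecture guarantees cache coherence without explicit memory fences. That is true of x86 and probably of many other architectures, but I can't give a definite answer on whether it is true for all architectures that GCC supports. It is, however, generally unwise to knowingly rely on a particular behavior of code that is technically undefined according to the language spec, especially if you can get the same result without doing so.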

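Incidentally, given that memory_order_relaxed exists, there seems little reason not to use it here rather than trying to hand-optimize with a non-atomic read, i.e. changing the while loop in your example to: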

    while (atomic_load_explicit(lock, memory_order_relaxed)) {
        cpu_relax();
    }

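On x86_64, for instance, the atomic load becomes a regular mov instruction, and the optimized assembly output is essentially the same as for the original example.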
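To make the suggestion concrete, here is a minimal sketch of the same test-and-test-and-set loop written with std::atomic<bool> instead of volatile bool* and the GCC builtin. The names SpinLock and run_spinlock_demo are made up for illustration, and cpu_relax() is only a stand-in that yields; a real one would likely emit a CPU-specific pause hint. The unlock() side is not shown in the question, so the release store here is an assumption about how it would be paired.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Hypothetical stand-in for the question's cpu_relax().
inline void cpu_relax() { std::this_thread::yield(); }

class SpinLock {
public:
    void lock() {
        for (;;) {
            // Atomic exchange with acquire ordering: if the previous value
            // was false, this thread now owns the lock.
            if (!locked_.exchange(true, std::memory_order_acquire))
                return;
            // Wait with relaxed atomic loads: no barrier is emitted,
            // but the read is no longer a data race.
            while (locked_.load(std::memory_order_relaxed))
                cpu_relax();
        }
    }

    void unlock() {
        // Release ordering publishes the critical section's writes
        // to the next thread that acquires the lock.
        locked_.store(false, std::memory_order_release);
    }

private:
    std::atomic<bool> locked_{false};
};

// Hypothetical demo: several threads increment a plain int under the lock
// and the final count is returned.
int run_spinlock_demo(int threads, int iterations) {
    SpinLock lock;
    int counter = 0;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&] {
            for (int i = 0; i < iterations; ++i) {
                lock.lock();
                ++counter;
                lock.unlock();
            }
        });
    }
    for (auto& th : pool)
        th.join();
    return counter;
}
```

Calling run_spinlock_demo(4, 10000) should return 40000 if the lock provides mutual exclusion. The inner loop stays barrier-free, which is the point of the original hand optimization, but it is now well-defined under the C++11 memory model.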

Answer 2

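This is important: in C++, volatile has nothing at all to do with concurrency! The purpose of volatile is to tell the compiler that it must not optimize accesses to the affected object. It does not tell the CPU anything, primarily because the CPU already knows whether the memory is volatile or not. The purpose of volatile is effectively to deal with memory-mapped I/O.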

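The C++ standard is very clear in section 1.10 [intro.multithread] that unsynchronized access to an object which is modified in one thread and accessed (modified or read) in another thread is undefined behavior. The synchronization primitives that avoid undefined behavior are library components such as the atomic classes or mutexes. This clause mentions volatile only in the context of signals (i.e., as volatile sig_atomic_t) and in the context of forward progress (i.e., that a thread will eventually do something with an observable effect, such as accessing a volatile object or doing I/O). There is no mention of volatile in conjunction with synchronization.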

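Thus, unsynchronized access to a variable shared across threads leads to undefined behavior. Whether it is declared volatile or not makes no difference to this undefined behavior.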
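For contrast, here is a small sketch of the signal context mentioned above, where volatile is the right tool: a flag of type volatile std::sig_atomic_t written by a signal handler and read by the thread that was interrupted. The names g_signal_seen, on_signal and run_signal_demo are made up for illustration, and SIGINT is an arbitrary choice of signal.

```cpp
#include <csignal>

// Written by the handler, read by the interrupted thread.
// volatile sig_atomic_t is what the standard asks for in this case.
volatile std::sig_atomic_t g_signal_seen = 0;

// Signal handlers should have C linkage and do very little.
extern "C" void on_signal(int) {
    g_signal_seen = 1;
}

// Hypothetical demo: install the handler, raise the signal in this same
// thread, restore the default handler, and report what the flag holds.
int run_signal_demo() {
    g_signal_seen = 0;
    std::signal(SIGINT, on_signal);
    std::raise(SIGINT);  // the handler runs before raise() returns
    std::signal(SIGINT, SIG_DFL);
    return g_signal_seen;
}
```

Handler and reader run on the same thread here, so no inter-thread synchronization is involved. A flag shared between two threads would need std::atomic or a mutex instead.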

Answer 3

Is volatile enough here or are there any architectures or compilers where memory or compiler barrier or atomic operation is required in the while loop?

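Will the volatile code see the change? Yes, but not necessarily as quickly as if there were a memory barrier. At some point some form of synchronization will occur and the new state will be read from the variable, but there are no guarantees about how much has happened elsewhere in the code by then.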

1.1 According to C++ standards?

From cppreference: memory_order

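The memory model and memory order define the generalized hardware that the code needs to work on. For a message to pass between threads of execution, an inter-thread happens-before relationship needs to exist. This requires one of the following: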

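A synchronizes-with B
A is dependency-ordered before B
A synchronizes-with B indirectly (through X)
A is sequenced-before X, and X inter-thread happens-before B
A inter-thread happens-before X, and X inter-thread happens-before B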

As you are not establishing any of these cases, there will be forms of your program that may fail on some current hardware.
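As an illustration of the first case in the list above, here is a minimal sketch of release/acquire message passing. The release store to ready synchronizes-with the acquire load that reads true, so the plain write to payload happens-before the read in the consumer. The names run_message_passing_demo, payload, ready and observed are made up for illustration.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical demo: pass one int from a producer thread to a consumer
// thread, using an atomic flag as the synchronization point.
int run_message_passing_demo() {
    int payload = 0;                 // plain, non-atomic data
    std::atomic<bool> ready{false};  // the flag that carries the ordering
    int observed = -1;

    std::thread consumer([&] {
        // Acquire load: once it reads true, it synchronizes-with the
        // producer's release store.
        while (!ready.load(std::memory_order_acquire))
            std::this_thread::yield();
        observed = payload;  // ordered after the producer's write
        assert(observed == 42);
    });

    std::thread producer([&] {
        payload = 42;  // sequenced-before the store below
        ready.store(true, std::memory_order_release);
    });

    producer.join();
    consumer.join();
    return observed;
}
```

If the flag were a volatile bool instead, the loop might still terminate on common hardware, but the read of payload would be a data race and the standard would make no guarantee about the value seen.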

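In practice, the end of a time slice will cause the memory to become coherent, or any form of barrier on the non-spinlock thread will ensure that the caches are flushed.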

I am not sure what causes the volatile read to get the "current value".

1.2 In practice, for known architectures and compilers, specifically for GCC and platforms it supports?

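As the code is not consistent with the generalized CPU that C++11 describes, it is likely to fail with C++ implementations that try to adhere to the standard.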

From cppreference: const volatile qualifiers

Volatile access stops optimizations from moving work from before it to after it, and from after it to before it.

"This makes volatile objects suitable for communication with a signal handler, but not with another thread of execution"

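So an implementation has to ensure that values are read from the memory location rather than from any local copy. But it does not have to ensure that a volatile write is flushed through the caches to produce a coherent view across all CPUs. In this sense, there is no time bound on how long after a write to a volatile variable it becomes visible to another thread.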

Also see kernel.org: why volatile is nearly always wrong in kernel

Is this implementation safe on all architectures supported by GCC and Linux? (It is at least inefficient on some architectures, right?)

There is no guarantee that the volatile write ever escapes the thread that sets it. So it is not really safe. On Linux it may be safe.

Is the while loop safe according to C++11 and its memory model?

No, because it doesn't create any of the inter-thread messaging primitives.

c++

multithreading

gcc

spinlock

memory-barriers