为什么编译器在循环中从内存中加载这个指针

Question

我正在尝试确定什么开销 std::atomic 引入了我的系统（八核 x64）上的无条件内存写入。这是我的基准程序：

#include <atomic>
#include <iostream>
#include <omp.h>

int main() {
    std::atomic_int foo(0); // VERSION 1
    //volatile int foo = 0; // VERSION 2

    #pragma omp parallel
    for (unsigned int i = 0; i < 10000000; ++i) {
        foo.store(i, std::memory_order_relaxed); // VERSION 1
        //foo = i; // VERSION 2
    }

    std::cout << foo << std::endl;
}

程序原样将对 std::atomic_int 进行基准测试，注释标记为 VERSION 1 的行并取消注释标记为 VERSION 2 的行将在其位置测试 volatile int。即使不同步，两个程序的输出也应该是 10000000 - 1.

这是我的命令行：

g++ -O2 -std=c++11 -fopenmp test.c++

使用 atomic_int 的版本在我的系统上需要两到三秒，而使用 volatile int 的版本几乎总是在不到十分之一秒内完成。

汇编中的显着差异是（diff --side-by-side 的输出）：

volatile int                        atomic_int
.L2:                                .L2:
    mov DWORD PTR [rdi], eax          | mov rdx, QWORD PTR [rdi]
                                      > mov DWORD PTR [rdx], eax
    add eax, 1                          add eax, 1
    cmp eax, 10000000                   cmp eax, 10000000
    jne .L2                             jne .L2
    rep ret                             rep ret

rdi 是这个函数的第一个参数，它并行地得到运行（它在函数的任何地方都没有被修改），它显然是一个指针（指向，在第二列）整数 foo。我不相信这个额外的 mov 是 atomic_int.

的原子性保证不可或缺的一部分

额外的 mov 确实是 atomic_int 减速的根源；将它移到 L2 上方允许两个版本实现相同的性能并且都输出正确的数字。

当 foo 成为全局变量时，atomic_int 获得与 volatile int 相同的性能提升。

我的问题是：为什么编译器在堆栈分配 atomic_int 的情况下将指针传递给指针，而在全局 atomic_int 或堆栈分配的情况下仅传递指针已分配volatile int；为什么要在循环的每次迭代中加载该指针，因为它（我相信）是循环不变的代码；我可以对 C++ 源代码进行哪些更改以使 atomic_int 与此基准测试中的 volatile int 匹配？

更新

运行这个程序：

#include <atomic>
#include <iostream>
#include <thread>

//using T = volatile int; // VERSION 1
using T = std::atomic_int; // VERSION 2

void foo(T* ptr) {
    for (unsigned int i = 0; i < 10000000; ++i) {
        //*ptr = i; // VERSION 1
        ptr->store(i, std::memory_order_relaxed); // VERSION2
    }
}

int main() {
    T i { 0 };

    std::thread threads[4];

    for (auto& x : threads)
        x = std::move(std::thread { foo, &i });

    for (auto& x : threads)
        x.join();

    std::cout << i << std::endl;
}

为版本 1 和 2 产生相同的改进性能，这让我相信这是 OpenMP 的一个特性，它迫使 atomic_int 性能更差。 OpenMP 是正确的，还是生成了次优代码？

Answer 1

I do not believe that this extra mov is integral to the atomicity guarantee of atomic_int.

OpenMP 似乎有不同的想法。具有 OpenMP 原子性的易失性代码：

#include <atomic>
#include <iostream>
#include <omp.h>

int main() {
    volatile int foo = 0; // VERSION 2

    #pragma omp parallel
    for (unsigned int i = 0; i < 10000000; ++i) {
        #pragma omp atomic write
        foo = i; // VERSION 2
    }
    std::cout << foo << std::endl;
}

汇编输出：

.L2:
        movq    (%rdi), %rdx
        movl    %eax, (%rdx)
        addl    , %eax
        cmpl    000000, %eax
        jne     .L2
        ret

Answer 2

如果您查看程序的中间表示（-fdump-tree-all 是您的朋友）而不是汇编输出，事情会变得更容易理解。

Why is the compiler passing a pointer to a pointer in the case of a stack-allocated atomic_int but only a pointer in the case of global atomic_int or stack-allocated volatile int;

这是一个实现细节。 GCC 通过将并行区域概述为单独的函数来转换它们，然后接收一个包含所有共享变量的结构作为它们的唯一参数，还有 firstprivate 的初始值和 lastprivate 变量的最终值的占位符。当 foo 只是一个整数并且不存在隐式或显式 flush 区域时，编译器将其副本在参数中传递给概述的函数：

struct omp_data_s
{
   int foo;
};

void main._omp_fn.0(struct omp_data_s *omp_data_i)
{
   ...
   omp_data_i->foo = i;
   ...
}

int main() {
  volatile int foo = 0;

  struct omp_data_s omp_data_o;
  omp_data_o.foo = foo;

  GOMP_parallel(main._omp_fn.0, &omp_data_o, 0, 0);

  foo = omp_data_o.foo;
  ...
}

omp_data_i 通过 rdi 传递（根据 x86-64 ABI）并且 omp_data_i->foo = i; 编译为简单的 movl %rax, %(rdi)（假定存储 i在 rax) 中，因为 foo 是结构的第一个（也是唯一的）元素。

当foo为std::atomic_int时，它不再是一个整数，而是一个包裹整数值的结构体。在那种情况下，GCC 在参数结构中传递一个指针而不是值本身：

struct omp_data_s
{
   struct atomic_int *foo;
};

void main._omp_fn.0(struct omp_data_s *omp_data_i)
{
   ...
   __atomic_store_4(&omp_data_i->foo._M_i, i, 0);
   ...
}

int main() {
  struct atomic_int foo;

  struct omp_data_s omp_data_o;
  omp_data_o.foo = &foo;

  GOMP_parallel(main._omp_fn.0, &omp_data_o, 0, 0);

  ...
}

在这种情况下，额外的汇编指令 (movq %(rdi), %rdx) 是第一个指针（指向 OpenMP 数据结构）的取消引用，第二个是原子写入（在 x86-64 上只是一家商店）。

当 foo 是全局的时，它不会作为参数结构的一部分传递给概述的代码。在这种特殊情况下，代码接收到 NULL 指针，因为参数结构为空。

void main._omp_fn.0(void *omp_data_i)
{
   ...
   __atomic_store_4(&foo._M_i, i, 0);
   ...
}

why is it loading that pointer on every iteration of the loop since it is (I believe) loop-invariant code;

指针参数本身（rdi 的值）是循环不变的，但指向的值可能会在函数外发生变化，因为 foo 是一个共享变量。实际上，GCC 将 shared 的 OpenMP 数据共享 class 的所有变量视为 volatile。同样，这是一个实现细节，因为 OpenMP 标准允许一个宽松的一致性内存模型，在该模型中写入共享变量不会在其他线程中可见，除非 flush 构造在 writer 和 reader. GCC 实际上是在利用这种宽松的一致性来优化代码，方法是传递一些共享变量的副本而不是指向原始变量的指针（从而节省一次取消引用）。如果您的代码中有一个 flush 区域，则显式

foo = i;
#pragma omp flush(foo)

或隐式

#pragma omp atomic write
foo = i;

GCC 会传递一个指向 foo 的指针，而不是在另一个答案中看到的那样。原因是 flush 构造将线程的内存视图与全局视图同步，其中共享的 foo 引用原始变量（因此指向它的指针而不是副本）。

and what changes to the C++ source can I make to have atomic_int match volatile int in this benchmark?

除了切换到不同的编译器之外，我想不出任何 可移植 的改变。 GCC 将结构类型（std::atomic 是一个结构）的共享变量作为指针传递，仅此而已。

Is OpenMP correct, or is it generating suboptimal code?

OpenMP 是正确的。它是一个 multiplaform 规范，它定义了 GCC 遵循的特定（并且故意广泛的）内存和操作语义。对于特定平台上的特定情况，它可能并不总能为您提供最佳性能，但代码是可移植的，并且通过添加单个 pragma 从串行到并行相对容易。

当然，GCC 人员当然可以学习如何更好地优化 - 英特尔 C++ 编译器已经做到了：

                            # LOE rdx ecx
..B1.14:                    # Preds ..B1.15 ..B1.13
    movl      %ecx, %eax                                #13.13
    movl      %eax, (%rdx)                              #13.13
                            # LOE rdx ecx
..B1.15:                    # Preds ..B1.14
    incl      %ecx                                      #12.46
    cmpl      000000, %ecx                           #12.34
    jb        ..B1.14       # Prob 99%                  #12.34

为什么编译器在循环中从内存中加载这个指针

Why is the compiler loading this pointer from memory in a loop

c++

performance

multithreading

atomic

openmp

更新