Shared Memory 的 atomicAdd with int 和 float 有不同 SASS

Question

我遇到了一个性能问题，在使用 nv-nsight-cu-cli 进行分析后，float 上的共享内存 atomicAdd 比 int 上的要昂贵得多。查看生成的SASS，发现共享内存的atomicAdd在float和int上生成的SASS根本不一样。

这里我用最小的cuda代码展示了一个例子：

$ cat test.cu                                                                                                                                                                                                                                                   
__global__ void testAtomicInt() {
    __shared__ int SM_INT;
    SM_INT = 0;
    __syncthreads();
    atomicAdd(&(SM_INT), ((int)1));
}

__global__ void testAtomicFloat() {
    __shared__ float SM_FLOAT;
    SM_FLOAT = 0.0;
    __syncthreads();
    atomicAdd(&(SM_FLOAT), ((float)1.1));
}

$ nvcc -arch=sm_86 -c test.cu 
$ cuobjdump -sass test.o                                                                                                                                                                                                                                        
Fatbin elf code:
================
arch = sm_86
code version = [1,7]
producer = <unknown>
host = linux
compile_size = 64bit

    code for sm_86
        Function : _Z15testAtomicFloatv
    .headerflags    @"EF_CUDA_SM86 EF_CUDA_PTX_SM(EF_CUDA_SM86)"
        /*0000*/                   MOV R1, c[0x0][0x28] ;                  /* 0x00000a0000017a02 */
                                                                           /* 0x000fc40000000f00 */
        /*0010*/                   STS [RZ], RZ ;                          /* 0x000000ffff007388 */
                                                                           /* 0x000fe80000000800 */
        /*0020*/                   BAR.SYNC 0x0 ;                          /* 0x0000000000007b1d */
                                                                           /* 0x000fec0000000000 */
        /*0030*/                   LDS R2, [RZ] ;                          /* 0x00000000ff027984 */
                                                                           /* 0x000e240000000800 */
        /*0040*/                   FADD R3, R2, 1.1000000238418579102 ;    /* 0x3f8ccccd02037421 */
                                                                           /* 0x001fcc0000000000 */
        /*0050*/                   ATOMS.CAST.SPIN R3, [RZ], R2, R3 ;      /* 0x00000002ff03738d */
                                                                           /* 0x000e240001800003 */
        /*0060*/                   ISETP.EQ.U32.AND P0, PT, R3, 0x1, PT ;  /* 0x000000010300780c */
                                                                           /* 0x001fda0003f02070 */
        /*0070*/              @!P0 BRA 0x30 ;                              /* 0xffffffb000008947 */
                                                                           /* 0x000fea000383ffff */
        /*0080*/                   EXIT ;                                  /* 0x000000000000794d */
                                                                           /* 0x000fea0003800000 */
        /*0090*/                   BRA 0x90;                               /* 0xfffffff000007947 */
                                                                           /* 0x000fc0000383ffff */
        /*00a0*/                   NOP;                                    /* 0x0000000000007918 */
        ..........


        Function : _Z13testAtomicIntv
    .headerflags    @"EF_CUDA_SM86 EF_CUDA_PTX_SM(EF_CUDA_SM86)"
        /*0000*/                   MOV R1, c[0x0][0x28] ;         /* 0x00000a0000017a02 */
                                                                  /* 0x000fc40000000f00 */
        /*0010*/                   STS [RZ], RZ ;                 /* 0x000000ffff007388 */
                                                                  /* 0x000fe80000000800 */
        /*0020*/                   BAR.SYNC 0x0 ;                 /* 0x0000000000007b1d */
                                                                  /* 0x000fec0000000000 */
        /*0030*/                   ATOMS.POPC.INC.32 RZ, [URZ] ;  /* 0x00000000ffff7f8c */
                                                                  /* 0x000fe2000d00003f */
        /*0040*/                   EXIT ;                         /* 0x000000000000794d */
                                                                  /* 0x000fea0003800000 */
        /*0050*/                   BRA 0x50;                      /* 0xfffffff000007947 */
                                                                  /* 0x000fc0000383ffff */
        /*0060*/                   NOP;                           /* 0x0000000000007918 */
        ..........



Fatbin ptx code:
================
arch = sm_86
code version = [7,5]
producer = <unknown>
host = linux
compile_size = 64bit
compressed

从上面生成的SASS代码中，我们可以清楚的得到，共享内存的atomicAdd在int上生成单个轻量级ATOMS.POPC.INC.32 RZ, [URZ]，而在float生成了一堆SASS和一个重量级的ATOMS.CAST.SPIN R3, [RZ], R2, R3 .

CUDA Binary Utilities 没有告诉我 CAST 或 SPIN 的含义。但是，我猜这意味着共享内存地址上的独占自旋锁。（纠正我，如果我猜错了。）在我的真实代码中，int的atomicAdd的SASS的none有一个热点。但是，此 ATOMS.CAST.SPIN 比 float 的 atomicAdd 生成的其他 SASS 代码要热得多。

此外，我使用编译器标志 -arch=sm_86、-arch=sm_80 和 -arch=sm_75 进行了测试。在那些CC下，float的atomicAdd生成的SPSS代码非常相似。另一个事实是，double 的 atomicAdd 生成的 SPSS 与 float.

相似

这个观察比问题更让我困惑。我会根据我的分析经验提出一些简单的问题，希望我们能进行愉快的讨论。

ATOMS.CAST.SPIN 到底是做什么的？我所知道的唯一 SASS 文档是 CUDA Binary Utilities.
为什么 float 的 atomicAdd 会生成更多 SASS 代码并且比 int 做更多的工作？我知道这是一个普遍的问题，很难回答。也许 ATOMS.POPC.INC 根本不适用于数据类型 float 或 double?
如果更多的共享内存加载和存储冲突更容易受到攻击，因此 更多的停顿时间 atomicAdd 的 float 比 atomicAdd 的 int？前者显然有更多的指令要执行和不同的分支。我的项目中有以下代码片段，其中两个函数的函数调用次数相同。但是，float 的 atomicAdd 会造成运行时瓶颈，而 int 不会。

atomicAdd(&(SM_INT), ((int)1));  // no hotspot
atomicAdd(&(SM_FLOAT), ((float)1.1)); // a hotspot

Answer 1

我可能无法提供解决所有可能问题的答案。 CUDA SASS 确实没有记录到解释这些东西的水平。

What does exactly ATOMS.CAST.SPIN do? The only SASS document I am aware of is the CUDA Binary Utilities.

ATOMS.CAST.SPIN
^^^^^ ^^^     
   ||   |  
   ||   compare and swap
   |shared
   atomic

programming guide gives an indication of how one can implement an "arbitrary" atomic operation, using atomic CAS (Compare And Swap). You should first familiarize yourself with how atomic CAS works.

关于“任意原子”示例，需要注意的是，它显然可以用于提供原子操作，例如“本机”原子指令不支持的数据类型，例如原子添加。另一件需要注意的事情是，它本质上是一个围绕原子 CAS 指令的循环，循环检查操作是否“成功”。如果它“不成功”，则循环继续。如果“成功”，则循环退出。

这实际上就是我们在您的 float 示例中的 SASS 代码中看到的内容：

/*0030*/  LDS R2, [RZ] ;  // get the current value in the location
FADD R3, R2, 1.1000000238418579102 ; // perform ordinary floating-point add
ATOMS.CAST.SPIN R3, [RZ], R2, R3 ;  //  attempt to atomically replace the result in the location
ISETP.EQ.U32.AND P0, PT, R3, 0x1, PT ; // check if replacement was successful
@!P0 BRA 0x30   // if not, loop and try again

这些基本上是编程指南中“任意原子”示例中概述的步骤。基于此，我得出以下结论：

您为其编译的体系结构实际上没有您请求类型的“本机”原子操作
您请求的原子操作可以使用循环的方式完成
编译器工具链（通常是 ptxas，但也可能是 JIT 系统）作为一项便利功能，会自动为您实现此循环方法，而不是抛出编译错误

Why should the atomicAdd of float generates more SASS code and does more work than it on int?

显然，您正在编译的体系结构没有 float 的原子添加的“本机”实现，因此编译器工具链已选择为您实现此循环方法。由于循环有效地涉及 success/failure 的可能性，它将确定此循环是否继续，并且 success/failure 取决于其他线程的行为（执行原子的争用），循环方法可能会做更多的“工作”比本地单指令会。

If it is more vulnerable to have more shared memory load and store conflict and thus more stall time for the atomicAdd of float than the atomicAdd of int?

是的，我个人认为原生原子方法更高效，而循环方法可能效率较低，这可以在分析器中以多种方式表达，例如 warp stalls。

编辑：

在一种 GPU 架构中可能 implemented/available 而在另一种架构中则不然。这当然适用于原子，如果您阅读编程指南中先前链接的关于原子的部分，您可以看到这方面的示例。我不知道今天有什么架构比 cc8.0 或 cc8.6（安培）“更新”，但未来（或任何其他）GPU 的行为肯定有可能在这里有所不同。
此 loop-around-atomicCAS 方法不同于以前的方法（lock/update/unlock, which also involves a loop for lock negotiation) the compiler toolchain used on Kepler and prior architectures 在没有正式的 SASS 指令时在共享内存上提供原子操作。

Shared Memory 的 atomicAdd with int 和 float 有不同 SASS

Shared Memory's atomicAdd with int and float have different SASS

cuda

atomic

sass

gpu-shared-memory