CUDA atomicAdd for doubles 定义错误

CUDA atomicAdd for doubles definition error

在以前的 CUDA 版本中,没有为双精度实现 atomicAdd,因此像 here 这样的实现很常见。使用新的 CUDA 8 RC,当我尝试编译包含此类功能的代码时,我 运行 遇到了麻烦。我猜这是因为使用 Pascal 和 Compute Capability 6.0,添加了 atomicAdd 的本机双版本,但不知何故,对于以前的 Compute Capabilities 并没有正确忽略它。

下面的代码用于编译 运行 以前的 CUDA 版本没问题,但现在我得到这个编译错误:

test.cu(3): error: function "atomicAdd(double *, double)" has already been defined

但是如果我删除我的实现,我反而会得到这个错误:

test.cu(33): error: no instance of overloaded function "atomicAdd" matches the argument list
            argument types are: (double *, double)

我应该补充一点,如果我用 -arch=sm_35 或类似的编译,我只会看到这个。如果我用 -arch=sm_60 编译,我会得到预期的行为,即只有第一个错误,并且在第二种情况下编译成功。

编辑:此外,它是 atomicAdd 特有的——如果我更改名称,效果很好。

看起来确实像是编译器错误。其他人可以确认是这种情况吗?

示例代码:

__device__ double atomicAdd(double* address, double val)
{
    unsigned long long int* address_as_ull = (unsigned long long int*)address;
    unsigned long long int old = *address_as_ull, assumed;
    do {
        assumed = old;
        old = atomicCAS(address_as_ull, assumed,
                __double_as_longlong(val + __longlong_as_double(assumed)));
    } while (assumed != old);
    return __longlong_as_double(old);
}

__global__ void kernel(double *a)
{
    double b=1.3;
    atomicAdd(a,b);
}

int main(int argc, char **argv)
{
    double *a;
    cudaMalloc(&a,sizeof(double));

    kernel<<<1,1>>>(a);

    cudaFree(a);
    return 0;
}

编辑:我从认识到这个问题的 Nvidia 那里得到了答复,以下是开发人员对此的评价:

The sm_60 architecture, that is newly supported in CUDA 8.0, has native fp64 atomicAdd function. Because of the limitations of our toolchain and CUDA language, the declaration of this function needs to be present even when the code is not being specifically compiled for sm_60. This causes a problem in your code because you also define a fp64 atomicAdd function.

CUDA builtin functions such as atomicAdd are implementation-defined and can be changed between CUDA releases. Users should not define functions with the same names as any CUDA builtin functions. We would suggest you to rename your atomicAdd function to one that is not the same as any CUDA builtin functions.

这种 atomicAdd 风格是为计算能力 6.0 引入的新方法。您可以使用宏定义

保留您之前实现的其他计算功能来保护它
#if !defined(__CUDA_ARCH__) || __CUDA_ARCH__ >= 600
#else
<... place here your own pre-pascal atomicAdd definition ...>
#endif

这个名为架构识别宏的宏被记录在案here:

5.7.4. Virtual Architecture Identification Macro

The architecture identification macro __CUDA_ARCH__ is assigned a three-digit value string xy0 (ending in a literal 0) during each nvcc compilation stage 1 that compiles for compute_xy.

This macro can be used in the implementation of GPU functions for determining the virtual architecture for which it is currently being compiled. The host code (the non-GPU code) must not depend on it.

我假设 NVIDIA 没有将它放在以前的 CC 中,以避免用户定义它的冲突,而不是移动到计算能力 >= 6.x。我不认为这是一个 BUG,而是一个发布交付实践。

编辑:宏守卫不完整(已修复)——这里是一个完整的例子。

#if !defined(__CUDA_ARCH__) || __CUDA_ARCH__ >= 600
#else
__device__ double atomicAdd(double* a, double b) { return b; }
#endif

__device__ double s_global ;
__global__ void kernel () { atomicAdd (&s_global, 1.0) ; }


int main (int argc, char* argv[])
{
        kernel<<<1,1>>> () ;
        return ::cudaDeviceSynchronize () ;
}

编译:

$> nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2016 NVIDIA Corporation
Built on Wed_May__4_21:01:56_CDT_2016
Cuda compilation tools, release 8.0, V8.0.26

命令行(均成功):

$> nvcc main.cu -arch=sm_60
$> nvcc main.cu -arch=sm_35

您可能会发现它与包含文件一起工作的原因:sm_60_atomic_functions.h,如果 __CUDA_ARCH__ 低于 600,则不声明该方法。