Subtract one from an unsigned long long variable atomically in a CUDA kernel

I have an unsigned long long count that I need to decrement atomically in a CUDA kernel. What is the correct way to do this?

atomicAdd(&count, -1); // Impossible as second argument is also required to be `unsigned long long`.
atomicSub(&count, 1); // Impossible, because `unsigned long long` is not supported.

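CUDA uses two's complement representation for integer types on all supported platforms. This means that subtracting a number n from a counter c is the same as adding the two's complement of n to c. The two's complement of 1 is its one's complement plus 1, which for a 64-bit integer is 0xffffffffffffffffULL. Integer types of the same size convert trivially between signed and unsigned representations, so instead of 0xffffffffffffffffULL, which looks a bit like a magic constant, we can write the more readable (unsigned long long int)(-1LL).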
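The identity behind this trick can be checked on the host without a GPU. The following is a minimal sketch in plain C; the helper name `decrement_by_add` is hypothetical and used only for illustration. It relies on the fact that unsigned arithmetic in C wraps modulo 2^N, so adding the unsigned conversion of -1 has the same effect as subtracting 1.

```c
#include <limits.h>

/* Hypothetical helper for illustration only: decrement a counter by
   adding the unsigned conversion of -1 instead of subtracting 1.
   Unsigned addition wraps around, so this yields c - 1. */
static unsigned long long decrement_by_add(unsigned long long c)
{
    return c + (unsigned long long)(-1LL);
}
```

For example, `decrement_by_add(5ULL)` evaluates to 4, and `(unsigned long long)(-1LL)` compares equal to `ULLONG_MAX`. The same addend is what gets passed to `atomicAdd` on the device.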

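Below is a short CUDA program that initializes an unsigned long long int counter to a non-zero starting value, then decrements it once per thread in the launched kernel. Note that error checking has been omitted for brevity and clarity of exposition, which is not something one would want to do in real code.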

#include <stdio.h>
#include <stdlib.h>

#define BLOCK_COUNT  (2)
#define THREAD_COUNT (256)

__global__ void kernel (unsigned long long int *counter)
{
    atomicAdd (counter, (unsigned long long int)(-1LL));
}

int main (void)
{
    unsigned long long int counter;
    unsigned long long int *counter_d = 0;
    cudaMalloc ((void**)&counter_d, sizeof (*counter_d));
    cudaMemset (counter_d, 0x01, sizeof (*counter_d));
    cudaMemcpy (&counter, counter_d, sizeof counter, cudaMemcpyDeviceToHost);
    printf ("counter before kernel = %llu\n", counter);
    printf ("decrement counter with %d threads\n", BLOCK_COUNT * THREAD_COUNT);
    kernel<<<BLOCK_COUNT, THREAD_COUNT>>>(counter_d);
    cudaMemcpy (&counter, counter_d, sizeof counter, cudaMemcpyDeviceToHost);
    printf ("counter after kernel  = %llu\n", counter);
    cudaFree (counter_d);
    cudaDeviceSynchronize ();
    return EXIT_SUCCESS;
}

The output of this program should look like this:

counter before kernel = 72340172838076673
decrement counter with 512 threads
counter after kernel  = 72340172838076161
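The two counter values can be reproduced on the host as a sanity check. This is a sketch under the assumption that unsigned long long is 8 bytes wide; the helper names `filled_counter` and `decrement_n` are hypothetical. Setting every byte of the counter to 0x01, as the cudaMemset call does, gives 0x0101010101010101, and 512 decrements then lower it by 512.

```c
#include <string.h>

/* Hypothetical helper: mimic the effect of cudaMemset with value 0x01
   by setting every byte of the counter to 0x01.
   Assumes unsigned long long is 8 bytes wide. */
static unsigned long long filled_counter(void)
{
    unsigned long long c;
    memset(&c, 0x01, sizeof c);
    return c;
}

/* Hypothetical helper: apply n decrements sequentially, each written
   as an addition of (unsigned long long)(-1LL), like the kernel does
   once per thread. */
static unsigned long long decrement_n(unsigned long long c, int n)
{
    for (int i = 0; i < n; i++) {
        c += (unsigned long long)(-1LL);
    }
    return c;
}
```

Here `filled_counter()` returns 72340172838076673 and `decrement_n(filled_counter(), 512)` returns 72340172838076161, matching the values printed before and after the kernel launch.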