Why don't I see different/unique values of my variable after each atomicAdd?

Question

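I am new to CUDA. I tried to implement a trie data structure on the GPU, but it didn't work. I noticed that my atomicAdd was not behaving as I expected, so I ran some experiments with atomicAdd. I wrote this code: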

#include <cstdio>

//__device__ int *a; //I also tried the code with using this __device__
                     //variable and allocating it inside kernel instead
                     //using cudaMalloc. Same Result

__global__ void AtomicTestKernel (int*a)
{
    *a = 0;
    __syncthreads();
    for (int i = 0; i < 2; i++)
    {
        if (threadIdx.x % 2)
        {
            atomicAdd(a, 1);
            printf("threadsIndex = %d\t&\ta : %d\n",threadIdx.x,*a);
        }
        else
        {
            atomicAdd(a, 1);
            printf("threadsIndex = %d\t&\ta : %d\n", threadIdx.x, *a);
        }
    }
}

int main()
{
    int * d_a;
    cudaMalloc((void**)&d_a, sizeof(int));

    AtomicTestKernel << <1, 10 >> > (d_a);

    cudaDeviceSynchronize();

    return 0;
}

Correct me if I am wrong about this code:

1 - According to the CUDA Programming Guide (on atomic functions):

... In other words, no other thread can access this address until the operation is complete

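2 - The memory that int * d_a points to resides in global memory, and so does the memory behind the kernel argument int * a, because it was allocated with cudaMalloc (according to this 3-minute video: Udacity CUDA - Global Memory). Therefore all threads see the same int * a rather than each thread having its own copy.

3 - In the code above there is an atomicAdd before every printf, so I expected each printf to show a value of *a that differs from the previous one and is therefore unique.

But in the output I get, *a shows the same value many times. This is the result I get: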

threadsIndex = 0        &       a : 5
threadsIndex = 2        &       a : 5
threadsIndex = 4        &       a : 5
threadsIndex = 6        &       a : 5
threadsIndex = 8        &       a : 5
threadsIndex = 1        &       a : 10
threadsIndex = 3        &       a : 10
threadsIndex = 5        &       a : 10
threadsIndex = 7        &       a : 10
threadsIndex = 9        &       a : 10
threadsIndex = 0        &       a : 15
threadsIndex = 2        &       a : 15
threadsIndex = 4        &       a : 15
threadsIndex = 6        &       a : 15
threadsIndex = 8        &       a : 15
threadsIndex = 1        &       a : 20
threadsIndex = 3        &       a : 20
threadsIndex = 5        &       a : 20
threadsIndex = 7        &       a : 20
threadsIndex = 9        &       a : 20
Press any key to continue . . .

Answer 1

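Because the threads of a warp execute each instruction together, your code performs all the atomic instructions first and only then the printf, so each thread reads the result of all the atomic operations. In your output the even and odd threads take different branches, and the five threads of each branch run as a group, which is why the printed value goes up in steps of 5.

This is how the instructions execute within the warp: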

Instruction | threadId 1       | threadId 2       | *a
------------------------------------------------------------
atomicAdd   | increasing value | waiting          | 1
            | waiting          | increasing value | 2
------------- Warp has finished the atomicAdd instruction for all threads
reading *a  | read value       | read value       | 2

To read the value the variable held just before the atomic operation, use the return value of atomicAdd:

int previousValue = atomicAdd(a, 1);

You can find more information here: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#atomicadd
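As a minimal sketch of that suggestion, the program below stores the value returned by each atomicAdd instead of re-reading the counter afterwards. The names AtomicReturnKernel, collectPreviousValues, countDistinct and the seen buffer are hypothetical and exist only for this illustration; the launch configuration (1 block, 10 threads, 2 iterations) mirrors the question, and the counter is zeroed on the host with cudaMemset rather than inside the kernel.

```cuda
#include <cstdio>
#include <cassert>
#include <vector>
#include <algorithm>
#include <cuda_runtime.h>

// Each thread records the value returned by atomicAdd, i.e. the value the
// counter held immediately before that thread's own increment.
__global__ void AtomicReturnKernel(int *counter, int *seen, int iterations)
{
    for (int i = 0; i < iterations; i++)
    {
        int previousValue = atomicAdd(counter, 1);
        seen[i * blockDim.x + threadIdx.x] = previousValue;
        printf("threadIdx = %d\tprevious value = %d\n", threadIdx.x, previousValue);
    }
}

// Hypothetical host helper: launches one block and copies the recorded values back.
std::vector<int> collectPreviousValues(int threads, int iterations)
{
    const int total = threads * iterations;
    int *d_counter = nullptr;
    int *d_seen = nullptr;
    cudaMalloc((void**)&d_counter, sizeof(int));
    cudaMalloc((void**)&d_seen, total * sizeof(int));
    cudaMemset(d_counter, 0, sizeof(int)); // start the counter at 0

    AtomicReturnKernel<<<1, threads>>>(d_counter, d_seen, iterations);
    cudaDeviceSynchronize();

    std::vector<int> seen(total);
    cudaMemcpy(seen.data(), d_seen, total * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_counter);
    cudaFree(d_seen);
    return seen;
}

// Hypothetical host helper: counts how many different values were recorded.
int countDistinct(std::vector<int> values)
{
    std::sort(values.begin(), values.end());
    return (int)(std::unique(values.begin(), values.end()) - values.begin());
}

int main()
{
    std::vector<int> seen = collectPreviousValues(10, 2);
    // Every atomicAdd returns a different previous value, so all 20 are distinct.
    assert(countDistinct(seen) == (int)seen.size());
    printf("distinct values: %d of %d\n", countDistinct(seen), (int)seen.size());
    return 0;
}
```

The order of the printed lines is not guaranteed, but each returned value appears exactly once, because the return value is captured as part of the atomic operation itself rather than by a separate, later read of *a.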


cuda

atomic