CUDA 字节原子操作只导致一个线程动作

Question

我正在编写一个 CUDA 程序，它在共享内存中定义了一个数组。我需要做的是只允许一个线程写入这个数组中的每个索引，i。 e.到达此写入指令的第一个线程应更改其值，但同一 warp 或下一个 warp 中的任何其他线程都应读取写入的值。

这是代码片段：

char* seq_copied = seqs + (njobNew * halfLength); //this is the shared memory array
if (seq_copied[seq_1_index] == false) { //here is the condition that I need to check with only one thread
    seq_copied[seq_1_index] = true; //and this is the write that should be written by only one thread
    printf("copy seq_shared seq_1_index = %d,  block = %d \n", seq_1_index, blockIdx.x);
}

现在发生的事情是 warp 中的所有线程都执行这些精确的指令序列，因此 if 条件中的剩余代码被执行了 32 次。我只需要执行一次。

我怎样才能做到这一点？

Answer 1

您可以为此使用 atomicCAS()。它执行原子比较和交换操作。

此函数将测试一个变量，如果它符合特定条件（例如，false），它将用另一个值（例如，true）替换它。它将自动完成所有这些事情，即没有中断的可能性。

原子函数的 return 值在这种情况下为我们提供了有用的信息。如果上面示例中的 return 值为 false，那么我们可以确定它被替换为 true。我们还可以确定我们是运行的 "first" 线程进入这种情况，并且所有其他执行类似操作的线程的 return 值为 true，而不是 false。

这是一个有效的例子：

$ cat t327.cu
#include <stdio.h>

__global__ void k(){

  __shared__ int flag;
  if (threadIdx.x == 0) flag = 0;
  __syncthreads();

  int retval = atomicCAS(&flag, 0, 1);
  printf("thread %d saw flag as %d\n", threadIdx.x, retval);
  // could do if statement on retval here
}


int main(){

  k<<<1,32>>>();
  cudaDeviceSynchronize();
}
$ nvcc -o t327 t327.cu
$ cuda-memcheck ./t327
========= CUDA-MEMCHECK
thread 0 saw flag as 0
thread 1 saw flag as 1
thread 2 saw flag as 1
thread 3 saw flag as 1
thread 4 saw flag as 1
thread 5 saw flag as 1
thread 6 saw flag as 1
thread 7 saw flag as 1
thread 8 saw flag as 1
thread 9 saw flag as 1
thread 10 saw flag as 1
thread 11 saw flag as 1
thread 12 saw flag as 1
thread 13 saw flag as 1
thread 14 saw flag as 1
thread 15 saw flag as 1
thread 16 saw flag as 1
thread 17 saw flag as 1
thread 18 saw flag as 1
thread 19 saw flag as 1
thread 20 saw flag as 1
thread 21 saw flag as 1
thread 22 saw flag as 1
thread 23 saw flag as 1
thread 24 saw flag as 1
thread 25 saw flag as 1
thread 26 saw flag as 1
thread 27 saw flag as 1
thread 28 saw flag as 1
thread 29 saw flag as 1
thread 30 saw flag as 1
thread 31 saw flag as 1
========= ERROR SUMMARY: 0 errors
$

回应评论中的一个问题，我们可以通过创建一个仿照 the programming guide 中给出的 double atomicAdd() 函数的任意原子操作将其扩展为 char 大小的标志。基本思想是我们将使用支持的数据大小（例如 unsigned）执行 atomicCAS，我们将转换所需的操作以有效支持 char 大小。这是通过将 char 地址转换为适当对齐的 unsigned 地址，然后移动 char 数量以在 [=16= 中的适当字节位置排列来完成的]值。

这是一个有效的例子：

$ cat t327.cu
#include <stdio.h>
__device__ char my_char_atomicCAS(char *addr, char cmp, char val){
  unsigned *al_addr = reinterpret_cast<unsigned *> (((unsigned long long)addr) & (0xFFFFFFFFFFFFFFFCULL));
  unsigned al_offset = ((unsigned)(((unsigned long long)addr) & 3)) * 8;
  unsigned mask = 0xFFU;
  mask <<= al_offset;
  mask = ~mask;
  unsigned sval = val;
  sval <<= al_offset;
  unsigned old = *al_addr, assumed, setval;
  do {
        assumed = old;
        setval = assumed & mask;
        setval |= sval;
        old = atomicCAS(al_addr, assumed, setval);
    } while (assumed != old);
  return (char) ((assumed >> al_offset) & 0xFFU);
}

__global__ void k(){

  __shared__ char flag[1024];
  flag[threadIdx.x] = 0;
  __syncthreads();

  int retval = my_char_atomicCAS(flag+(threadIdx.x>>1), 0, 1);
  printf("thread %d saw flag as %d\n", threadIdx.x, retval);
}


int main(){
  k<<<1,32>>>();
  cudaDeviceSynchronize();
}
$ nvcc -o t327 t327.cu
$ cuda-memcheck ./t327
========= CUDA-MEMCHECK
thread 0 saw flag as 0
thread 1 saw flag as 1
thread 2 saw flag as 0
thread 3 saw flag as 1
thread 4 saw flag as 0
thread 5 saw flag as 1
thread 6 saw flag as 0
thread 7 saw flag as 1
thread 8 saw flag as 0
thread 9 saw flag as 1
thread 10 saw flag as 0
thread 11 saw flag as 1
thread 12 saw flag as 0
thread 13 saw flag as 1
thread 14 saw flag as 0
thread 15 saw flag as 1
thread 16 saw flag as 0
thread 17 saw flag as 1
thread 18 saw flag as 0
thread 19 saw flag as 1
thread 20 saw flag as 0
thread 21 saw flag as 1
thread 22 saw flag as 0
thread 23 saw flag as 1
thread 24 saw flag as 0
thread 25 saw flag as 1
thread 26 saw flag as 0
thread 27 saw flag as 1
thread 28 saw flag as 0
thread 29 saw flag as 1
thread 30 saw flag as 0
thread 31 saw flag as 1
========= ERROR SUMMARY: 0 errors
$

以上是 char 大小的通用 atomicCAS。这将允许您将任何 char 值交换为任何其他 char 值。在您的特定情况下，如果您只需要有效的布尔标志，则可以使用 atomicOr 提高此操作的效率，如评论中已经提到的那样。使用 atomicOr 可以消除上面自定义原子函数中的循环。这是一个有效的例子：

$ cat t327.cu
#include <stdio.h>
__device__ char my_char_atomic_flag(char *addr){
  unsigned *al_addr = reinterpret_cast<unsigned *> (((unsigned long long)addr) & (0xFFFFFFFFFFFFFFFCULL));
  unsigned al_offset = ((unsigned)(((unsigned long long)addr) & 3)) * 8;
  unsigned my_bit = 1U << al_offset;
  return (char) ((atomicOr(al_addr, my_bit) >> al_offset) & 0xFFU);
}

__global__ void k(){

  __shared__ char flag[1024];
  flag[threadIdx.x] = 0;
  __syncthreads();

  int retval = my_char_atomic_flag(flag+(threadIdx.x>>1));
  printf("thread %d saw flag as %d\n", threadIdx.x, retval);
}


int main(){
  k<<<1,32>>>();
  cudaDeviceSynchronize();
}
$ nvcc -o t327 t327.cu
$ cuda-memcheck ./t327
========= CUDA-MEMCHECK
thread 0 saw flag as 0
thread 1 saw flag as 1
thread 2 saw flag as 0
thread 3 saw flag as 1
thread 4 saw flag as 0
thread 5 saw flag as 1
thread 6 saw flag as 0
thread 7 saw flag as 1
thread 8 saw flag as 0
thread 9 saw flag as 1
thread 10 saw flag as 0
thread 11 saw flag as 1
thread 12 saw flag as 0
thread 13 saw flag as 1
thread 14 saw flag as 0
thread 15 saw flag as 1
thread 16 saw flag as 0
thread 17 saw flag as 1
thread 18 saw flag as 0
thread 19 saw flag as 1
thread 20 saw flag as 0
thread 21 saw flag as 1
thread 22 saw flag as 0
thread 23 saw flag as 1
thread 24 saw flag as 0
thread 25 saw flag as 1
thread 26 saw flag as 0
thread 27 saw flag as 1
thread 28 saw flag as 0
thread 29 saw flag as 1
thread 30 saw flag as 0
thread 31 saw flag as 1
========= ERROR SUMMARY: 0 errors
$

这些 char 原子方法假定您已经分配了一个大小为 4 的倍数的 char 数组。使用 char 数组执行此操作是无效的例如，大小 3（并且只有 3 个线程）。

CUDA 字节原子操作只导致一个线程动作

CUDA byte atomic operation to cause only one thread to act

cuda

atomic

shared-memory