CUDA 字节原子操作只导致一个线程动作
CUDA byte atomic operation to cause only one thread to act
我正在编写一个 CUDA 程序,它在共享内存中定义了一个数组。我需要做的是只允许一个线程写入这个数组中的每个索引,i。 e.到达此写入指令的第一个线程应更改其值,但同一 warp 或下一个 warp 中的任何其他线程都应读取写入的值。
这是代码片段:
char* seq_copied = seqs + (njobNew * halfLength); //this is the shared memory array
if (seq_copied[seq_1_index] == false) { //here is the condition that I need to check with only one thread
seq_copied[seq_1_index] = true; //and this is the write that should be written by only one thread
printf("copy seq_shared seq_1_index = %d, block = %d \n", seq_1_index, blockIdx.x);
}
现在发生的事情是 warp 中的所有线程都执行这些精确的指令序列,因此 if 条件中的剩余代码被执行了 32 次。我只需要执行一次。
我怎样才能做到这一点?
您可以为此使用 atomicCAS()
。它执行原子比较和交换操作。
此函数将测试一个变量,如果它符合特定条件(例如,false),它将用另一个值(例如,true)替换它。它将自动完成所有这些事情,即没有中断的可能性。
原子函数的 return 值在这种情况下为我们提供了有用的信息。如果上面示例中的 return 值为 false,那么我们可以确定它被替换为 true。我们还可以确定我们是 运行 的 "first" 线程进入这种情况,并且所有其他执行类似操作的线程的 return 值为 true,而不是 false。
这是一个有效的例子:
$ cat t327.cu
#include <stdio.h>
__global__ void k(){
__shared__ int flag;
if (threadIdx.x == 0) flag = 0;
__syncthreads();
int retval = atomicCAS(&flag, 0, 1);
printf("thread %d saw flag as %d\n", threadIdx.x, retval);
// could do if statement on retval here
}
int main(){
k<<<1,32>>>();
cudaDeviceSynchronize();
}
$ nvcc -o t327 t327.cu
$ cuda-memcheck ./t327
========= CUDA-MEMCHECK
thread 0 saw flag as 0
thread 1 saw flag as 1
thread 2 saw flag as 1
thread 3 saw flag as 1
thread 4 saw flag as 1
thread 5 saw flag as 1
thread 6 saw flag as 1
thread 7 saw flag as 1
thread 8 saw flag as 1
thread 9 saw flag as 1
thread 10 saw flag as 1
thread 11 saw flag as 1
thread 12 saw flag as 1
thread 13 saw flag as 1
thread 14 saw flag as 1
thread 15 saw flag as 1
thread 16 saw flag as 1
thread 17 saw flag as 1
thread 18 saw flag as 1
thread 19 saw flag as 1
thread 20 saw flag as 1
thread 21 saw flag as 1
thread 22 saw flag as 1
thread 23 saw flag as 1
thread 24 saw flag as 1
thread 25 saw flag as 1
thread 26 saw flag as 1
thread 27 saw flag as 1
thread 28 saw flag as 1
thread 29 saw flag as 1
thread 30 saw flag as 1
thread 31 saw flag as 1
========= ERROR SUMMARY: 0 errors
$
回应评论中的一个问题,我们可以通过创建一个仿照 the programming guide 中给出的 double atomicAdd()
函数的任意原子操作将其扩展为 char
大小的标志。基本思想是我们将使用支持的数据大小(例如 unsigned
)执行 atomicCAS,我们将转换所需的操作以有效支持 char
大小。这是通过将 char
地址转换为适当对齐的 unsigned
地址,然后移动 char
数量以在 [=16= 中的适当字节位置排列来完成的]值。
这是一个有效的例子:
$ cat t327.cu
#include <stdio.h>
__device__ char my_char_atomicCAS(char *addr, char cmp, char val){
unsigned *al_addr = reinterpret_cast<unsigned *> (((unsigned long long)addr) & (0xFFFFFFFFFFFFFFFCULL));
unsigned al_offset = ((unsigned)(((unsigned long long)addr) & 3)) * 8;
unsigned mask = 0xFFU;
mask <<= al_offset;
mask = ~mask;
unsigned sval = val;
sval <<= al_offset;
unsigned old = *al_addr, assumed, setval;
do {
assumed = old;
setval = assumed & mask;
setval |= sval;
old = atomicCAS(al_addr, assumed, setval);
} while (assumed != old);
return (char) ((assumed >> al_offset) & 0xFFU);
}
__global__ void k(){
__shared__ char flag[1024];
flag[threadIdx.x] = 0;
__syncthreads();
int retval = my_char_atomicCAS(flag+(threadIdx.x>>1), 0, 1);
printf("thread %d saw flag as %d\n", threadIdx.x, retval);
}
int main(){
k<<<1,32>>>();
cudaDeviceSynchronize();
}
$ nvcc -o t327 t327.cu
$ cuda-memcheck ./t327
========= CUDA-MEMCHECK
thread 0 saw flag as 0
thread 1 saw flag as 1
thread 2 saw flag as 0
thread 3 saw flag as 1
thread 4 saw flag as 0
thread 5 saw flag as 1
thread 6 saw flag as 0
thread 7 saw flag as 1
thread 8 saw flag as 0
thread 9 saw flag as 1
thread 10 saw flag as 0
thread 11 saw flag as 1
thread 12 saw flag as 0
thread 13 saw flag as 1
thread 14 saw flag as 0
thread 15 saw flag as 1
thread 16 saw flag as 0
thread 17 saw flag as 1
thread 18 saw flag as 0
thread 19 saw flag as 1
thread 20 saw flag as 0
thread 21 saw flag as 1
thread 22 saw flag as 0
thread 23 saw flag as 1
thread 24 saw flag as 0
thread 25 saw flag as 1
thread 26 saw flag as 0
thread 27 saw flag as 1
thread 28 saw flag as 0
thread 29 saw flag as 1
thread 30 saw flag as 0
thread 31 saw flag as 1
========= ERROR SUMMARY: 0 errors
$
以上是 char
大小的通用 atomicCAS
。这将允许您将任何 char
值交换为任何其他 char
值。在您的特定情况下,如果您只需要有效的布尔标志,则可以使用 atomicOr
提高此操作的效率,如评论中已经提到的那样。使用 atomicOr
可以消除上面自定义原子函数中的循环。这是一个有效的例子:
$ cat t327.cu
#include <stdio.h>
__device__ char my_char_atomic_flag(char *addr){
unsigned *al_addr = reinterpret_cast<unsigned *> (((unsigned long long)addr) & (0xFFFFFFFFFFFFFFFCULL));
unsigned al_offset = ((unsigned)(((unsigned long long)addr) & 3)) * 8;
unsigned my_bit = 1U << al_offset;
return (char) ((atomicOr(al_addr, my_bit) >> al_offset) & 0xFFU);
}
__global__ void k(){
__shared__ char flag[1024];
flag[threadIdx.x] = 0;
__syncthreads();
int retval = my_char_atomic_flag(flag+(threadIdx.x>>1));
printf("thread %d saw flag as %d\n", threadIdx.x, retval);
}
int main(){
k<<<1,32>>>();
cudaDeviceSynchronize();
}
$ nvcc -o t327 t327.cu
$ cuda-memcheck ./t327
========= CUDA-MEMCHECK
thread 0 saw flag as 0
thread 1 saw flag as 1
thread 2 saw flag as 0
thread 3 saw flag as 1
thread 4 saw flag as 0
thread 5 saw flag as 1
thread 6 saw flag as 0
thread 7 saw flag as 1
thread 8 saw flag as 0
thread 9 saw flag as 1
thread 10 saw flag as 0
thread 11 saw flag as 1
thread 12 saw flag as 0
thread 13 saw flag as 1
thread 14 saw flag as 0
thread 15 saw flag as 1
thread 16 saw flag as 0
thread 17 saw flag as 1
thread 18 saw flag as 0
thread 19 saw flag as 1
thread 20 saw flag as 0
thread 21 saw flag as 1
thread 22 saw flag as 0
thread 23 saw flag as 1
thread 24 saw flag as 0
thread 25 saw flag as 1
thread 26 saw flag as 0
thread 27 saw flag as 1
thread 28 saw flag as 0
thread 29 saw flag as 1
thread 30 saw flag as 0
thread 31 saw flag as 1
========= ERROR SUMMARY: 0 errors
$
这些 char
原子方法假定您已经分配了一个大小为 4 的倍数的 char
数组。使用 char
数组执行此操作是无效的例如,大小 3(并且只有 3 个线程)。
我正在编写一个 CUDA 程序,它在共享内存中定义了一个数组。我需要做的是只允许一个线程写入这个数组中的每个索引,i。 e.到达此写入指令的第一个线程应更改其值,但同一 warp 或下一个 warp 中的任何其他线程都应读取写入的值。
这是代码片段:
char* seq_copied = seqs + (njobNew * halfLength); //this is the shared memory array
if (seq_copied[seq_1_index] == false) { //here is the condition that I need to check with only one thread
seq_copied[seq_1_index] = true; //and this is the write that should be written by only one thread
printf("copy seq_shared seq_1_index = %d, block = %d \n", seq_1_index, blockIdx.x);
}
现在发生的事情是 warp 中的所有线程都执行这些精确的指令序列,因此 if 条件中的剩余代码被执行了 32 次。我只需要执行一次。
我怎样才能做到这一点?
您可以为此使用 atomicCAS()
。它执行原子比较和交换操作。
此函数将测试一个变量,如果它符合特定条件(例如,false),它将用另一个值(例如,true)替换它。它将自动完成所有这些事情,即没有中断的可能性。
原子函数的 return 值在这种情况下为我们提供了有用的信息。如果上面示例中的 return 值为 false,那么我们可以确定它被替换为 true。我们还可以确定我们是 运行 的 "first" 线程进入这种情况,并且所有其他执行类似操作的线程的 return 值为 true,而不是 false。
这是一个有效的例子:
$ cat t327.cu
#include <stdio.h>
__global__ void k(){
__shared__ int flag;
if (threadIdx.x == 0) flag = 0;
__syncthreads();
int retval = atomicCAS(&flag, 0, 1);
printf("thread %d saw flag as %d\n", threadIdx.x, retval);
// could do if statement on retval here
}
int main(){
k<<<1,32>>>();
cudaDeviceSynchronize();
}
$ nvcc -o t327 t327.cu
$ cuda-memcheck ./t327
========= CUDA-MEMCHECK
thread 0 saw flag as 0
thread 1 saw flag as 1
thread 2 saw flag as 1
thread 3 saw flag as 1
thread 4 saw flag as 1
thread 5 saw flag as 1
thread 6 saw flag as 1
thread 7 saw flag as 1
thread 8 saw flag as 1
thread 9 saw flag as 1
thread 10 saw flag as 1
thread 11 saw flag as 1
thread 12 saw flag as 1
thread 13 saw flag as 1
thread 14 saw flag as 1
thread 15 saw flag as 1
thread 16 saw flag as 1
thread 17 saw flag as 1
thread 18 saw flag as 1
thread 19 saw flag as 1
thread 20 saw flag as 1
thread 21 saw flag as 1
thread 22 saw flag as 1
thread 23 saw flag as 1
thread 24 saw flag as 1
thread 25 saw flag as 1
thread 26 saw flag as 1
thread 27 saw flag as 1
thread 28 saw flag as 1
thread 29 saw flag as 1
thread 30 saw flag as 1
thread 31 saw flag as 1
========= ERROR SUMMARY: 0 errors
$
回应评论中的一个问题,我们可以通过创建一个仿照 the programming guide 中给出的 double atomicAdd()
函数的任意原子操作将其扩展为 char
大小的标志。基本思想是我们将使用支持的数据大小(例如 unsigned
)执行 atomicCAS,我们将转换所需的操作以有效支持 char
大小。这是通过将 char
地址转换为适当对齐的 unsigned
地址,然后移动 char
数量以在 [=16= 中的适当字节位置排列来完成的]值。
这是一个有效的例子:
$ cat t327.cu
#include <stdio.h>
__device__ char my_char_atomicCAS(char *addr, char cmp, char val){
unsigned *al_addr = reinterpret_cast<unsigned *> (((unsigned long long)addr) & (0xFFFFFFFFFFFFFFFCULL));
unsigned al_offset = ((unsigned)(((unsigned long long)addr) & 3)) * 8;
unsigned mask = 0xFFU;
mask <<= al_offset;
mask = ~mask;
unsigned sval = val;
sval <<= al_offset;
unsigned old = *al_addr, assumed, setval;
do {
assumed = old;
setval = assumed & mask;
setval |= sval;
old = atomicCAS(al_addr, assumed, setval);
} while (assumed != old);
return (char) ((assumed >> al_offset) & 0xFFU);
}
__global__ void k(){
__shared__ char flag[1024];
flag[threadIdx.x] = 0;
__syncthreads();
int retval = my_char_atomicCAS(flag+(threadIdx.x>>1), 0, 1);
printf("thread %d saw flag as %d\n", threadIdx.x, retval);
}
int main(){
k<<<1,32>>>();
cudaDeviceSynchronize();
}
$ nvcc -o t327 t327.cu
$ cuda-memcheck ./t327
========= CUDA-MEMCHECK
thread 0 saw flag as 0
thread 1 saw flag as 1
thread 2 saw flag as 0
thread 3 saw flag as 1
thread 4 saw flag as 0
thread 5 saw flag as 1
thread 6 saw flag as 0
thread 7 saw flag as 1
thread 8 saw flag as 0
thread 9 saw flag as 1
thread 10 saw flag as 0
thread 11 saw flag as 1
thread 12 saw flag as 0
thread 13 saw flag as 1
thread 14 saw flag as 0
thread 15 saw flag as 1
thread 16 saw flag as 0
thread 17 saw flag as 1
thread 18 saw flag as 0
thread 19 saw flag as 1
thread 20 saw flag as 0
thread 21 saw flag as 1
thread 22 saw flag as 0
thread 23 saw flag as 1
thread 24 saw flag as 0
thread 25 saw flag as 1
thread 26 saw flag as 0
thread 27 saw flag as 1
thread 28 saw flag as 0
thread 29 saw flag as 1
thread 30 saw flag as 0
thread 31 saw flag as 1
========= ERROR SUMMARY: 0 errors
$
以上是 char
大小的通用 atomicCAS
。这将允许您将任何 char
值交换为任何其他 char
值。在您的特定情况下,如果您只需要有效的布尔标志,则可以使用 atomicOr
提高此操作的效率,如评论中已经提到的那样。使用 atomicOr
可以消除上面自定义原子函数中的循环。这是一个有效的例子:
$ cat t327.cu
#include <stdio.h>
__device__ char my_char_atomic_flag(char *addr){
unsigned *al_addr = reinterpret_cast<unsigned *> (((unsigned long long)addr) & (0xFFFFFFFFFFFFFFFCULL));
unsigned al_offset = ((unsigned)(((unsigned long long)addr) & 3)) * 8;
unsigned my_bit = 1U << al_offset;
return (char) ((atomicOr(al_addr, my_bit) >> al_offset) & 0xFFU);
}
__global__ void k(){
__shared__ char flag[1024];
flag[threadIdx.x] = 0;
__syncthreads();
int retval = my_char_atomic_flag(flag+(threadIdx.x>>1));
printf("thread %d saw flag as %d\n", threadIdx.x, retval);
}
int main(){
k<<<1,32>>>();
cudaDeviceSynchronize();
}
$ nvcc -o t327 t327.cu
$ cuda-memcheck ./t327
========= CUDA-MEMCHECK
thread 0 saw flag as 0
thread 1 saw flag as 1
thread 2 saw flag as 0
thread 3 saw flag as 1
thread 4 saw flag as 0
thread 5 saw flag as 1
thread 6 saw flag as 0
thread 7 saw flag as 1
thread 8 saw flag as 0
thread 9 saw flag as 1
thread 10 saw flag as 0
thread 11 saw flag as 1
thread 12 saw flag as 0
thread 13 saw flag as 1
thread 14 saw flag as 0
thread 15 saw flag as 1
thread 16 saw flag as 0
thread 17 saw flag as 1
thread 18 saw flag as 0
thread 19 saw flag as 1
thread 20 saw flag as 0
thread 21 saw flag as 1
thread 22 saw flag as 0
thread 23 saw flag as 1
thread 24 saw flag as 0
thread 25 saw flag as 1
thread 26 saw flag as 0
thread 27 saw flag as 1
thread 28 saw flag as 0
thread 29 saw flag as 1
thread 30 saw flag as 0
thread 31 saw flag as 1
========= ERROR SUMMARY: 0 errors
$
这些 char
原子方法假定您已经分配了一个大小为 4 的倍数的 char
数组。使用 char
数组执行此操作是无效的例如,大小 3(并且只有 3 个线程)。