How to reduce in CUDA if __syncthreads can't be called inside conditional branches?

Question

The reduction method suggested by NVIDIA uses __syncthreads() inside a conditional branch, e.g.:

if (blockSize >= 512) { if (tid < 256) { sdata[tid] += sdata[tid + 256]; } __syncthreads(); }

or

for (unsigned int s=blockDim.x/2; s>32; s>>=1)
{
    if (tid < s)
        sdata[tid] += sdata[tid + s];
    __syncthreads();
}

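In the second example, __syncthreads() is inside the body of the for loop, which is also a conditional branch.

However, a number of questions on SO raise the problem of __syncthreads() in conditional branches (e.g. Can I use __syncthreads() after having dropped threads? and conditional syncthreads & deadlock (or not)), and the answers say that __syncthreads() in a conditional branch may lead to a deadlock. Consequently, the reduction method suggested by NVIDIA may deadlock (if one believes the documentation on which the answers are based).

Furthermore, if __syncthreads() can't be used in conditional branches, then I'm afraid that many basic operations are blocked, and reduction is just one example.

So how do I do a reduction in CUDA without using __syncthreads() in conditional branches? Or is this an error in the documentation?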

Answer 1

The limitation is not

__syncthreads cannot be used in conditional branches

The limitation is

__syncthreads cannot be used in branches which will not be traversed by all threads at the same time

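Notice that in both examples you gave, __syncthreads is not covered by a condition that depends on the thread ID (or on some per-thread data). In the first case, blockSize is a template parameter, which does not depend on the thread ID. In the second case, it is likewise placed after the if.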
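To make the first case concrete, here is a minimal sketch of a kernel with blockSize as a template parameter. The names reduceTemplated and sumWith256Threads are hypothetical, error checking is omitted for brevity, and the sketch assumes blockSize is a power of two no larger than 512, equal to the number of threads per block, and that the input length is a multiple of it. Unlike NVIDIA's version, it keeps a barrier at every step instead of unrolling the last warp.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: each block sums blockSize consecutive ints.
// Assumption: launched with exactly blockSize threads per block.
template <unsigned int blockSize>
__global__ void reduceTemplated(const int* in, int* out)
{
    __shared__ int sdata[blockSize];
    unsigned int tid = threadIdx.x;
    sdata[tid] = in[blockIdx.x * blockSize + tid];
    __syncthreads();

    // blockSize is a compile-time constant, so each outer `if` is taken by
    // every thread of the block or by none. The inner, thread-dependent
    // `if` is always closed before __syncthreads() is called.
    if (blockSize >= 512) { if (tid < 256) { sdata[tid] += sdata[tid + 256]; } __syncthreads(); }
    if (blockSize >= 256) { if (tid < 128) { sdata[tid] += sdata[tid + 128]; } __syncthreads(); }
    if (blockSize >= 128) { if (tid <  64) { sdata[tid] += sdata[tid +  64]; } __syncthreads(); }
    if (blockSize >=  64) { if (tid <  32) { sdata[tid] += sdata[tid +  32]; } __syncthreads(); }
    if (blockSize >=  32) { if (tid <  16) { sdata[tid] += sdata[tid +  16]; } __syncthreads(); }
    if (blockSize >=  16) { if (tid <   8) { sdata[tid] += sdata[tid +   8]; } __syncthreads(); }
    if (blockSize >=   8) { if (tid <   4) { sdata[tid] += sdata[tid +   4]; } __syncthreads(); }
    if (blockSize >=   4) { if (tid <   2) { sdata[tid] += sdata[tid +   2]; } __syncthreads(); }
    if (blockSize >=   2) { if (tid <   1) { sdata[tid] += sdata[tid +   1]; } __syncthreads(); }

    if (tid == 0) out[blockIdx.x] = sdata[0];
}

// Hypothetical host helper: sums h_in using 256 threads per block and
// adds up the per-block partial sums on the host.
int sumWith256Threads(const std::vector<int>& h_in)
{
    constexpr unsigned int kThreads = 256;
    assert(!h_in.empty() && h_in.size() % kThreads == 0);
    const unsigned int blocks = static_cast<unsigned int>(h_in.size() / kThreads);

    int* d_in = nullptr;
    int* d_out = nullptr;
    cudaMalloc(&d_in, h_in.size() * sizeof(int));
    cudaMalloc(&d_out, blocks * sizeof(int));
    cudaMemcpy(d_in, h_in.data(), h_in.size() * sizeof(int), cudaMemcpyHostToDevice);

    reduceTemplated<kThreads><<<blocks, kThreads>>>(d_in, d_out);

    std::vector<int> partial(blocks);
    cudaMemcpy(partial.data(), d_out, blocks * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);

    int total = 0;
    for (int p : partial) total += p;
    return total;
}
```

Calling sumWith256Threads on a vector of 512 ones should return 512.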

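Yes, the for loop's s > 32 is a condition, but it is a condition whose truth value does not depend on the thread or its data in any way. blockDim.x is the same for all threads, and all threads apply exactly the same modifications to s. This means that all threads reach __syncthreads at exactly the same point in their control flow, which is perfectly OK.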
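The loop form can be sketched as a complete block-level sum. This is an illustration under assumptions, not NVIDIA's exact code: blockSum and sumOnDevice are hypothetical names, the loop runs down to s > 0 rather than stopping at s > 32 (so no warp-level tail is needed), blockDim.x is assumed to be a power of two, and error checking is omitted.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: each block writes the sum of its slice of `in`.
// Assumption: blockDim.x is a power of two.
__global__ void blockSum(const float* in, float* out, unsigned int n)
{
    extern __shared__ float sdata[];
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + tid;

    // Out-of-range threads load 0 instead of returning early, so every
    // thread of the block still reaches each __syncthreads() below.
    sdata[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();

    // The loop condition depends only on blockDim.x and s, which are the
    // same in every thread, so all threads run the same number of iterations.
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) {                      // thread-dependent condition
            sdata[tid] += sdata[tid + s];
        }                                   // closed before the barrier
        __syncthreads();
    }

    if (tid == 0) out[blockIdx.x] = sdata[0];
}

// Hypothetical host helper: launches blockSum and adds the per-block
// partial sums on the host. Assumes h_in is not empty.
float sumOnDevice(const std::vector<float>& h_in, unsigned int threads = 256)
{
    assert(!h_in.empty());
    assert(threads > 0 && (threads & (threads - 1)) == 0);  // power of two
    const unsigned int n = static_cast<unsigned int>(h_in.size());
    const unsigned int blocks = (n + threads - 1) / threads;

    float* d_in = nullptr;
    float* d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, blocks * sizeof(float));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    blockSum<<<blocks, threads, threads * sizeof(float)>>>(d_in, d_out, n);

    std::vector<float> partial(blocks);
    cudaMemcpy(partial.data(), d_out, blocks * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);

    float total = 0.0f;
    for (float p : partial) total += p;
    return total;
}
```

Note the guarded load at the top: a thread that returned early for i >= n would skip the barriers that the rest of its block executes, which is exactly the situation the limitation forbids.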

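The other case, where you cannot use __syncthreads, is a condition that can be true for some threads and false for others. In that case, you have to close all such conditions before calling __syncthreads. So instead of this: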

if (threadIdx.x < SOME_CONSTANT)
{
  operation1();
  __syncthreads();
  operation2();
}

You have to do this:

if (threadIdx.x < SOME_CONSTANT)
{
  operation1();
}
__syncthreads();
if (threadIdx.x < SOME_CONSTANT)
{
  operation2();
}
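As a runnable illustration of this pattern, the sketch below fills in operation1 and operation2 with made-up work: the first ACTIVE threads write to shared memory, all threads meet at the barrier, and the same ACTIVE threads then read a value written by a different thread. BLOCK, ACTIVE (standing in for SOME_CONSTANT), reverseActive and runReverseActive are all hypothetical, and error checking is omitted.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

constexpr int BLOCK  = 128;  // hypothetical threads per block
constexpr int ACTIVE = 64;   // stands in for SOME_CONSTANT

// Hypothetical kernel: reverses the first ACTIVE elements of `in` into `out`.
__global__ void reverseActive(const int* in, int* out)
{
    __shared__ int buf[ACTIVE];
    int tid = threadIdx.x;

    if (tid < ACTIVE) {
        buf[tid] = in[tid];                 // plays the role of operation1()
    }
    __syncthreads();                        // reached by all BLOCK threads
    if (tid < ACTIVE) {
        out[tid] = buf[ACTIVE - 1 - tid];   // plays the role of operation2():
                                            // reads another thread's write
    }
}

// Hypothetical host helper: runs one block of BLOCK threads on ACTIVE ints.
std::vector<int> runReverseActive(const std::vector<int>& h_in)
{
    assert(h_in.size() == static_cast<size_t>(ACTIVE));

    int* d_in = nullptr;
    int* d_out = nullptr;
    cudaMalloc(&d_in, ACTIVE * sizeof(int));
    cudaMalloc(&d_out, ACTIVE * sizeof(int));
    cudaMemcpy(d_in, h_in.data(), ACTIVE * sizeof(int), cudaMemcpyHostToDevice);

    reverseActive<<<1, BLOCK>>>(d_in, d_out);

    std::vector<int> h_out(ACTIVE);
    cudaMemcpy(h_out.data(), d_out, ACTIVE * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return h_out;
}
```

The barrier is what makes the read in the second block safe: without it, a thread could read buf before the thread responsible for that slot has written it.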

Both of the examples you gave demonstrate this as well: the thread-ID-dependent condition is closed before __syncthreads is called.

c++

parallel-processing

reduce

synchronization

cuda