Shuffle instruction in CUDA not working

Question

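I have a problem with the shuffle instruction in CUDA 5.0.

Here is a snippet of my kernel. It sits inside a loop. The printf calls are there only for debugging, because I can't use a normal debugger: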

...
tex_val = tex2D(srcTexRef, threadIdx.x + w, y_pos);
if (threadIdx.x == 0)
{
    left = left_value[y_pos];
}
else
{
    printf("thread %d; shfl value: %f \n", threadIdx.x, __shfl_up(value, 1));
    left = __shfl_up(value, 1);
}

printf("thread %d; value: %f; tex_val: %f; left: %f \n", threadIdx.x, value, tex_val, left);
...

From this I get the following output:

l0:  ITERATION 1
l1:  thread 0; value: 0; tex_val: 1; left: 4
l2: 
l3:  ITERATION 2
l4:  thread 1; shfl value: 0
l5:  thread 0; value: 5; tex_val: 1; left: 5
l6:  thread 1; value: 0; tex_val: 1; left: 0
l7: 
l8:  ITERATION 3
l9:  thread 1; shfl value: 0
l10: thread 2; shfl value: 1
l11: thread 0; value: 6; tex_val: 1; left: 6
l12: thread 1; value: 1; tex_val: 1; left: 0
l13: thread 2; value: 2; tex_val: 1; left: 1
...

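From the output I can see that thread 1 never gets the value from thread 0 in any iteration, even though thread 0 clearly holds a value (line l4: the shfl value is 0; line l5: the value is 5). Thread 2 and higher threads do get the values from the threads below them. Where is my mistake? Is it because of the branching?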

Answer 1

Yes, it is because of the branching. Quoting the CUDA programming guide, section B.14.2:

The __shfl() intrinsics permit exchanging of a variable between threads within a warp without use of shared memory. The exchange occurs simultaneously for all active threads within the warp, ...

and

Threads may only read data from another thread which is actively participating in the __shfl() command. If the target thread is inactive, the retrieved value is undefined.

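Within a branch, the active threads are the ones executing the same path; threads executing a different path are inactive. In your case, thread 0 takes the `if` path, so it is inactive while the other threads execute `__shfl_up` in the `else` path, and you cannot shuffle from it.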
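A minimal sketch of one way around this, assuming the shuffle can be moved out of the branch: let every thread of the warp execute the shuffle, and only afterwards pick between the shuffled value and the boundary value. The names `left_neighbour_kernel`, `left_neighbours` and `left_boundary` are hypothetical and not from the question; `left_boundary` stands in for `left_value[y_pos]`. The sketch uses `__shfl_up_sync`, which replaces `__shfl_up` in CUDA 9 and later; on CUDA 5.0 the same structure should apply with `__shfl_up(value, 1)`.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical kernel: each thread receives the value held by the thread to
// its left in the warp; thread 0 receives a boundary value instead.
__global__ void left_neighbour_kernel(const float* in, float* out, float left_boundary)
{
    const unsigned full_mask = 0xffffffffu;  // all 32 lanes take part
    float value = in[threadIdx.x];

    // The shuffle runs outside any branch, so lane 0 is active and
    // lane 1 can read from it.
    float from_left = __shfl_up_sync(full_mask, value, 1);

    // Branch only after the exchange has happened.
    out[threadIdx.x] = (threadIdx.x == 0) ? left_boundary : from_left;
}

// Hypothetical host helper: runs the kernel on exactly one warp (32 values).
// Returns an empty vector on wrong input size or on a CUDA error.
std::vector<float> left_neighbours(const std::vector<float>& in, float left_boundary)
{
    const size_t n = 32;
    std::vector<float> out;
    if (in.size() != n) return out;

    float* d_in = nullptr;
    float* d_out = nullptr;
    if (cudaMalloc(&d_in, n * sizeof(float)) != cudaSuccess) return out;
    if (cudaMalloc(&d_out, n * sizeof(float)) != cudaSuccess) {
        cudaFree(d_in);
        return out;
    }

    cudaMemcpy(d_in, in.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    left_neighbour_kernel<<<1, 32>>>(d_in, d_out, left_boundary);

    out.resize(n);
    // cudaMemcpy waits for the kernel and also reports earlier errors.
    cudaError_t err = cudaMemcpy(out.data(), d_out, n * sizeof(float),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    if (err != cudaSuccess) out.clear();
    return out;
}
```

The full mask requires that all 32 threads of the warp reach the shuffle, which is why the helper launches exactly 32 threads. In your kernel the equivalent change would be to call the shuffle once before the `if`, store the result, and assign it to `left` in the `else` path.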
