Shuffle instruction in CUDA not working

I'm having a problem with the shuffle instruction in CUDA 5.0.

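Here is a snippet of my kernel. It sits inside a loop. The printf calls are there only for debugging, because I can't use a normal debugger: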

...
tex_val = tex2D(srcTexRef, threadIdx.x + w, y_pos);
if (threadIdx.x == 0)
{
    left = left_value[y_pos];
}
else
{
    printf("thread %d; shfl value: %f \n", threadIdx.x, __shfl_up(value, 1));
    left = __shfl_up(value, 1);
}

printf("thread %d; value: %f; tex_val: %f; left: %f \n", threadIdx.x, value, tex_val, left);
...

This gives me the following output:

l0:  ITERATION 1
l1:  thread 0; value: 0; tex_val: 1; left: 4
l2: 
l3:  ITERATION 2
l4:  thread 1; shfl value: 0
l5:  thread 0; value: 5; tex_val: 1; left: 5
l6:  thread 1; value: 0; tex_val: 1; left: 0
l7: 
l8:  ITERATION 3
l9:  thread 1; shfl value: 0
l10: thread 2; shfl value: 1
l11: thread 0; value: 6; tex_val: 1; left: 6
l12: thread 1; value: 1; tex_val: 1; left: 0
l13: thread 2; value: 2; tex_val: 1; left: 1
...

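From the output I can see that thread 1 never gets the value from thread 0 in any iteration, even though thread 0 clearly has a value (line 4: shfl value is 0; line 5: value is 5). Thread 2 and higher threads do get values from the lower threads. Where am I going wrong? Is it because of the branching?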

Yes, it is because of the branching. Quoting the CUDA programming guide, B.14.2:

The __shfl() intrinsics permit exchanging of a variable between threads within a warp without use of shared memory. The exchange occurs simultaneously for all active threads within the warp, ...

Threads may only read data from another thread which is actively participating in the __shfl() command. If the target thread is inactive, the retrieved value is undefined.

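Within a branch, the active threads are those executing the same path; threads executing a different path are inactive. In your case, thread 0 takes the if path, so it is inactive when the other threads execute __shfl_up in the else path, and you cannot shuffle from it.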
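
One way to avoid this, sketched below, is to execute the shuffle unconditionally and keep only the selection of the result inside the conditional. This is a minimal illustration, not your actual kernel: the kernel name shift_from_left, the helper run_shift_from_left_demo, the test values, and the single 32-thread block are all assumptions made for the example. It also uses __shfl_up_sync with a full-warp mask, which is the form required by CUDA 9 and later; on the CUDA 5.0 toolkit the equivalent call would be __shfl_up(value, 1).

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical demo kernel: a single warp in which lane i holds value = i + 10.
__global__ void shift_from_left(const float *left_value, float *out)
{
    float value = static_cast<float>(threadIdx.x) + 10.0f;

    // All 32 lanes execute the shuffle, so lane 0 is an active source for lane 1.
    // (CUDA 5.0 equivalent: __shfl_up(value, 1).)
    float shuffled = __shfl_up_sync(0xffffffffu, value, 1);

    // Only the choice of which result to keep is conditional.
    float left = (threadIdx.x == 0) ? left_value[0] : shuffled;

    out[threadIdx.x] = left;
}

// Hypothetical host helper: launches one warp and checks that lane 0 got the
// boundary value and every other lane got its left neighbour's value.
bool run_shift_from_left_demo()
{
    const int n = 32;                 // exactly one full warp
    float h_left = -1.0f;             // stand-in for left_value[y_pos]
    float h_out[n];
    float *d_left = nullptr;
    float *d_out = nullptr;

    if (cudaMalloc(&d_left, sizeof(float)) != cudaSuccess) return false;
    if (cudaMalloc(&d_out, n * sizeof(float)) != cudaSuccess) {
        cudaFree(d_left);
        return false;
    }

    cudaMemcpy(d_left, &h_left, sizeof(float), cudaMemcpyHostToDevice);
    shift_from_left<<<1, n>>>(d_left, d_out);
    cudaError_t err = cudaMemcpy(h_out, d_out, n * sizeof(float),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_left);
    cudaFree(d_out);
    if (err != cudaSuccess) return false;

    bool ok = (h_out[0] == h_left);
    for (int i = 1; i < n; ++i) {
        ok = ok && (h_out[i] == static_cast<float>(i - 1) + 10.0f);
    }
    return ok;
}
```

Calling run_shift_from_left_demo() from main should return true on a device that supports warp shuffles (compute capability 3.0 or higher). The same idea applies to the kernel in the question: hoist the __shfl_up call (and the debugging printf that also calls it) out of the else block, provided every thread of the warp reaches that point in the loop.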