Is mask adaptive in __shfl_up_sync call?

Question

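Basically, this is a concrete version of an earlier question. Suppose a warp needs to process 4 objects (say, pixels in an image), with every 8 lanes grouped together to process one object. Now I need to do internal shuffle operations while processing one object (i.e. among the 8 lanes of that object). It works for each object when I simply set mask to 0xff: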

uint32_t mask = 0xff;
__shfl_up_sync(mask,val,1);

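However, as I understand it, setting mask to 0xff forces lane0:lane7 of object0 (or object3? I am also stuck on this point) to participate, yet after a large number of trials I have confirmed that the usage above works for every object. So my question is: does the __shfl_up_sync call adapt the mask argument so that the corresponding lanes are forced to participate?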

Update
Actually, this question comes from the code of libSGM that I tried to parse. In particular, it solves the minimal-cost path with dynamic programming in a decently parallel way. Once the program reaches this line, after launching the kernel aggregate_vertical_path_kernel with the execution configuration:

//MAX_DISPARITY is 128 and BLOCK_SIZE is 256
//Basically, each block serves to process 32 pixels in which each warp serves to process 4.
const int gdim = (width + PATHS_PER_BLOCK - 1) / PATHS_PER_BLOCK;
const int bdim = BLOCK_SIZE;
aggregate_vertical_path_kernel<1, MAX_DISPARITY><<<gdim, bdim, 0, stream>>>(...)

an object dp is instantiated from DynamicProgramming<DP_BLOCK_SIZE, SUBGROUP_SIZE>:

static constexpr unsigned int DP_BLOCK_SIZE = 16u;
...
//MAX_DISPARITY is 128
static const unsigned int SUBGROUP_SIZE = MAX_DISPARITY / DP_BLOCK_SIZE;
...
DynamicProgramming<DP_BLOCK_SIZE, SUBGROUP_SIZE> dp;

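As the program continues, dp.update() is invoked, in which __shfl_up_sync is used to access the last element of the previous DP_BLOCK and __shfl_down_sync is used to access the first element of the following DP_BLOCK. In addition, every 8 lanes in a warp are grouped together: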

//So every 8 threads are grouped together to process one pixel, with each lane handling one DP_BLOCK of that pixel.
const unsigned int lane_id = threadIdx.x % SUBGROUP_SIZE;

Here it comes: once the program reaches this line:

//mask is specified as 0xff(255)
const uint32_t prev =__shfl_up_sync(mask, dp[DP_BLOCK_SIZE - 1], 1);

every lane in the warp does shuffle with the same mask, 0xff, which leads to my question above.

Answer 1

Doing this is going to be confusing:

lane0:lane7 | lane0:lane7 | lane0:lane7 | lane0:lane7

because a warp does not have 4 sets of lanes numbered lane 0 to lane 7. It has a single set of lanes, numbered lane 0 to lane 31:

lane 31 | lane 30 | ... | lane 0

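Note that I have ordered the lanes this way because it corresponds to the bit order in the mask. It should be evident which bit corresponds to which lane: bit 0 of the mask parameter corresponds to lane 0, and so on.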
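The bit-to-lane correspondence can be spelled out in a few lines. This is only an illustrative sketch: the helper names lane_bit and group_mask are hypothetical and come from neither CUDA nor libSGM.

```cuda
#include <cstdint>

// Hypothetical helper: the mask bit that names one lane of the warp (lane in 0..31).
__host__ __device__ constexpr uint32_t lane_bit(unsigned lane)
{
    return 1u << lane;
}

// Hypothetical helper: the mask bits naming the 8 consecutive lanes of group g (g in 0..3).
__host__ __device__ constexpr uint32_t group_mask(unsigned g)
{
    return 0xffu << (8u * g);
}

// Compile-time checks of the mapping described above.
static_assert(lane_bit(0)  == 0x00000001u, "bit 0 names lane 0");
static_assert(lane_bit(31) == 0x80000000u, "bit 31 names lane 31");
static_assert(group_mask(0) == 0x000000ffu, "0xff names lanes 0..7 only");
static_assert(group_mask(3) == 0xff000000u, "lanes 24..31 live in the top byte");
static_assert((group_mask(0) | group_mask(1) | group_mask(2) | group_mask(3)) == 0xffffffffu,
              "the four groups together cover the whole warp");
```

Read this way, 0xff is simply group_mask(0): it names lanes 0 through 7 of the warp and nothing else.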

This confusion is compounded by the fact that you are specifying only 8 bits, i.e. 8 lanes, in the mask:

uint32_t mask = 0xff;

If you want the warp to be able to use all 32 lanes properly to process all 4 objects, you must specify a 32-bit mask:

uint32_t mask = 0xffffffff;

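There is no "adaptation" by which an 8-bit mask gets applied to each group of 8 lanes in the warp. You must explicitly specify the mask for each of the 32 lanes. This is true even when using the width parameter (see below).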
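One way to see why 0xff cannot cover the whole warp: as far as the CUDA programming guide describes the *_sync shuffle intrinsics, each calling thread must have its own bit set in the mask it passes. The sketch below counts how many of the 32 lanes meet that condition for a given mask. The helpers names_lane and lanes_named are hypothetical names introduced for illustration, not CUDA APIs.

```cuda
#include <cstdint>

// Hypothetical helper: true if `lane` (0..31) has its own bit set in `mask`.
__host__ __device__ constexpr bool names_lane(uint32_t mask, unsigned lane)
{
    return ((mask >> lane) & 1u) != 0u;
}

// Hypothetical helper: how many lanes of a 32-lane warp are named by `mask`.
__host__ __device__ inline int lanes_named(uint32_t mask)
{
    int n = 0;
    for (unsigned lane = 0; lane < 32; ++lane) {
        if (names_lane(mask, lane)) {
            ++n;
        }
    }
    return n;
}
```

lanes_named(0xff) is 8 and lanes_named(0xffffffff) is 32, so with 0xff the lanes 8 through 31 would be calling the shuffle with a mask that does not include themselves.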

If you want the shuffle operation to work only within each 8-lane group (giving 4 logical shuffles), that is what the width parameter is for:

T __shfl_up_sync(unsigned mask, T var, unsigned int delta, int width=warpSize);
                                                               ^^^^^
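A minimal sketch of the width parameter in use, assuming a single warp of 32 threads and one int per lane. The kernel name shfl_up_in_groups and the host wrapper run_one_warp are hypothetical and are not taken from libSGM.

```cuda
#include <cuda_runtime.h>

// Every lane shuffles inside its own 8-lane segment.
// The mask is the full 32-bit mask because all 32 lanes take part in the call.
__global__ void shfl_up_in_groups(const int *in, int *out)
{
    const unsigned full_mask = 0xffffffffu;
    const int width = 8;  // 4 logical shuffles per warp
    const int val = in[threadIdx.x];
    // Lanes whose index within the segment is below delta have no source lane
    // in that segment, so they get their own val back.
    out[threadIdx.x] = __shfl_up_sync(full_mask, val, 1, width);
}

// Hypothetical host wrapper: runs one warp on 32 inputs, copies back 32 results.
// Returns true if every CUDA call succeeded.
bool run_one_warp(const int *h_in, int *h_out)
{
    const size_t bytes = 32 * sizeof(int);
    int *d_in = nullptr;
    int *d_out = nullptr;
    if (cudaMalloc(&d_in, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&d_out, bytes) != cudaSuccess) { cudaFree(d_in); return false; }

    bool ok = cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        shfl_up_in_groups<<<1, 32>>>(d_in, d_out);
        ok = cudaGetLastError() == cudaSuccess;
    }
    // The blocking copy on the default stream also waits for the kernel to finish.
    ok = ok && cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;

    cudaFree(d_in);
    cudaFree(d_out);
    return ok;
}
```

With h_in[i] = i, lane i should receive i - 1, except lanes 0, 8, 16 and 24, which sit at the bottom of their segments and keep their own value.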
