在一维网格中计算 warp id / lane id 的最有效方法是什么？

Question

在CUDA中，每个线程都知道它在网格中的块索引和块内的线程索引。但是两个重要的值似乎并没有明确提供给它：

它的索引作为其扭曲内的一条通道（它的 "lane id"）
作为块内通道的 warp 的索引（它的 "warp id"）

假设网格是一维的（a.k.a。线性的，即blockDim.y和blockDim.z为1），显然可以得到如下：

enum : unsigned { warp_size = 32 };
auto lane_id = threadIdx.x % warp_size;
auto warp_id = threadIdx.x / warp_size;

如果您不相信编译器会对其进行优化，您可以将其重写为：

enum : unsigned { warp_size = 32, log_warp_size = 5 };
auto lane_id = threadIdx.x & (warp_size - 1);
auto warp_id = threadIdx.x >> log_warp_size;

这是最有效的做法吗？每个线程都必须计算这个似乎仍然很浪费。

_{（灵感来自。）}

Answer 1

朴素计算是目前最有效的。

注意：此答案经过大量编辑。

尝试完全避免计算是非常诱人的 - 因为如果您深入了解，这两个值似乎已经可用。

你看，nVIDIA GPU 有特殊的寄存器，你的（编译）代码可以读取这些寄存器来访问各种有用的信息。一个这样的寄存器保存 threadIdx.x；另一个持有 blockDim.x；另一个 - 时钟滴答计数；等等。显然，C++ 作为一种语言并没有公开这些内容；事实上，CUDA 也没有。然而，CUDA 代码编译成的中间表示，命名为 PTX, does expose these special registers（自 PTX 1.3 起，即 CUDA 版本 >= 2.1）。

其中两个特殊寄存器是 %warpid 和 %laneid。现在，CUDA 支持使用 asm 关键字在 CUDA 代码中内联 PTX 代码——就像它可以用于主机端代码直接发出 CPU 汇编指令一样。通过这种机制，可以使用这些特殊寄存器：

__forceinline__ __device__ unsigned lane_id()
{
    unsigned ret; 
    asm volatile ("mov.u32 %0, %laneid;" : "=r"(ret));
    return ret;
}

__forceinline__ __device__ unsigned warp_id()
{
    // this is not equal to threadIdx.x / 32
    unsigned ret; 
    asm volatile ("mov.u32 %0, %warpid;" : "=r"(ret));
    return ret;
}

...但是这里有两个问题

第一个问题 - 正如@Patwie 所建议的那样 - 是 %warp_id 没有给你你真正想要的 - 它不是网格上下文中的扭曲索引，而是在物理 SM（一次可以容纳如此多的经线），而这两者并不相同。所以不要使用%warp_id.

至于 %lane_id，它确实为您提供了正确的值，但几乎肯定会损害您的性能：即使它是一个“寄存器”，它也不像寄存器文件中的常规寄存器， 1 个周期的访问延迟。它是一个特殊的寄存器，在实际硬件中是retrieved using an S2R instruction，可以表现出较长的延迟。由于您几乎可以肯定已经在寄存器中拥有 threadIdx.x 的值，因此对该值应用位掩码比检索 %lane_id.

更快
底线：只需根据线程 ID 计算 warp ID 和 lane ID。我们暂时无法解决这个问题。

Answer 2

另一个答案是非常危险！自己计算 lane-id 和 warp-id。

#include <cuda.h>
#include <iostream>

inline __device__ unsigned get_lane_id() {
  unsigned ret;
  asm volatile("mov.u32 %0, %laneid;" : "=r"(ret));
  return ret;
}

inline __device__ unsigned get_warp_id() {
  unsigned ret;
  asm volatile("mov.u32 %0, %warpid;" : "=r"(ret));
  return ret;
}

__global__ void kernel() {
  const int actual_warpid = get_warp_id();
  const int actual_laneid = get_lane_id();
  const int expected_warpid = threadIdx.x / 32;
  const int expected_laneid = threadIdx.x % 32;
  if (expected_laneid == 0) {
    printf("[warp:] actual: %i  expected: %i\n", actual_warpid,
           expected_warpid);
    printf("[lane:] actual: %i  expected: %i\n", actual_laneid,
           expected_laneid);
  }
}

int main(int argc, char const *argv[]) {
  dim3 grid(8, 7, 1);
  dim3 block(4 * 32, 1);

  kernel<<<grid, block>>>();
  cudaDeviceSynchronize();
  return 0;
}

给出类似

的东西

[warp:] actual: 4  expected: 3
[warp:] actual: 10  expected: 0
[warp:] actual: 1  expected: 1
[warp:] actual: 12  expected: 1
[warp:] actual: 4  expected: 3
[warp:] actual: 0  expected: 0
[warp:] actual: 13  expected: 2
[warp:] actual: 12  expected: 1
[warp:] actual: 6  expected: 1
[warp:] actual: 6  expected: 1
[warp:] actual: 13  expected: 2
[warp:] actual: 10  expected: 0
[warp:] actual: 1  expected: 1
...
[lane:] actual: 0  expected: 0
[lane:] actual: 0  expected: 0
[lane:] actual: 0  expected: 0
[lane:] actual: 0  expected: 0
[lane:] actual: 0  expected: 0
[lane:] actual: 0  expected: 0
[lane:] actual: 0  expected: 0
[lane:] actual: 0  expected: 0
[lane:] actual: 0  expected: 0
[lane:] actual: 0  expected: 0
[lane:] actual: 0  expected: 0

另请参阅 PTX 文档

A predefined, read-only special register that returns the thread's warp identifier. The warp identifier provides a unique warp number within a CTA but not across CTAs within a grid. The warp identifier will be the same for all threads within a single warp.

Note that %warpid is volatile and returns the location of a thread at the moment when read, but its value may change during execution, e.g., due to rescheduling of threads following preemption.

因此，它是调度程序的warp-id，但不保证它与虚拟warp-id匹配（从0开始计数）。

The docs makes this clear:

For this reason, %ctaid and %tid should be used to compute a virtual warp index if such a value is needed in kernel code; %warpid is intended mainly to enable profiling and diagnostic code to sample and log information such as work place mapping and load distribution.

如果你认为，好吧，让我们为此使用 CUB：这甚至会影响 cub::WarpId()

Returns the warp ID of the calling thread. Warp ID is guaranteed to be unique among warps, but may not correspond to a zero-based ranking within the thread block.

编辑：使用 %laneid 似乎是安全的。

在一维网格中计算 warp id / lane id 的最有效方法是什么？

What's the most efficient way to calculate the warp id / lane id in a 1-D grid?

optimization

cuda

ptx

朴素计算是目前最有效的。