限制 CUDA 中的寄存器使用：__launch_bounds__ vs maxrregcount

Question

Register usage can be controlled using the maxrregcount compiler option or launch bounds as described in Launch Bounds.

根据我的理解（如果我错了请纠正我），虽然 -maxrregcount 限制了整个 .cu 文件可能使用的寄存器数量，但 __launch_bounds__ 限定符定义了每个 __global__ 内核的 maxThreadsPerBlock 和 minBlocksPerMultiprocessor。这两个完成相同的任务，但是以两种不同的方式。

我的用法要求每个线程有 40 个寄存器以最大化性能。因此，我可以使用 -maxrregcount 40。我也可以使用 __launch_bounds__(256, 6) 强制 40 寄存器，但这会导致加载和存储寄存器溢出。

导致这些寄存器溢出的两者有什么区别？

Answer 1

这个问题的前言是，引用CUDA C Programming Guide，

the fewer registers a kernel uses, the more threads and thread blocks are likely to reside on a multiprocessor, which can improve performance.

现在，__launch_bounds__ 和 maxregcount 通过两种不同的机制限制寄存器的使用。

__launch_bounds__

nvcc 通过平衡内核启动设置的性能和通用性来决定 __global__ 函数使用的寄存器数量。换句话说，对于每个块的不同线程数和每个多处理器的块数，这样选择使用的寄存器数量"guarantees effectiveness"。但是，如果在编译时可以获得每个块的最大线程数和（可能）每个多处理器的最小块数的近似概念，则可以使用此信息来优化此类启动的内核。也就是说

#define MAX_THREADS_PER_BLOCK 256
#define MIN_BLOCKS_PER_MP     2

__global__ void
__launch_bounds__(MAX_THREADS_PER_BLOCK, MIN_BLOCKS_PER_MP)
fooKernel(int *inArr, int *outArr)
{
    // ... Computation of kernel
}

通知编译器可能的启动配置，以便 nvcc 可以 select 以 "optimal" 方式注册此类启动配置的数量。

MAX_THREADS_PER_BLOCK参数是必须的，而MIN_BLOCKS_PER_MP参数是可选的。另请注意，如果内核启动时每个块的线程数大于 MAX_THREADS_PER_BLOCK，内核启动将失败。

限制机制在Programming Guide中描述如下：

If launch bounds are specified, the compiler first derives from them the upper limit L on the number of registers the kernel should use to ensure that minBlocksPerMultiprocessor blocks (or a single block if minBlocksPerMultiprocessor is not specified) of maxThreadsPerBlock threads can reside on the multiprocessor. The compiler then optimizes register usage in the following way:

If the initial register usage is higher than L, the compiler reduces it further until it becomes less or equal to L, usually at the expense of more local memory usage and/or higher number of instructions;

因此，__launch_bounds__ 会导致寄存器溢出。

maxrregcount

maxrregcount 是一个编译器标志，它通过强制编译器重新安排其对寄存器的使用，将使用的寄存器的数量硬限制为用户设置的数字，与 __launch_bounds__ 不同。当编译器不能保持在强加的限制之下时，它会简单地将它溢出到本地内存，实际上是 DRAM。即使这个局部变量存储在全局 DRAM 内存变量可以缓存在 L1, L2.

限制 CUDA 中的寄存器使用：__launch_bounds__ vs maxrregcount

Limiting register usage in CUDA: __launch_bounds__ vs maxrregcount

cuda

gpu

nvidia

nvcc