Interpretation of "too many resources for launch"

Consider the following Python code:

from numpy import float64
from pycuda import compiler, gpuarray
import pycuda.autoinit

# N > 960 is crucial!
N = 961
code = """
__global__ void kern(double *v)
{
    double a = v[0]*v[2];
    double lmax = fmax(0.0, a), lmin = fmax(0.0, -a);
    double smax = sqrt(lmax),   smin = sqrt(lmin);

    if(smax > 0.2) {
        smax = fmin(smax, 0.2)/smax ;
        smin = (smin > 0.0) ? fmin(smin, 0.2)/smin : 0.0;
        smin = lmin + smin*a;

        v[0] = v[0]*smin + smax*lmax;
        v[2] = v[2]*smin + smax*lmax;
    }
}
"""
kernel_func = compiler.SourceModule(code).get_function("kern")
kernel_func(gpuarray.zeros(3, float64), block=(N,1,1))

Running this gives:

Traceback (most recent call last):
  File "test.py", line 25, in <module>
    kernel_func(gpuarray.zeros(3, float64), block=(N,1,1))
  File "/usr/lib/python3.5/site-packages/pycuda/driver.py", line 402, in function_call
    func._launch_kernel(grid, block, arg_buf, shared, None)
pycuda._driver.LaunchError: cuLaunchKernel failed: too many resources requested for launch

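My setup: Python v3.5.2 with pycuda==2016.1.2 and numpy==1.11.1 on Ubuntu 16.04.1 (64-bit), kernel 4.4.0, nvcc V7.5.17. The graphics card is an Nvidia GeForce GTX 480.

Can you reproduce this on your machine? Do you know what causes this error message?

Remark: I know that in principle there is a race condition, since all threads try to change v[0] and v[2]. But the threads should never reach the inside of the if block anyway! Moreover, I can reproduce the error without a race condition, but the example is much more complicated.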

You are almost certainly hitting the registers-per-block limit.

Reading the relevant documentation, your device has a limit of 32k 32-bit registers per block. When the block size is larger than 960 threads (30 warps), your kernel launch requires too many registers and the launch fails. NVIDIA supplies an Excel spreadsheet and advice on how to determine the per-thread register requirements of a kernel and the limiting block sizes you can use to launch that kernel on your device.
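To make the arithmetic concrete, here is a minimal sketch of the register budget check in plain Python. It is a simplified model, not what the driver actually does: it assumes registers are handed out per whole warp with no further allocation granularity. The helper names and the figure of 34 registers per thread are assumptions for illustration only; 34 was picked because it reproduces the 960-thread threshold under this model. The real per-thread count for your kernel should be read from the compiler (for example nvcc's `--ptxas-options=-v` output, or PyCUDA's `kernel_func.num_regs` attribute).

```python
WARP_SIZE = 32
REGS_PER_BLOCK = 32 * 1024  # 32k 32-bit registers per block, as stated above


def registers_needed(block_size, regs_per_thread, warp_size=WARP_SIZE):
    """Registers a block needs if they are allocated for whole warps."""
    warps = -(-block_size // warp_size)  # ceiling division
    return warps * warp_size * regs_per_thread


def block_fits(block_size, regs_per_thread, regs_per_block=REGS_PER_BLOCK):
    """True if the block stays within the per-block register limit."""
    return registers_needed(block_size, regs_per_thread) <= regs_per_block


def max_block_size(regs_per_thread, regs_per_block=REGS_PER_BLOCK,
                   warp_size=WARP_SIZE):
    """Largest block size (a whole number of warps) that fits the limit."""
    max_threads = regs_per_block // regs_per_thread
    return (max_threads // warp_size) * warp_size


# Hypothetical register count, NOT measured from the kernel in the question.
ASSUMED_REGS_PER_THREAD = 34

print(max_block_size(ASSUMED_REGS_PER_THREAD))
print(block_fits(960, ASSUMED_REGS_PER_THREAD))
print(block_fits(961, ASSUMED_REGS_PER_THREAD))
```

Under these assumptions a block of 960 threads fits and a block of 961 threads does not, because the extra thread forces a 31st warp to be allocated. If you need larger blocks, one option is to ask nvcc to cap register usage with `--maxrregcount`, at the possible cost of spilling to local memory.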