Is it possible to change the order in which CUDA thread blocks are scheduled when compiled with --device-debug?

Question

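Short version

I have a kernel that launches a lot of blocks, and I know that an illegal memory read happens at blockIdx.y = 312. Running it under cuda-gdb results in the blocks being executed sequentially, 16 at a time, and it takes forever to reach this block index, even with a conditional breakpoint.

Is there any way to change the order in which thread blocks are scheduled when running under cuda-gdb? If not, is there any other debugging strategy that I might have missed?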

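Longer version

I have a baseline convolution CUDA kernel that scales with the problem size by launching more blocks. There is a bug for input images with dimensions of 10_000 x 10_000. Running it under cuda-memcheck, I see the following: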

...
========= Invalid __global__ read of size 4
=========     at 0x00000150 in convolution_kernel_sharedmem(float*, float*, float*)
=========     by thread (30,31,0) in block (0,312,0)
...

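All the illegal accesses appear to happen at blockIdx.y = 312. When running under cuda-gdb, the blocks are launched 16 at a time, starting from (0, 0, 0). I have set a conditional breakpoint in the kernel to stop at the desired block index, but it takes a really long time to get there.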

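Is there any way to change the order in which thread blocks are scheduled on the device? If not, is there any alternative debugging strategy that I might have missed?

P.S.: I know that I could use a grid-stride loop instead of launching this many blocks, but I would like to know what is wrong with this particular implementation.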

Answer 1

Is there any way to change the order in which thread blocks are scheduled when running under cuda-gdb?

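There is no way to change the thread-block scheduling order unless you are willing to rewrite the code and take control of the scheduling yourself. Note that the example linked in the original answer is not exactly a recipe for redefining the thread-block scheduling order, but it has all the necessary ingredients. In practice I don't see many people wanting to do this level of refactoring, but I mention it for completeness.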
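For illustration, taking control of the block order yourself could look roughly like the sketch below. This is not the example the answer linked to; it is a minimal sketch under stated assumptions. The names `next_ticket`, `remapped_kernel`, and `order` are hypothetical, the 1 x 400 grid is a stand-in for the real launch configuration, and the real kernel body would have to use the remapped index `b` everywhere it currently reads `blockIdx`. Each block takes a ticket from a global atomic counter when it actually starts running, and the ticket is mapped to a logical block index so that the first block to start does the work of block (0,312,0).

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical global ticket counter; it must be reset to 0 before every launch.
__device__ unsigned int next_ticket = 0;

// Hypothetical kernel: each block derives a logical block index from the
// order in which it actually starts, not from the hardware blockIdx.
__global__ void remapped_kernel(unsigned int first_linear_block, unsigned int *order)
{
    __shared__ unsigned int ticket;
    const bool leader = (threadIdx.x == 0 && threadIdx.y == 0 && threadIdx.z == 0);

    // One thread per block takes the next ticket; ticket 0 goes to whichever
    // block the hardware happens to start first.
    if (leader)
        ticket = atomicAdd(&next_ticket, 1u);
    __syncthreads();

    // Rotate the linear block numbering so that ticket 0 maps to the block of interest.
    const unsigned int total  = gridDim.x * gridDim.y * gridDim.z;
    const unsigned int linear = (ticket + first_linear_block) % total;

    // Logical replacement for blockIdx.
    uint3 b;
    b.x = linear % gridDim.x;
    b.y = (linear / gridDim.x) % gridDim.y;
    b.z = linear / (gridDim.x * gridDim.y);

    // The real kernel body would go here, using b.x, b.y, b.z instead of blockIdx.
    // For this sketch we only record which logical row each ticket handled.
    if (leader)
        order[ticket] = b.y;
}

int main()
{
    const dim3 grid(1, 400, 1);    // assumed stand-in for the real grid
    const dim3 block(32, 32, 1);   // assumed block shape
    const unsigned int total = grid.x * grid.y * grid.z;

    unsigned int *d_order = nullptr;
    cudaMalloc(&d_order, total * sizeof(unsigned int));

    // Reset the ticket counter before the launch.
    const unsigned int zero = 0;
    cudaMemcpyToSymbol(next_ticket, &zero, sizeof(zero));

    // Linear index of logical block (0, 312, 0).
    const unsigned int first = 312 * grid.x + 0;
    remapped_kernel<<<grid, block>>>(first, d_order);

    const cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    std::vector<unsigned int> order(total);
    cudaMemcpy(order.data(), d_order, total * sizeof(unsigned int), cudaMemcpyDeviceToHost);
    printf("first block to start handled logical blockIdx.y = %u\n", order[0]);

    cudaFree(d_order);
    return 0;
}
```

With a remapping like this, a conditional breakpoint would test the logical index (`b.y == 312` here) rather than `blockIdx.y`, and it should be reached within the first wave of blocks. The cost is that every use of `blockIdx` in the kernel has to be replaced, which is the refactoring effort the answer warns about.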

If not, is there any other debugging strategy that I might have missed?

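The method described here (linked in the original answer) can localize your error to a specific line of kernel code. From there you can use, for example, a conditioned printf to identify an illegal index calculation and the like. Note that this method does not require compiling your code with debug switches, but it does require compiling with -lineinfo.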
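A conditioned printf of this kind might look like the sketch below. It is only an illustration under assumptions: `probe_kernel` is a hypothetical stand-in and not the asker's `convolution_kernel_sharedmem`, the 32 x 32 block shape is assumed (it is consistent with the reported thread (30,31,0)), and the image width is shrunk to 64 to keep the allocation small. The print is restricted to the block and thread that the memory checker reported, so the output stays readable even with a very large grid.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in kernel: copies an image and reports the index
// computation for the one thread that the memory checker flagged.
__global__ void probe_kernel(const float *in, float *out, int width, int height)
{
    const int col = blockIdx.x * blockDim.x + threadIdx.x;
    const int row = blockIdx.y * blockDim.y + threadIdx.y;
    const long long idx = (long long)row * width + col;

    // Conditioned printf: only block (0,312,0), thread (30,31,0) prints.
    if (blockIdx.x == 0 && blockIdx.y == 312 &&
        threadIdx.x == 30 && threadIdx.y == 31) {
        printf("block (%d,%d) thread (%d,%d): row=%d col=%d idx=%lld width=%d height=%d\n",
               (int)blockIdx.x, (int)blockIdx.y, (int)threadIdx.x, (int)threadIdx.y,
               row, col, idx, width, height);
    }

    // Guard the access so that this probe itself stays in bounds.
    if (col < width && row < height)
        out[idx] = in[idx];
}

int main()
{
    const int width = 64;       // assumed; shrunk to keep the example small
    const int height = 10000;   // matches the image height in the question
    const dim3 block(32, 32);   // assumed block shape
    const dim3 grid((width + block.x - 1) / block.x,
                    (height + block.y - 1) / block.y);

    const size_t bytes = (size_t)width * height * sizeof(float);
    float *in = nullptr, *out = nullptr;
    cudaMalloc(&in, bytes);
    cudaMalloc(&out, bytes);
    cudaMemset(in, 0, bytes);

    probe_kernel<<<grid, block>>>(in, out, width, height);

    const cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess)
        printf("CUDA error: %s\n", cudaGetErrorString(err));

    cudaFree(in);
    cudaFree(out);
    return err == cudaSuccess ? 0 : 1;
}
```

Building with something like `nvcc -lineinfo probe.cu -o probe` (the file name is arbitrary) and running the binary under cuda-memcheck lets the tool report source lines for invalid accesses without a full debug build. Comparing the printed `row` and `col` against `width` and `height` then shows whether the flagged thread computes an index outside the image.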

This training topic (linked in the original answer) provides a longer treatment of CUDA debugging.

cuda

gpu

gpgpu

cuda-gdb
