CUDA invalid resource handle error when allocating GPU memory buffer

Question

I get a CUDA invalid resource handle error when allocating buffers on the GPU.

1. I downloaded the code with git clone https://github.com/Funatiq/gossip.git.

2. I built the project in the gossip directory: git submodule update --init && make. This produced the compiled binary.

3. I then generated a scatter plan and a gather plan for my main GPU, which here is GPU 0:

$ python3 scripts/plan_from_topology_asynch.py gather 0

$ python3 scripts/plan_from_topology_asynch.py scatter 0

This generates scatter_plan.json and gather_plan.json.

4. Finally, I executed the plans: ./execute scatter_gather scatter_plan.json gather_plan.json

The error points to these lines of code:

    std::vector<size_t> bufs_lens_scatter = scatter.calcBufferLengths(table[main_gpu]);
    print_buffer_sizes(bufs_lens_scatter);

    std::vector<data_t *> bufs(num_gpus);
    std::vector<size_t> bufs_lens(bufs_lens_scatter);
    TIMERSTART(malloc_buffers)
    for (gpu_id_t gpu = 0; gpu < num_gpus; ++gpu) {
        cudaSetDevice(context.get_device_id(gpu)); CUERR
        cudaMalloc(&bufs[gpu], sizeof(data_t)*bufs_lens[gpu]); CUERR
    }
    TIMERSTOP(malloc_buffers)

The detailed error output is:

RUN: scatter_gather
INFO: 32768 bytes (scatter_gather)
TIMING: 0.463872 ms (malloc_devices)
TIMING: 0.232448 ms (zero_gpu_buffers)
TIMING: 0.082944 ms (init_data)
TIMING: 0.637952 ms (multisplit)

Partition Table:
470 489 534 553 514 515 538 483
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0

Required buffer sizes:
0 538 717 604 0 344 0 687

TIMING: 3.94455e-31 ms (malloc_buffers)
CUDA error: invalid resource handle : executor.cuh, line 405

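For reference, I attached the full error report. Strangely, the author cannot reproduce these errors on his server, but the error appears when I run the code on a DGX workstation with 8 GPUs. I wonder whether this is a CUDA programming error or an environment-specific problem.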

Answer 1

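The code is flawed in how it handles cudaEventRecord(), which is called by the TIMERSTART and TIMERSTOP macros that the failing code uses with the malloc_buffers label.

CUDA events have a device association that is defined implicitly when they are created: an event is associated with the device selected by the most recent cudaSetDevice() call. As stated in the programming guide: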

cudaEventRecord() will fail if the input event and input stream are associated to different devices.

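(Note that each device has its own null stream; these events are recorded into the null stream.)

If we run the code with cuda-memcheck, we observe that the invalid resource handle error is indeed returned by a call to cudaEventRecord().

Specifically, consider this code: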

...
std::vector<size_t> bufs_lens(bufs_lens_scatter);
TIMERSTART(malloc_buffers)
for (gpu_id_t gpu = 0; gpu < num_gpus; ++gpu) {
    cudaSetDevice(context.get_device_id(gpu)); CUERR
    cudaMalloc(&bufs[gpu], sizeof(data_t)*bufs_lens[gpu]); CUERR
}
TIMERSTOP(malloc_buffers)

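The TIMERSTART macro defines and creates two CUDA events and immediately records one of them (the start event). The TIMERSTOP macro uses the stop event that was created by TIMERSTART. However, the intervening code may have changed the current device from the one that was in effect when the two events were created (because of the cudaSetDevice calls in the for-loop). As a result, the cudaEventRecord (and cudaEventElapsedTime) calls fail because of this invalid usage.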

As a proof point, when I add cudaSetDevice calls to the macro definitions, as follows:

    #define TIMERSTART(label)                                                  \
        cudaEvent_t timerstart##label, timerstop##label;                       \
        float timerdelta##label;                                               \
        cudaSetDevice(0); /* added: create events on device 0 */              \
        cudaEventCreate(&timerstart##label);                                   \
        cudaEventCreate(&timerstop##label);                                    \
        cudaEventRecord(timerstart##label, 0);
#endif

#ifndef __CUDACC__
    #define TIMERSTOP(label)                                                   \
        timerstop##label = std::chrono::system_clock::now();                   \
        std::chrono::duration<double>                                          \
            timerdelta##label = timerstop##label-timerstart##label;            \
        std::cout << "# elapsed time ("<< #label <<"): "                       \
                  << timerdelta##label.count()  << "s" << std::endl;
#else
    #define TIMERSTOP(label)                                                   \
            cudaSetDevice(0); /* added: switch back to device 0 */            \
            cudaEventRecord(timerstop##label, 0);                              \
            cudaEventSynchronize(timerstop##label);                            \
            cudaEventElapsedTime(                                              \
                &timerdelta##label,                                            \
                timerstart##label,                                             \
                timerstop##label);                                             \
            std::cout <<                                                       \
                "TIMING: " <<                                                  \
                timerdelta##label << " ms (" <<                                \
                #label <<                                                      \
                ")" << std::endl;
#endif

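the code runs without errors for me. I am not suggesting this is the correct fix. The correct fix is probably to set the device properly before invoking the macros. Evidently, the macro author either did not anticipate this kind of usage or was unaware of the hazard.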

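The only situation I can think of in which the error would not occur is a single-device system. When the code maintainers responded to your issue by saying they could not reproduce the problem, my guess is that they had not tested the code on a multi-device system. As far as I can tell, the error is unavoidable in a multi-device setup.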
