How to stop/cancel a CUDA kernel launched by Numba before it ends?

Question

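I have a simulation written in Python/Numba that uses several CUDA GPUs. Each GPU is driven from a different process with its own CUDA context. The simulation runs a very long loop and, at the end, reports its result to the parent process, which stores the best result so far; the process then continues.

When a GPU/process finishes its kernel and reports a new best result, I would like to kill the kernel execution on the other processes/GPUs so that they can pick up this new best result and iterate on it, rather than waiting for them to finish. Each execution can take 30 minutes, so if I could kill one that has just started and run it again with better data, that would save me a lot of time.

I can't seem to find a way to stop a CUDA kernel once it has been launched.

Can it be done?

I'm using Numba 0.51.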

Answer 1

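It is not possible to stop a running kernel in CUDA without either:

1. assistance from the kernel code itself, or
2. destroying the CUDA context, which will cause any subsequent CUDA operations to fail.

Item 2 is unsatisfactory, so to stop a running kernel "asynchronously", the kernel code (all threads) has to "poll" a location that gives a stop indication.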

The typical way to provide a memory location for this is to use pinned/zero-copy techniques in CUDA. In numba, this type of memory is allocated as mapped memory. Such memory is accessible from both host and device at the same time. An additional wrinkle is that the device code must not cache any copies of the memory locations used for communication. The only method I found in numba to accomplish this is to use atomics.

Here is a worked example combining these ideas:

$ cat t51.py
import numpy as np
import numba as nb

from numba import cuda

@cuda.jit
def test(arr):
    while nb.cuda.atomic.max(arr, 0, 0) < 1: #poll for signal to stop
        nb.cuda.atomic.add(arr, 1, 1)        #do "other work"
    arr[2] = 1                               #acknowledge stop signal

if __name__ == '__main__':

    arr = nb.cuda.mapped_array(3, dtype=np.int32)
    arr[0] = 0   # stop signal goes here
    arr[1] = 1   # monitoring "other work"
    arr[2] = 0   # acknowledgment of stop signal
    my_str = nb.cuda.stream()
    griddim = (1,1)
    blockdim = (1,1,1)
    test[griddim, blockdim, my_str](arr)   # launch work to be done
    for i in range(1000):  # for demo, give kernel time to start
        if arr[1] < 2:
            print(arr[1])
    print(arr[0])
    while arr[2] != 1:     # send stop signal, and wait for acknowledgment
        arr[0] = 1
    print(arr[0])          # for demo
    nb.cuda.synchronize()  # if stop is working correctly code will not hang here
    print(arr[0])          # for demo
    print(arr[1])
$ python t51.py
0
1
1
1600
$
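
To connect this back to the multi-process setup in the question, here is a minimal host-side sketch of how a worker could relay a "new best result" notification into the stop flag. It is an illustration under assumptions, not something tested on the asker's setup: `relay_stop`, `fake_kernel` and `demo` are hypothetical names, a plain list stands in for the mapped array, and a CPU thread stands in for the polling kernel so that the sketch runs without a GPU. `threading.Event` is used here; across processes a `multiprocessing.Event` offers the same `set()`/`wait()` methods.

```python
import threading
import time

STOP, WORK, ACK = 0, 1, 2   # same slot layout as arr in the example above

def relay_stop(stop_event, flags, poll_interval=0.001):
    """Wait for stop_event, then raise the stop flag until it is acknowledged."""
    stop_event.wait()                 # blocks until a better result is announced
    while flags[ACK] != 1:            # keep signalling until the kernel acknowledges
        flags[STOP] = 1
        time.sleep(poll_interval)

def fake_kernel(flags):
    """CPU stand-in (assumption) for the polling kernel shown above."""
    while flags[STOP] < 1:            # poll for the stop signal
        flags[WORK] += 1              # do "other work"
    flags[ACK] = 1                    # acknowledge the stop signal

def demo():
    flags = [0, 0, 0]                 # stand-in for the mapped array
    stop_event = threading.Event()    # would be shared with the parent process
    kernel = threading.Thread(target=fake_kernel, args=(flags,))
    kernel.start()
    while flags[WORK] == 0:           # let the stand-in kernel make some progress
        time.sleep(0.001)
    stop_event.set()                  # the "parent" announces a new best result
    relay_stop(stop_event, flags)
    kernel.join()
    return flags

if __name__ == "__main__":
    print(demo())
```

In a real worker, `flags` would be the mapped array passed to the kernel, and a kernel that finishes on its own would also need to be handled, since `relay_stop` as written waits for an acknowledgment.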

Tags: cuda, gpgpu, multiprocessing, numba, python-multiprocessing