Returning to host code in pyCUDA after asynchronous kernel launch

Question

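I am trying to launch a kernel in pyCUDA and then terminate it by writing to a location in GPU global memory. Here is a simple example kernel that I would like to be able to terminate at some point after it enters an infinite while loop: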

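typedef unsigned short u16;      // matches np.uint16 on the host
typedef unsigned long long u64;  // matches np.uint64 on the host
// extern "C" keeps the name unmangled so get_function("countUp") can find it;
// volatile forces inShot[0] to be re-read from global memory on every iteration
extern "C" __global__ void countUp(volatile u16 *inShot, u64 *counter) {
  while (inShot[0] == 0) {  // spin until the host writes a nonzero flag
    counter[0]++;
  }
}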

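From what I have read about streams in CUDA, I should be able to launch this kernel after creating a stream, and the launch will be non-blocking on the host, i.e. I should be able to do other things on the host while the kernel is running. I compiled the kernel above into a cubin file and launch it in pyCUDA like this: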

import numpy as np
from pycuda import driver, gpuarray
# -- initialize the device (creates a context on the default device)
import pycuda.autoinit

strm1 = driver.Stream()

# Flag polled by the kernel, uploaded before the launch
h_inShot = np.zeros((1, 1), dtype=np.uint16)
d_inShot = gpuarray.to_gpu_async(h_inShot, stream=strm1)
# Value to write into d_inShot later, while the kernel is running
h_inShot = np.ones((1, 1), dtype=np.uint16)
h_counter = np.zeros((1, 1), dtype=np.uint64)
d_counter = gpuarray.to_gpu_async(h_counter, stream=strm1)

testCubin = "testKernel.cubin"
mod = driver.module_from_file(testCubin)
countUp = mod.get_function("countUp")

# Kernel launches are asynchronous with respect to the host
countUp(d_inShot, d_counter,
        grid=(1, 1, 1),
        block=(1, 1, 1),
        stream=strm1)

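For obvious reasons, running this script puts the kernel into an infinite loop. When I launch the script from an IPython session, control never seems to return to the host after the kernel launch (I can't type new commands, presumably because it is waiting for the kernel to finish). I want control to return to the host so that I can change the value in GPU global memory pointed to by d_inShot and make the kernel exit the while loop. Is this even possible? If so, how do I do it in pyCUDA? Thanks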

Answer 1

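I figured it out, so I'm posting my solution. Although an asynchronous memcpy is non-blocking, I found that issuing it on the same stream as the running kernel does not work: work in a single stream executes in order, so a copy queued on strm1 cannot start until the kernel on strm1 finishes, which never happens. My solution was to create another stream: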

strm2 = driver.Stream()

and then change d_inShot like this:

d_inShot.set_async(h_inShot.astype(np.uint16), stream = strm2)

This worked for me.

Tags: python, cuda, ipython, python-2.7, pycuda