PyCUDA: Best way of calling a kernel multiple times

Question

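I am writing a relativistic ray tracer with PyCUDA. Basically, for each "pixel" in a large 2D array, we have to solve a system of 6 ODEs using Runge-Kutta. Since each integration is independent of the others, this should be very easy. Others have implemented it in C/C++ CUDA with excellent results (see this project).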

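The problem is that I do not know the best way to do it. I am writing a kernel that performs a few Runge-Kutta steps and then returns the results to the CPU. To integrate the whole ray, this kernel is called many times. For some reason this is very slow. I know that memory transfers are a real bottleneck in CUDA, but it is so slow that I am starting to think I am doing something wrong.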

It would be great if you could recommend the best programming practices for this case (using PyCUDA). Some things I am wondering:

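Do I need to create a new context on each kernel call?
Is there a way to avoid having to transfer memory from GPU to CPU, that is, starting a kernel, pausing it to get some information, restarting it and repeating?
Each RK4 iteration takes roughly half a second, which is insane (also compared with the CUDA code in the link that does some similar operation). And I think this is due to something wrong with the way I'm using PyCUDA, so it would be great if you could explain the best way to do such an operation!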

To clarify: the reason I have to pause/restart the kernel is the watchdog. Kernels that run longer than 10 seconds are killed by the watchdog.

Thanks in advance!

Answer 1

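Your main question seems too general, and it is hard to give concrete advice without seeing the code. I will try to answer your sub-questions (this is not really an answer, but it is too long for a comment).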

Do I need to create a new context on each kernel call?

No.

Is there a way to avoid having to transfer memory from GPU to CPU, that is, starting a kernel, pausing it to get some information, restarting it and repeating?

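That depends on what you mean by "get some information". If it means doing something with it on the CPU, then of course you have to transfer it. If you want to use it in another kernel call, you do not need to transfer it.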
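
As a rough illustration of the second case, here is a minimal sketch that uploads the state once, advances it over several kernel launches without copying anything back, and downloads it once at the end. The kernel `advance`, the helper names and the placeholder ODE dy/dt = -y (explicit Euler, not the asker's 6-equation RK4 system) are all assumptions made for the example. Splitting the work into several launches with a bounded number of steps each is one possible way to keep every launch short enough for the watchdog.

```python
# Hypothetical kernel: each thread advances one ray by n_steps steps.
KERNEL_SOURCE = r"""
__global__ void advance(float *state, int n_rays, int n_steps, float h)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n_rays) return;
    float y = state[i];
    // Placeholder ODE dy/dt = -y, advanced with explicit Euler steps.
    for (int s = 0; s < n_steps; ++s)
        y += h * (-y);
    state[i] = y;
}
"""


def launch_schedule(total_steps, steps_per_launch):
    """Split total_steps into per-launch step counts."""
    full, rest = divmod(total_steps, steps_per_launch)
    return [steps_per_launch] * full + ([rest] if rest else [])


def integrate_on_device(initial, total_steps, steps_per_launch, h):
    """Advance `initial` on the GPU and return the final state as a NumPy array."""
    # Imports are deferred so this module can be imported without a GPU.
    import numpy as np
    import pycuda.autoinit  # noqa: F401  (creates one context for the process)
    import pycuda.driver as drv
    from pycuda.compiler import SourceModule

    advance = SourceModule(KERNEL_SOURCE).get_function("advance")

    state = np.ascontiguousarray(initial, dtype=np.float32)
    state_gpu = drv.mem_alloc(state.nbytes)
    drv.memcpy_htod(state_gpu, state)  # single upload

    block = 256
    grid = (state.size + block - 1) // block
    for n_steps in launch_schedule(total_steps, steps_per_launch):
        # state_gpu stays on the device between launches; nothing is copied back.
        advance(state_gpu, np.int32(state.size), np.int32(n_steps),
                np.float32(h), block=(block, 1, 1), grid=(grid, 1))

    drv.memcpy_dtoh(state, state_gpu)  # single download at the end
    return state
```

On a machine with a CUDA device, `integrate_on_device([1.0] * 1024, 1000, 100, 0.001)` would run ten launches of 100 steps each. The context is created once by `pycuda.autoinit` and reused by every launch.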

Each RK4 iteration takes roughly half a second, which is insane (also compared with the CUDA code in the link that does some similar operation).

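That really depends on the equations, the number of threads and the video card you are using. I can imagine situations in which a single RK step takes that long.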

And I think this is due to something wrong with the way I'm using PyCUDA, so it would be great if you could explain the best way to do such an operation!

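It is impossible to say for sure without code. Try to create a minimal demonstration example, or at least post a link to a runnable piece of code (even if it is fairly long) that illustrates your problem. As for PyCUDA, it is a very thin wrapper over CUDA, and all the programming practices that apply to the latter also apply to the former.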

Answer 2

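I may be able to help you with the memory handling, i.e. not having to copy from the CPU to the GPU during the iterations. I am evolving a system in time using Euler time stepping, and the way I keep all the data on my GPU is given below. The problem with this, however, is that once the first kernel is launched, the CPU goes on executing the lines that follow it, i.e. the boundary kernels are launched before the time-evolution step.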

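What I need is a way to synchronize things. I have tried to do it with strm.synchronize() (see my code), but it does not always work. If you have any ideas on this, I would really appreciate your input! Thanks!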

import sys

import numpy as np
import pycuda.autoinit  # assumed here; any existing CUDA context works
import pycuda.driver as drv

# Kern_diffIteration and Kern_boundaryX0/X1/Y0/Y1 are compiled kernels
# that are defined elsewhere (not shown in this answer).


def curveShorten(dist, timestep, maxit):
    """
    Iterates the function image on a 2D grid through an Euler anisotropic
    diffusion operator with timestep=timestep, maxit number of times.
    """
    image = 1*dist  # work on a copy of the input
    forme = image.shape
    if np.size(forme) > 2:
        sys.exit('Only works on gray images')

    aSize = forme[0]*forme[1]
    xdim  = np.int32(forme[0])
    ydim  = np.int32(forme[1])

    # copy the neighbouring row/column into each border
    image[0,:]      = image[1,:]
    image[xdim-1,:] = image[xdim-2,:]
    image[:,ydim-1] = image[:,ydim-2]
    image[:,0]      = image[:,1]

    # NumPy arrays I need to store things on the CPU: image is the initial
    # condition and final is the final state
    image = image.reshape(aSize, order='C').astype(np.float32)
    final = np.zeros(aSize).astype(np.float32)

    # allocating memory on the GPU
    image_gpu = drv.mem_alloc(image.nbytes)
    final_gpu = drv.mem_alloc(final.nbytes)

    # sending data to each memory location
    drv.memcpy_htod(image_gpu, image)  # host to device copying
    drv.memcpy_htod(final_gpu, final)

    # block size: B := dim1*dim2*dim3 = 1024
    # grid size : dim1*dim2*dim3 = ceiling(aSize/B)
    blockX = 1024
    gridX  = (aSize + blockX - 1) // blockX  # integer ceiling division

    strm1 = drv.Stream(1)
    ev1   = drv.Event()
    strm2 = drv.Stream()
    for k in range(0, maxit):

        Kern_diffIteration(image_gpu, final_gpu, ydim, xdim, np.float32(timestep), block=(blockX,1,1), grid=(gridX,1,1), stream=strm1)
        strm1.synchronize()

        # the boundary kernels below are launched without a stream argument
        if strm1.is_done():
            Kern_boundaryX0(final_gpu, ydim, xdim, block=(blockX,1,1), grid=(gridX,1,1))
            Kern_boundaryX1(final_gpu, ydim, xdim, block=(blockX,1,1), grid=(gridX,1,1))#,stream=strm1)
            Kern_boundaryY0(final_gpu, ydim, xdim, block=(blockX,1,1), grid=(gridX,1,1))#,stream=strm2)
            Kern_boundaryY1(final_gpu, ydim, xdim, block=(blockX,1,1), grid=(gridX,1,1))#,stream=strm1)

        # the result of this step becomes the input of the next one
        if strm1.is_done():
            drv.memcpy_dtod(image_gpu, final_gpu, final.nbytes)
        #Kern_copy(image_gpu,final_gpu,ydim,xdim,block=(blockX,1,1), grid=(gridX,1,1),stream=strm1)

    drv.memcpy_dtoh(final, final_gpu)  # device to host copying
    #final_gpu.free()
    #image_gpu.free()

    return final.reshape(forme, order='C')
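
On the synchronization question: kernel launches are asynchronous with respect to the host, but work queued on the same stream runs in the order it was issued. One possible approach, sketched below and not tested against the kernels in this answer, is therefore to pass the same stream to every launch and to wait on the host only once at the end. The function name `evolve_in_one_stream` and its parameters are hypothetical. The buffer swap that replaces the device-to-device copy assumes the diffusion and boundary kernels together write every element of the output buffer.

```python
def evolve_in_one_stream(diff_kernel, boundary_kernels, image_gpu, final_gpu,
                         dims, timestep, block, grid, stream, maxit):
    """Queue maxit diffusion steps on one stream.

    Returns the device buffer that holds the latest state.
    """
    for _ in range(maxit):
        # Everything below goes to the same stream, so the boundary kernels
        # cannot start before the diffusion step of this iteration is done.
        diff_kernel(image_gpu, final_gpu, *dims, timestep,
                    block=block, grid=grid, stream=stream)
        for kernel in boundary_kernels:
            kernel(final_gpu, *dims, block=block, grid=grid, stream=stream)
        # Ping-pong: the buffer just written becomes the next input.
        # (Assumes the kernels overwrite every element of their output.)
        image_gpu, final_gpu = final_gpu, image_gpu
    stream.synchronize()  # the host waits once, after all work is queued
    return image_gpu
```

With the names used in the code above, a call might look like `result_gpu = evolve_in_one_stream(Kern_diffIteration, [Kern_boundaryX0, Kern_boundaryX1, Kern_boundaryY0, Kern_boundaryY1], image_gpu, final_gpu, (ydim, xdim), np.float32(timestep), (blockX,1,1), (gridX,1,1), strm1, maxit)`, followed by `drv.memcpy_dtoh(final, result_gpu)`.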
