What is the correct usage of cuda.local.array in numba?

Question

I wrote some test code in Python using numba.

from numba import cuda
import numpy as np
import numba
@cuda.jit
def function(output, size, random_array):
    i_p, i_k1, i_k2 = cuda.grid(3)
    a=cuda.local.array(shape=1,dtype=numba.float64)
    if i_p<size and i_k1<size and i_k2<size:
        a1=i_p
        a2=i_k1+1
        a3=i_k2+2

        a[0]=a1
        a[1]=a2
        a[2]=a3
        for i in range(len(random_array)):
            output[i_p,i_k1,i_k2,i] = a[int(random_array[i])]
output=cuda.device_array((2,2,2,5))

random_array=np.array([np.random.random()*3 for i in range(5)])
print(random_array)
random_array0=cuda.to_device(random_array)
size=2
threadsperblock = (8, 8, 8)
blockspergridx=(size + (threadsperblock[0] - 1)) // threadsperblock[0]
blockspergrid = ((blockspergridx, blockspergridx, blockspergridx))

# Start the kernel 
function[blockspergrid, threadsperblock](output, size, random_array0)
print(output.copy_to_host())

# test if it is consistent with non gpu case
output=np.zeros([2,2,2,5])
for i in range(size):
    for j in range(size):
        for k in range(size):
            a=[i,j+1,k+2]
            for ii in range(len(random_array)):
                output[i,j,k,ii] = a[int(random_array[ii])]
print(output)

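I am confused about how cuda.local.array is used.

It takes two arguments: shape and dtype.

However, the result does not change when I use different shape settings, for example shape=0, shape=1, or shape=100.

I do not understand what the shape argument does.

Does anyone know how it works?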

Answer 1

Quoting directly from the documentation:

Local memory is an area of memory private to each thread. Using local memory helps allocate some scratchpad area when scalar local variables are not enough. The memory is allocated once for the duration of the kernel, unlike traditional dynamic memory management.
numba.cuda.local.array(shape, type) 
Allocate a local array of the given shape and type on the device. shape is either an integer or a tuple of integers representing the array’s dimensions and must be a simple constant expression. type is a Numba type of the elements needing to be stored in the array. The array is private to the current thread. An array-like object is returned which can be read and written to like any standard array (e.g. through indexing).

So in this case, if you want a local array that holds at least three elements, you must use shape >= 3 for your code to be correct.

The fact that your code appears to work with shape=1 should be regarded as undefined behaviour. If I run your code under cuda-memcheck, I get this:

$ cuda-memcheck python indexing.py 
========= CUDA-MEMCHECK
[ 1.99261914  1.91166157  2.85454532  1.64078385  1.9576766 ]
========= Invalid __local__ write of size 8
=========     at 0x000001b0 in cudapy::__main__::function1(Array<double, int=4, A, mutable, aligned>, __int64, Array<double, int=1, A, mutable, aligned>)
=========     by thread (1,1,1) in block (0,0,0)
=========     Address 0x00fffc80 is out of bounds

[SNIPPED for brevity]


=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so (cuLaunchKernel + 0x2cd) [0x23c06d]
Traceback (most recent call last):
  File "indexing.py", line 42, in <module>
    outputd = output.copy_to_host()
  File "/opt/miniconda3/lib/python3.6/site-packages/numba/cuda/cudadrv/devicearray.py", line 198, in copy_to_host
    _driver.device_to_host(hostary, self, self.alloc_size, stream=stream)
  File "/opt/miniconda3/lib/python3.6/site-packages/numba/cuda/cudadrv/driver.py", line 1481, in device_to_host
    fn(host_pointer(dst), device_pointer(src), size, *varargs)
  File "/opt/miniconda3/lib/python3.6/site-packages/numba/cuda/cudadrv/driver.py", line 259, in safe_cuda_api_call
    self._check_error(fname, retcode)
  File "/opt/miniconda3/lib/python3.6/site-packages/numba/cuda/cudadrv/driver.py", line 296, in _check_error
    raise CudaAPIError(retcode, msg)
numba.cuda.cudadrv.driver.CudaAPIError: [719] Call to cuMemcpyDtoH results in UNKNOWN_CUDA_ERROR
========= ERROR SUMMARY: 9 errors

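That is, as you would expect, running with an incorrectly sized local array produces memory access errors. However, the code does still run. On the other hand, if I modify your code to use shape=3: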

$ cuda-memcheck python indexing.py 
========= CUDA-MEMCHECK
[ 1.98532356  1.53822652  0.69376061  2.22448278  0.76800584]
True
========= ERROR SUMMARY: 0 errors

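The memory access errors disappear. So you should not confuse working correctly with undefined behaviour (which can include appearing to work while still raising errors, as in this case). The exact reason why this happens is buried in the numba runtime and the code its compiler generates. I have no interest in looking at that in enough detail to explain it further.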

python

cuda

numba