Numba CUDA: why is the sum of a 1D array wrong?

Question

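I am practicing Numba and CUDA programming. I tried to sum an array of ones with CUDA, but the sum is not correct. I think something must be needed at the end to synchronize and collect the data properly.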

import math

import numpy
from numba import cuda, float32

@cuda.jit
def my_kernel(const_array, res_array):

    sbuf = cuda.shared.array(512, float32)

    # Thread id in a 1D block
    tx = cuda.threadIdx.x
    # Block id in a 1D grid
    ty = cuda.blockIdx.x
    # Block width, i.e. number of threads per block
    bw = cuda.blockDim.x
    # Compute flattened index inside the array
    pos = tx + ty * bw

    sbuf[tx] = 0

    if pos < const_array.shape[0]:

        sbuf[tx] = const_array[pos] # do the computation

    cuda.syncthreads()
    if cuda.threadIdx.x == 0:
        for i in range(bw):
            res_array[0] += sbuf[i] 


    return


data_size = 10000000
res = numpy.zeros(1, dtype=numpy.float64)
const_array = numpy.ones(data_size, dtype=numpy.int8)

threadsperblock = 512
blockspergrid = math.ceil(data_size / threadsperblock)

my_kernel[blockspergrid, threadsperblock](const_array, res)

print(res)

Every time I run this code I get a different value, for example 28160.0, when of course it should be 10 million.

Any hints?

Answer 1

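First of all, the summation logic does not make much sense and is very inefficient. The problem is that you are writing to a single memory location from different threads in different blocks, which causes a race condition. You should use cuda.atomic.add to avoid the race. You can read more in the CUDA Programming Guide: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#atomic-functions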
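To see why such a race produces a total that is too small, here is a simplified pure-Python model (no GPU involved). It assumes the worst case, where every block reads the old value of the result before any block writes back; the function names `unsafe_total` and `atomic_total` are made up for this illustration.

```python
# Pure-Python model of a lost update; nothing here runs on the GPU.
# Each "block" performs `res += partial` as separate steps: read, add, write.

def unsafe_total(partials):
    """Worst case: all blocks read the old value before any block writes."""
    res = 0.0
    snapshots = [res for _ in partials]      # every block sees res == 0.0
    for seen, partial in zip(snapshots, partials):
        res = seen + partial                 # each write overwrites the last
    return res


def atomic_total(partials):
    """Each read-add-write finishes before the next one starts."""
    res = 0.0
    for partial in partials:
        res += partial
    return res


partials = [512.0, 512.0, 512.0, 512.0]      # four blocks of 512 ones each
print(unsafe_total(partials))                # 512.0
print(atomic_total(partials))                # 2048.0
```

On a real device the interleaving changes from run to run, which is consistent with getting a different wrong value each time.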

Answer 2

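The problem seems to be that you are not combining the sums of all the blocks correctly. You have a vector of 10000000 elements and 512 threads per block, which means the partial results of 19532 blocks have to be summed. In standard CUDA this is done either by launching multiple kernels (mostly on older devices) or by using atomic operations. Specifically, your problem is in this part of the code: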

if pos < const_array.shape[0]:
    sbuf[tx] = const_array[pos] # do the computation
cuda.syncthreads()
if cuda.threadIdx.x == 0:
    for i in range(bw):
        res_array[0] += sbuf[i]

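In the first two lines you copy the data from global memory into the shared-memory array sbuf. After that, however, thread 0 of every block tries to add its block's local data to the same global memory address in res_array at the same time. These updates are not serialized: different threads may read the same old value and then overwrite each other's results, which gives you a wrong total. The solution is to compute a partial sum from shared memory first and then add it with an atomic operation, which avoids unsynchronized read and write operations: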

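if cuda.threadIdx.x == 0:
    # Accumulate this block's partial sum in a local variable first
    block_sum = 0
    for i in range(bw):
        block_sum += sbuf[i]
    # A single atomic update per block to the global result
    cuda.atomic.add(res_array, 0, block_sum)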

This should solve your problem.

Regards.

python

cuda

numba