Pycuda Array Indexing with Threads & Blocks

I'm trying to write a CUDA histogram function for use with PyCUDA. The code seems to iterate over more elements than the size of the array I pass in. To rule out an error in the bin calculation, I created a very simple kernel where I pass in a 2D array and add 1 to the first histogram bucket for every element processed. I always end up with more elements than there are in my 2D array.

The output should be [size_of_2d_array, 0, 0, 0].

I'm running Ubuntu 15.04 and Python 2.7.9. When I try samples written by other people, they seem to work correctly.

What am I doing wrong?

import pycuda.driver as cuda
import pycuda.autoinit
from pycuda.compiler import SourceModule
import numpy as np

#Make the kernel
histogram_kernel = """
__global__ void kernel_getHist(unsigned int* array,unsigned int size, unsigned int lda, unsigned int* histo, float buckets)
{

    unsigned int y = threadIdx.y + blockDim.y * blockIdx.y;
    unsigned int x = threadIdx.x + blockDim.x * blockIdx.x;
    unsigned int tid = y + lda * x;


    if(tid<size){
        //unsigned int value = array[tid];

        //int bin = floor(value * buckets);

        atomicAdd(&histo[0],1);
    }
}
"""
mod = SourceModule(histogram_kernel)


#2d array to analyze
a = np.ndarray(shape=(2, 2))
a[0, 0] = 1
a[0, 1] = 2
a[1, 0] = 3
a[1, 1] = 4


#histogram stuff, passing but not using right now
max_val = 4
num_bins = np.uint32(4)
bin_size = 1 / np.uint32(max_val / num_bins)

#send array to the gpu
a = a.astype(np.float32)
a_gpu = cuda.mem_alloc(a.nbytes)
cuda.memcpy_htod(a_gpu, a)

#send the histogram to the gpu
a_hist = np.ndarray([1,num_bins])
a_hist = a_hist.astype(np.uint32)
a_hist = a_hist * 0
d_hist = cuda.mem_alloc(a_hist.nbytes)
cuda.memcpy_htod(d_hist, a_hist)

#get the function
func = mod.get_function('kernel_getHist')

#get size & lda
a_size = np.uint32(a.size)
a_lda = np.uint32(a.shape[0])

#print size & lda to check
print(a_lda)
print(a_size)

#run function
func(a_gpu, a_size, a_lda,  d_hist, bin_size, block=(16,16,1))

#get histogram back
cuda.memcpy_dtoh(a_hist, d_hist)

#print the histogram
print a_hist
print a

This code outputs the following:

2
4
[[6 0 0 0]]
[[ 1.  2.]
 [ 3.  4.]]

However, it should output:

2
4
[[4 0 0 0]]
[[ 1.  2.]
 [ 3.  4.]]

The histogram has too many elements, which leads me to believe I'm doing something wrong with tid and size.

Any ideas?

The problem here is that you are not computing unique values of tid in the kernel. If you do some simple arithmetic on paper for blockDim.x = blockDim.y = 16 and lda = 2, you should get:

x   y   tid 
0   0   0
1   0   2
0   1   1
1   1   3
0   2   2
0   3   3
..  ..  ..

Note that the last two are duplicate indices. That is why your code returns 6: there are six threads that satisfy tid < size for size = 4.
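The duplicate mapping is easy to verify on the CPU. The following sketch (a hypothetical pure-Python check, not part of the original program) enumerates every thread in a single 16x16 block and reproduces the kernel's tid arithmetic:

```python
# Simulate the index arithmetic of kernel_getHist for one 16x16 block.
block_dim_x = block_dim_y = 16
lda = 2    # a.shape[0] in the question
size = 4   # a.size in the question

# tid = y + lda * x, exactly as written in the kernel
hits = [(x, y) for x in range(block_dim_x)
               for y in range(block_dim_y)
               if y + lda * x < size]

print(len(hits))                              # 6 threads pass the tid < size test
print(sorted(y + lda * x for x, y in hits))   # [0, 1, 2, 2, 3, 3] -- duplicates
```

Six distinct threads map onto only four distinct tid values, so two array elements are counted twice, which matches the observed result of 6 instead of 4.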

You have two options for fixing this. One is to compute a unique index correctly, for example:

unsigned int y = threadIdx.y + blockDim.y * blockIdx.y;
unsigned int x = threadIdx.x + blockDim.x * blockIdx.x;
unsigned int gda = blockDim.y * gridDim.y;
unsigned int tid =  y + gda * x;

should work. Alternatively, apply a bound in each dimension of the input array:

unsigned int y = threadIdx.y + blockDim.y * blockIdx.y;
unsigned int x = threadIdx.x + blockDim.x * blockIdx.x;
unsigned int tid =  y + lda * x;

if ( (x < M) && (y < N) ) {
    atomicAdd(&histo[0], 1);
}

should also work.
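Both fixes can be checked the same way on the CPU. This sketch is hypothetical and assumes a single 16x16 block, so gda = blockDim.y * gridDim.y = 16, and uses M = N = 2 for the question's 2x2 array:

```python
# CPU simulation of both corrected indexing schemes for one 16x16 block.
block_dim_x = block_dim_y = 16
grid_dim_y = 1
size = 4       # a.size
M = N = 2      # a.shape

# Option 1: unique linear index, tid = y + gda * x
gda = block_dim_y * grid_dim_y
tids = [y + gda * x for x in range(block_dim_x)
                    for y in range(block_dim_y)
                    if y + gda * x < size]

# Option 2: per-dimension bounds check, (x < M) && (y < N)
bounded = [(x, y) for x in range(block_dim_x)
                  for y in range(block_dim_y)
                  if x < M and y < N]

print(len(tids), len(bounded))  # 4 4 -- each scheme counts every element exactly once
```

Under either scheme exactly four threads would reach the atomicAdd, giving the expected histogram [[4 0 0 0]].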