Pycuda Array Indexing with Threads & Blocks
I am trying to write a CUDA histogram function for use with PyCUDA. The code appears to iterate over more elements than the size of the array I pass in. To rule out errors in the bin computation, I created a very simple kernel in which I pass in a 2D array and add 1 to the first histogram bucket for every element processed. I always end up with more counts than there are elements in my 2D array.
The output should be [size_of_2d_array, 0, 0, 0].
I am running Ubuntu 15.04 with Python 2.7.9. When I try examples written by other people, they seem to work fine.
What am I doing wrong?
import pycuda.driver as cuda
import pycuda.autoinit
from pycuda.compiler import SourceModule
import numpy as np
#Make the kernel
histogram_kernel = """
__global__ void kernel_getHist(unsigned int* array,unsigned int size, unsigned int lda, unsigned int* histo, float buckets)
{
unsigned int y = threadIdx.y + blockDim.y * blockIdx.y;
unsigned int x = threadIdx.x + blockDim.x * blockIdx.x;
unsigned int tid = y + lda * x;
if(tid<size){
//unsigned int value = array[tid];
//int bin = floor(value * buckets);
atomicAdd(&histo[0],1);
}
}
"""
mod = SourceModule(histogram_kernel)
#2d array to analyze
a = np.ndarray(shape=(2,2))
a[0,0] = 1
a[0,1] = 2
a[1,0] = 3
a[1,1] = 4
#histogram stuff, passing but not using right now
max_val = 4
num_bins = np.uint32(4)
bin_size = 1 / np.uint32(max_val / num_bins)
#send array to the gpu
a = a.astype(np.float32)
a_gpu = cuda.mem_alloc(a.nbytes)
cuda.memcpy_htod(a_gpu, a)
#send the histogram to the gpu
a_hist = np.ndarray([1,num_bins])
a_hist = a_hist.astype(np.uint32)
a_hist = a_hist * 0
d_hist = cuda.mem_alloc(a_hist.nbytes)
cuda.memcpy_htod(d_hist, a_hist)
#get the function
func = mod.get_function('kernel_getHist')
#get size & lda
a_size = np.uint32(a.size)
a_lda = np.uint32(a.shape[0])
#print size & lda to check
print(a_lda)
print(a_size)
#run function
func(a_gpu, a_size, a_lda, d_hist, bin_size, block=(16,16,1))
#get histogram back
cuda.memcpy_dtoh(a_hist, d_hist)
#print the histogram
print a_hist
print a
This code outputs the following:
2
4
[[6 0 0 0]]
[[ 1. 2.]
[ 3. 4.]]
However, it should output:
2
4
[[4 0 0 0]]
[[ 1. 2.]
[ 3. 4.]]
The histogram has too many counts, which leads me to believe I am doing something wrong with tid and size.
Any ideas?
The problem here is that you are not computing a unique value of tid in the kernel. If you do some simple arithmetic on paper with blockDim.x = blockDim.y = 16 and lda = 2, you should get:
x y tid
0 0 0
1 0 2
0 1 1
1 1 3
0 2 2
0 3 3
.. .. ..
Note that the last two are duplicate indices. This is why your code returns 6: there are 6 threads that satisfy tid < size for size = 4.
You have two options to fix this. One is to compute a unique index correctly, for example:
unsigned int y = threadIdx.y + blockDim.y * blockIdx.y;
unsigned int x = threadIdx.x + blockDim.x * blockIdx.x;
unsigned int gda = blockDim.y * gridDim.y;
unsigned int tid = y + gda * x;
should work. Alternatively, apply bounds on each dimension of the input array:
unsigned int y = threadIdx.y + blockDim.y * blockIdx.y;
unsigned int x = threadIdx.x + blockDim.x * blockIdx.x;
unsigned int tid = y + lda * x;
if ( (x < M) && (y < N) ) {
    atomicAdd(&histo[0], 1);
}
would probably also work.
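Both fixes can be sanity-checked the same way on the host (a sketch, assuming one 16x16 block, so gridDim.y = 1 and gda = 16, and M = N = 2 for the bounds-check variant):

```python
# One 16x16 block, as launched in the question.
BLOCK = 16
size = 4
gda = BLOCK * 1  # gda = blockDim.y * gridDim.y, with a 1-block grid

# Option 1: unique linear index tid = y + gda * x
unique_hits = [y + gda * x
               for x in range(BLOCK) for y in range(BLOCK)
               if y + gda * x < size]

# Option 2: bounds check on each dimension (M rows, N columns)
M = N = 2
bounded_hits = [(x, y)
                for x in range(BLOCK) for y in range(BLOCK)
                if x < M and y < N]

print(len(unique_hits), len(bounded_hits))   # 4 4 -> one increment per element
```

Either way, exactly 4 threads increment the histogram, giving the expected [[4 0 0 0]].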