CUDA 索引未按预期工作
CUDA indexing does not work as expected
我正在尝试使用 PyCUDA 处理二维数组,我需要每个线程的 x、y 坐标。
这个问题已被问及回答 here and here,但链接的解决方案不适用于超过我的块大小的 2D 数据。为什么?
这是我用来帮助解决这个问题的 SourceModule:
mod = SourceModule("""
__global__ void kIndexTest(float *M, float *X, float*Y)
{
int bIdx = blockIdx.x + blockIdx.y * gridDim.x;
int idx = bIdx * (blockDim.x * blockDim.y) + (threadIdx.y * blockDim.x) + threadIdx.x;
/* this array shows me the unique thread indices */
M[idx] = idx;
/* these arrays should capture x, y for each unique index */
X[idx] = (blockDim.x * blockIdx.x) + threadIdx.x;
Y[idx] = (blockDim.y * blockIdx.y) + threadIdx.y;
}
""")
我正在这样执行内核:
gIndexTest = mod.get_function("kIndexTest")
dims = (8, 8)
M = gpuarray.to_gpu(numpy.zeros(dims, dtype=numpy.float32))
X = gpuarray.to_gpu(numpy.zeros(dims, dtype=numpy.float32))
Y = gpuarray.to_gpu(numpy.zeros(dims, dtype=numpy.float32))
gIndexTest(M, X, Y, block=(4, 4, 1), grid=(2, 2, 1))
M returns 是我测试过的所有维度和所有 block/grid 配置的正确索引。当 X 和 Y 的尺寸与块尺寸相同时,X 和 Y 仅 return 正确的坐标值,但不要 return 我期望的其他情况。例如,上面的配置产生:
M:
[[ 0. 1. 2. 3. 4. 5. 6. 7.]
[ 8. 9. 10. 11. 12. 13. 14. 15.]
[ 16. 17. 18. 19. 20. 21. 22. 23.]
[ 24. 25. 26. 27. 28. 29. 30. 31.]
[ 32. 33. 34. 35. 36. 37. 38. 39.]
[ 40. 41. 42. 43. 44. 45. 46. 47.]
[ 48. 49. 50. 51. 52. 53. 54. 55.]
[ 56. 57. 58. 59. 60. 61. 62. 63.]] (correct)
X:
[[ 0. 1. 2. 3. 0. 1. 2. 3.]
[ 0. 1. 2. 3. 0. 1. 2. 3.]
[ 4. 5. 6. 7. 4. 5. 6. 7.]
[ 4. 5. 6. 7. 4. 5. 6. 7.]
[ 0. 1. 2. 3. 0. 1. 2. 3.]
[ 0. 1. 2. 3. 0. 1. 2. 3.]
[ 4. 5. 6. 7. 4. 5. 6. 7.]
[ 4. 5. 6. 7. 4. 5. 6. 7.]] (not what I expect)
Y:
[[ 0. 0. 0. 0. 1. 1. 1. 1.]
[ 2. 2. 2. 2. 3. 3. 3. 3.]
[ 0. 0. 0. 0. 1. 1. 1. 1.]
[ 2. 2. 2. 2. 3. 3. 3. 3.]
[ 4. 4. 4. 4. 5. 5. 5. 5.]
[ 6. 6. 6. 6. 7. 7. 7. 7.]
[ 4. 4. 4. 4. 5. 5. 5. 5.]
[ 6. 6. 6. 6. 7. 7. 7. 7.]] (not what I expect)
这是我对 X 和 Y 的实际期望:
X:
[[ 0. 1. 2. 3. 4. 5. 6. 7.]
[ 0. 1. 2. 3. 4. 5. 6. 7.]
[ 0. 1. 2. 3. 4. 5. 6. 7.]
[ 0. 1. 2. 3. 4. 5. 6. 7.]
[ 0. 1. 2. 3. 4. 5. 6. 7.]
[ 0. 1. 2. 3. 4. 5. 6. 7.]
[ 0. 1. 2. 3. 4. 5. 6. 7.]
[ 0. 1. 2. 3. 4. 5. 6. 7.]] (only works when X dims = block dims)
Y:
[[ 0. 0. 0. 0. 0. 0. 0. 0.]
[ 1. 1. 1. 1. 1. 1. 1. 1.]
[ 2. 2. 2. 2. 2. 2. 2. 2.]
[ 3. 3. 3. 3. 3. 3. 3. 3.]
[ 4. 4. 4. 4. 4. 4. 4. 4.]
[ 5. 5. 5. 5. 5. 5. 5. 5.]
[ 6. 6. 6. 6. 6. 6. 6. 6.]
[ 7. 7. 7. 7. 7. 7. 7. 7.]] (only works when Y dims = block dims)
我有什么不明白的?
这是我的设备查询:
Device 0: "GeForce GT 755M"
CUDA Driver Version / Runtime Version 7.5 / 6.5
CUDA Capability Major/Minor version number: 3.0
Total amount of global memory: 1024 MBytes (1073283072 bytes)
( 2) Multiprocessors, (192) CUDA Cores/MP: 384 CUDA Cores
GPU Clock rate: 1085 MHz (1.09 GHz)
Memory Clock rate: 2500 Mhz
Memory Bus Width: 128-bit
L2 Cache Size: 262144 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 1 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device PCI Bus ID / PCI location ID: 1 / 0
一切正常"as advertised"。这里的问题是您将不兼容的索引方案混合在一起,这会产生不一致的结果。
如果您希望 X
和 Y
按您的预期出现,您需要以不同的方式计算 idx
:
__global__ void kIndexTest(float *M, float *X, float*Y)
{
int xidx = (blockDim.x * blockIdx.x) + threadIdx.x;
int yidx = (blockDim.y * blockIdx.y) + threadIdx.y;
int idx = (gridDim.x * blockDim.x * yidx) + xidx;
X[idx] = xidx;
Y[idx] = yidx;
M[idx] = idx;
}
在这个方案中,xidx
和yidx
是网格x和y坐标,idx
是全局索引,都假设列主要排序(即x是最快的不同维度)。
我正在尝试使用 PyCUDA 处理二维数组,我需要每个线程的 x、y 坐标。
这个问题已被问及回答 here and here,但链接的解决方案不适用于超过我的块大小的 2D 数据。为什么?
这是我用来帮助解决这个问题的 SourceModule:
mod = SourceModule("""
__global__ void kIndexTest(float *M, float *X, float*Y)
{
int bIdx = blockIdx.x + blockIdx.y * gridDim.x;
int idx = bIdx * (blockDim.x * blockDim.y) + (threadIdx.y * blockDim.x) + threadIdx.x;
/* this array shows me the unique thread indices */
M[idx] = idx;
/* these arrays should capture x, y for each unique index */
X[idx] = (blockDim.x * blockIdx.x) + threadIdx.x;
Y[idx] = (blockDim.y * blockIdx.y) + threadIdx.y;
}
""")
我正在这样执行内核:
gIndexTest = mod.get_function("kIndexTest")
dims = (8, 8)
M = gpuarray.to_gpu(numpy.zeros(dims, dtype=numpy.float32))
X = gpuarray.to_gpu(numpy.zeros(dims, dtype=numpy.float32))
Y = gpuarray.to_gpu(numpy.zeros(dims, dtype=numpy.float32))
gIndexTest(M, X, Y, block=(4, 4, 1), grid=(2, 2, 1))
M returns 是我测试过的所有维度和所有 block/grid 配置的正确索引。当 X 和 Y 的尺寸与块尺寸相同时,X 和 Y 仅 return 正确的坐标值,但不要 return 我期望的其他情况。例如,上面的配置产生:
M:
[[ 0. 1. 2. 3. 4. 5. 6. 7.]
[ 8. 9. 10. 11. 12. 13. 14. 15.]
[ 16. 17. 18. 19. 20. 21. 22. 23.]
[ 24. 25. 26. 27. 28. 29. 30. 31.]
[ 32. 33. 34. 35. 36. 37. 38. 39.]
[ 40. 41. 42. 43. 44. 45. 46. 47.]
[ 48. 49. 50. 51. 52. 53. 54. 55.]
[ 56. 57. 58. 59. 60. 61. 62. 63.]] (correct)
X:
[[ 0. 1. 2. 3. 0. 1. 2. 3.]
[ 0. 1. 2. 3. 0. 1. 2. 3.]
[ 4. 5. 6. 7. 4. 5. 6. 7.]
[ 4. 5. 6. 7. 4. 5. 6. 7.]
[ 0. 1. 2. 3. 0. 1. 2. 3.]
[ 0. 1. 2. 3. 0. 1. 2. 3.]
[ 4. 5. 6. 7. 4. 5. 6. 7.]
[ 4. 5. 6. 7. 4. 5. 6. 7.]] (not what I expect)
Y:
[[ 0. 0. 0. 0. 1. 1. 1. 1.]
[ 2. 2. 2. 2. 3. 3. 3. 3.]
[ 0. 0. 0. 0. 1. 1. 1. 1.]
[ 2. 2. 2. 2. 3. 3. 3. 3.]
[ 4. 4. 4. 4. 5. 5. 5. 5.]
[ 6. 6. 6. 6. 7. 7. 7. 7.]
[ 4. 4. 4. 4. 5. 5. 5. 5.]
[ 6. 6. 6. 6. 7. 7. 7. 7.]] (not what I expect)
这是我对 X 和 Y 的实际期望:
X:
[[ 0. 1. 2. 3. 4. 5. 6. 7.]
[ 0. 1. 2. 3. 4. 5. 6. 7.]
[ 0. 1. 2. 3. 4. 5. 6. 7.]
[ 0. 1. 2. 3. 4. 5. 6. 7.]
[ 0. 1. 2. 3. 4. 5. 6. 7.]
[ 0. 1. 2. 3. 4. 5. 6. 7.]
[ 0. 1. 2. 3. 4. 5. 6. 7.]
[ 0. 1. 2. 3. 4. 5. 6. 7.]] (only works when X dims = block dims)
Y:
[[ 0. 0. 0. 0. 0. 0. 0. 0.]
[ 1. 1. 1. 1. 1. 1. 1. 1.]
[ 2. 2. 2. 2. 2. 2. 2. 2.]
[ 3. 3. 3. 3. 3. 3. 3. 3.]
[ 4. 4. 4. 4. 4. 4. 4. 4.]
[ 5. 5. 5. 5. 5. 5. 5. 5.]
[ 6. 6. 6. 6. 6. 6. 6. 6.]
[ 7. 7. 7. 7. 7. 7. 7. 7.]] (only works when Y dims = block dims)
我有什么不明白的?
这是我的设备查询:
Device 0: "GeForce GT 755M"
CUDA Driver Version / Runtime Version 7.5 / 6.5
CUDA Capability Major/Minor version number: 3.0
Total amount of global memory: 1024 MBytes (1073283072 bytes)
( 2) Multiprocessors, (192) CUDA Cores/MP: 384 CUDA Cores
GPU Clock rate: 1085 MHz (1.09 GHz)
Memory Clock rate: 2500 Mhz
Memory Bus Width: 128-bit
L2 Cache Size: 262144 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 1 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device PCI Bus ID / PCI location ID: 1 / 0
一切正常"as advertised"。这里的问题是您将不兼容的索引方案混合在一起,这会产生不一致的结果。
如果您希望 X
和 Y
按您的预期出现,您需要以不同的方式计算 idx
:
__global__ void kIndexTest(float *M, float *X, float*Y)
{
int xidx = (blockDim.x * blockIdx.x) + threadIdx.x;
int yidx = (blockDim.y * blockIdx.y) + threadIdx.y;
int idx = (gridDim.x * blockDim.x * yidx) + xidx;
X[idx] = xidx;
Y[idx] = yidx;
M[idx] = idx;
}
在这个方案中,xidx
和yidx
是网格x和y坐标,idx
是全局索引,都假设列主要排序(即x是最快的不同维度)。