Parallel sum reduction implementation in OpenCL
I am looking at the NVIDIA sample code provided at link. In the sample kernel code (file oclReduction_kernel.c), reduce4 uses the following techniques:

1) unrolling and removing the synchronization barrier for thread IDs < 32;

2) apart from this, the code uses blockSize checks to sum the data in local memory. I thought that in OpenCL we have get_local_size(0/1) to find out the work-group size, so blockSize is confusing me.

I cannot understand either of the two points above. Why and how do these things help with optimization? Any explanation regarding reduce5 and reduce6 would also be helpful.
You pretty much have this explained in slides 21 and 22 of https://docs.nvidia.com/cuda/samples/6_Advanced/reduction/doc/reduction.pdf, which @Marco13 linked in the comments.
- As reduction proceeds, # “active” threads decreases
- When s <= 32, we have only one warp left
- Instructions are SIMD synchronous within a warp.
- That means when s <= 32:
- We don’t need to __syncthreads()
- We don’t need “if (tid < s)” because it doesn’t save any work

Without unrolling, all warps execute every iteration of the for loop and if statement
Then https://www.pgroup.com/lit/articles/insider/v2n1a5.htm:
The code is actually executed in groups of 32 threads, what NVIDIA calls a warp.

Each core can execute a sequential thread, but the cores execute in what NVIDIA calls SIMT (Single Instruction, Multiple Thread) fashion; all cores in the same group execute the same instruction at the same time, much like classical SIMD processors.
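Putting those two quotes together, here is a minimal sketch (my own paraphrase, not the SDK source) of what the "unrolled last warp" from your point 1) looks like in an OpenCL reduction kernel. It assumes a power-of-two work-group size of at least 64, and it relies on NVIDIA's lockstep warp execution plus the volatile qualifier; the OpenCL standard itself does not guarantee this behavior:

```c
__kernel void reduce_last_warp(__global const float* g_in,
                               __global float* g_out,
                               __local volatile float* sdata,   // one float per work-item
                               unsigned int n)
{
    unsigned int tid = get_local_id(0);
    unsigned int i   = get_global_id(0);

    // Load one element per work-item into local memory (0 for out-of-range items).
    sdata[tid] = (i < n) ? g_in[i] : 0.0f;
    barrier(CLK_LOCAL_MEM_FENCE);

    // Tree reduction while more than one warp of work-items is still active.
    for (unsigned int s = get_local_size(0) / 2; s > 32; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    // Last 64 -> 1 elements: work-items 0..31 are a single warp and run in
    // lockstep on NVIDIA hardware, so the barrier and the "if (tid < s)" test
    // are dropped; the redundant adds done by the upper lanes are never read.
    if (tid < 32) {
        sdata[tid] += sdata[tid + 32];
        sdata[tid] += sdata[tid + 16];
        sdata[tid] += sdata[tid +  8];
        sdata[tid] += sdata[tid +  4];
        sdata[tid] += sdata[tid +  2];
        sdata[tid] += sdata[tid +  1];
    }

    // Work-item 0 writes this work-group's partial sum.
    if (tid == 0)
        g_out[get_group_id(0)] = sdata[0];
}
```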
Re 2): blockSize appears to be the size of the work group.
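The point of blockSize, as opposed to get_local_size(0), is that it is a compile-time constant, so the compiler can evaluate the if (blockSize >= ...) tests and delete the untaken branches, leaving a fully unrolled reduction. The following is a rough sketch of that idea, assuming blockSize is a power of two (at least 64) injected when the program is built, e.g. with a -D blockSize=256 build option; the real SDK kernels may differ in details:

```c
// Sketch only: blockSize is assumed to be defined at build time by the host
// (hypothetical build option "-D blockSize=<work-group size>").
#ifndef blockSize
#define blockSize 256
#endif

__kernel void reduce_fully_unrolled(__global const float* g_in,
                                    __global float* g_out,
                                    __local volatile float* sdata,
                                    unsigned int n)
{
    unsigned int tid = get_local_id(0);
    unsigned int i   = get_global_id(0);

    sdata[tid] = (i < n) ? g_in[i] : 0.0f;
    barrier(CLK_LOCAL_MEM_FENCE);

    // blockSize is known at compile time, so each of these tests is resolved
    // by the compiler and the dead branches disappear entirely.
    if (blockSize >= 512) { if (tid < 256) sdata[tid] += sdata[tid + 256]; barrier(CLK_LOCAL_MEM_FENCE); }
    if (blockSize >= 256) { if (tid < 128) sdata[tid] += sdata[tid + 128]; barrier(CLK_LOCAL_MEM_FENCE); }
    if (blockSize >= 128) { if (tid <  64) sdata[tid] += sdata[tid +  64]; barrier(CLK_LOCAL_MEM_FENCE); }

    // Last warp, unrolled as in the previous sketch.
    if (tid < 32) {
        if (blockSize >= 64) sdata[tid] += sdata[tid + 32];
        if (blockSize >= 32) sdata[tid] += sdata[tid + 16];
        if (blockSize >= 16) sdata[tid] += sdata[tid +  8];
        if (blockSize >=  8) sdata[tid] += sdata[tid +  4];
        if (blockSize >=  4) sdata[tid] += sdata[tid +  2];
        if (blockSize >=  2) sdata[tid] += sdata[tid +  1];
    }

    if (tid == 0)
        g_out[get_group_id(0)] = sdata[0];
}
```

As far as I can tell from the matching CUDA slides, reduce5 is essentially this completely unrolled version, and reduce6 additionally has each work-item sum several input elements in a loop while loading from global memory, before the local-memory tree reduction starts; both steps are described in the later slides of the reduction.pdf linked above.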