Parallel sum reduction implementation in OpenCL

I am looking at the sample code provided by NVIDIA at link.

In the sample kernel code (file oclReduction_kernel.c), reduce4 uses the following techniques:

1) Unrolling, and removing the synchronization barrier for thread IDs < 32.

2) Apart from that, the code sums the data in local memory using blockSize checks. I thought that in OpenCL we have get_local_size(0/1) to find out the work-group size, so blockSize confuses me.

I am unable to understand either of the points above. Why and how do these things help with optimization? Any explanation of reduce5 and reduce6 would also be helpful. The part of the kernel I am asking about is sketched below.
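Roughly paraphrased from the sample (identifiers and details may differ slightly):

    // Loop stage: while more than one warp is still active, every step
    // needs a work-group barrier. sdata is a __local buffer holding one
    // partial sum per work-item (declared volatile in the sample).
    for (unsigned int s = get_local_size(0) / 2; s > 32; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    // Unrolled tail for the last warp: no barrier, and blockSize checks
    // instead of get_local_size(0).
    if (tid < 32) {
        if (blockSize >= 64) sdata[tid] += sdata[tid + 32];
        if (blockSize >= 32) sdata[tid] += sdata[tid + 16];
        if (blockSize >= 16) sdata[tid] += sdata[tid +  8];
        if (blockSize >=  8) sdata[tid] += sdata[tid +  4];
        if (blockSize >=  4) sdata[tid] += sdata[tid +  2];
        if (blockSize >=  2) sdata[tid] += sdata[tid +  1];
    }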

This is explained in some detail on slides 21 and 22 of https://docs.nvidia.com/cuda/samples/6_Advanced/reduction/doc/reduction.pdf, which @Marco13 linked in the comments:

  • As reduction proceeds, # “active” threads decreases
    • When s <= 32, we have only one warp left
  • Instructions are SIMD synchronous within a warp.
  • That means when s <= 32:
    • We don’t need to __syncthreads()
    • We don’t need “if (tid < s)” because it doesn’t save any work

Without unrolling, all warps execute every iteration of the for loop and if statement
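For contrast, here is a sketch of what the kernel does without the unrolled tail (this mirrors the earlier kernels in the sample): the loop runs all the way down to s == 1, so the whole work-group keeps executing the loop and the barrier even when only one warp still has useful additions to do:

    // Non-unrolled version (sketch): six extra iterations for
    // s = 32, 16, 8, 4, 2, 1, each paying for a full work-group barrier.
    for (unsigned int s = get_local_size(0) / 2; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        barrier(CLK_LOCAL_MEM_FENCE);
    }

Since the 32 work-items of a warp execute in lockstep anyway, once s <= 32 the barrier is redundant and the if (tid < s) test saves no work, so the unrolled version drops both. Note that the sample declares the local buffer volatile so that the compiler does not cache the intermediate sums in registers between the unrolled steps.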

And then from https://www.pgroup.com/lit/articles/insider/v2n1a5.htm:

The code is actually executed in groups of 32 threads, what NVIDIA calls a warp.

Each core can execute a sequential thread, but the cores execute in what NVIDIA calls SIMT (Single Instruction, Multiple Thread) fashion; all cores in the same group execute the same instruction at the same time, much like classical SIMD processors.
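You can observe this granularity from the host side: querying CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE typically reports the warp size (32) on NVIDIA hardware. A minimal sketch, assuming a kernel and device have already been created:

    #include <stdio.h>
    #include <CL/cl.h>

    /* kernel (cl_kernel) and device (cl_device_id) assumed to exist. */
    size_t multiple = 0;
    clGetKernelWorkGroupInfo(kernel, device,
                             CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE,
                             sizeof(multiple), &multiple, NULL);
    printf("preferred work-group size multiple: %zu\n", multiple);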

Re 2): blockSize looks to be the size of the work-group.
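If I remember correctly, the host side of the sample bakes blockSize into the kernel as a compile-time constant through the program build options; a sketch of that idea (the exact option string and the workGroupSize variable are my assumptions, not necessarily the sample's names):

    // Pass the work-group size as a compile-time constant. Unlike
    // get_local_size(0), which is a run-time value, blockSize is known
    // when the kernel is compiled, so the compiler can evaluate the
    // `if (blockSize >= N)` tests and delete branches that can never run.
    char options[64];
    sprintf(options, "-D blockSize=%u", (unsigned int)workGroupSize);
    clBuildProgram(program, 1, &device, options, NULL, NULL);

That is why the unrolled tail tests blockSize rather than get_local_size(0): the compile-time constant lets the compiler prune the dead comparisons, whereas the run-time value would leave all six of them in the generated code.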