Parallel sum reduction implementation in OpenCL

I am looking at the sample code provided by NVIDIA at link.

In the sample kernel code (file oclReduction_kernel.c), reduce4 uses the following techniques:

1) Unrolling, and removing the synchronization barrier for thread IDs < 32.

2) Apart from that, the code sums the data in local memory using blockSize checks. I thought that in OpenCL we have get_local_size(0/1) to find out the work-group size, so blockSize confuses me.

I am unable to understand either of the points above. Why and how do these things help with optimization? Any explanation of reduce5 and reduce6 would also be helpful. The part of the kernel I am asking about is sketched below.
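Roughly paraphrased from the sample (identifiers and details may differ slightly):

    // Loop stage: while more than one warp is still active, every step
    // needs a work-group barrier. sdata is a __local buffer holding one
    // partial sum per work-item (declared volatile in the sample).
    for (unsigned int s = get_local_size(0) / 2; s > 32; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    // Unrolled tail for the last warp: no barrier, and blockSize checks
    // instead of get_local_size(0).
    if (tid < 32) {
        if (blockSize >= 64) sdata[tid] += sdata[tid + 32];
        if (blockSize >= 32) sdata[tid] += sdata[tid + 16];
        if (blockSize >= 16) sdata[tid] += sdata[tid +  8];
        if (blockSize >=  8) sdata[tid] += sdata[tid +  4];
        if (blockSize >=  4) sdata[tid] += sdata[tid +  2];
        if (blockSize >=  2) sdata[tid] += sdata[tid +  1];
    }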

This is explained in some detail on slides 21 and 22 of https://docs.nvidia.com/cuda/samples/6_Advanced/reduction/doc/reduction.pdf, which @Marco13 linked in the comments:

  • As reduction proceeds, # “active” threads decreases
    • When s <= 32, we have only one warp left
  • Instructions are SIMD synchronous within a warp.
  • That means when s <= 32:
    • We don’t need to __syncthreads()
    • We don’t need “if (tid < s)” because it doesn’t save any work

Without unrolling, all warps execute every iteration of the for loop and if statement
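For contrast, here is a sketch of what the kernel does without the unrolled tail (this mirrors the earlier kernels in the sample): the loop runs all the way down to s == 1, so the whole work-group keeps executing the loop and the barrier even when only one warp still has useful additions to do:

    // Non-unrolled version (sketch): six extra iterations for
    // s = 32, 16, 8, 4, 2, 1, each paying for a full work-group barrier.
    for (unsigned int s = get_local_size(0) / 2; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        barrier(CLK_LOCAL_MEM_FENCE);
    }

Since the 32 work-items of a warp execute in lockstep anyway, once s <= 32 the barrier is redundant and the if (tid < s) test saves no work, so the unrolled version drops both. Note that the sample declares the local buffer volatile so that the compiler does not cache the intermediate sums in registers between the unrolled steps.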

And then from https://www.pgroup.com/lit/articles/insider/v2n1a5.htm:

The code is actually executed in groups of 32 threads, what NVIDIA calls a warp.

Each core can execute a sequential thread, but the cores execute in what NVIDIA calls SIMT (Single Instruction, Multiple Thread) fashion; all cores in the same group execute the same instruction at the same time, much like classical SIMD processors.
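You can observe this granularity from the host side: querying CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE typically reports the warp size (32) on NVIDIA hardware. A minimal sketch, assuming a kernel and device have already been created:

    #include <stdio.h>
    #include <CL/cl.h>

    /* kernel (cl_kernel) and device (cl_device_id) assumed to exist. */
    size_t multiple = 0;
    clGetKernelWorkGroupInfo(kernel, device,
                             CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE,
                             sizeof(multiple), &multiple, NULL);
    printf("preferred work-group size multiple: %zu\n", multiple);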

Re 2): blockSize looks to be the size of the work-group.
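If I remember correctly, the host side of the sample bakes blockSize into the kernel as a compile-time constant through the program build options; a sketch of that idea (the exact option string and the workGroupSize variable are my assumptions, not necessarily the sample's names):

    // Pass the work-group size as a compile-time constant. Unlike
    // get_local_size(0), which is a run-time value, blockSize is known
    // when the kernel is compiled, so the compiler can evaluate the
    // `if (blockSize >= N)` tests and delete branches that can never run.
    char options[64];
    sprintf(options, "-D blockSize=%u", (unsigned int)workGroupSize);
    clBuildProgram(program, 1, &device, options, NULL, NULL);

That is why the unrolled tail tests blockSize rather than get_local_size(0): the compile-time constant lets the compiler prune the dead comparisons, whereas the run-time value would leave all six of them in the generated code.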