OpenCL

Question

Starting from the following post, I tried to implement a sum reduction of an array with this kernel code:

 #pragma OPENCL EXTENSION cl_khr_int64_base_atomics : enable

__kernel void sumGPU ( __global const long *input, 
               __global long *finalSum
               )
 {
  uint local_id = get_local_id(0);
  uint group_size = get_local_size(0);

  // Temporary local value
  local long tempInput;

  tempInput = input[local_id];

  // Variable for final sum 
  local long totalSumIntegerPart[1];

  // Initialize sums
  if (local_id==0)
    totalSumIntegerPart[0] = 0;

  // Compute atom_add into each workGroup 
  barrier(CLK_LOCAL_MEM_FENCE);

  atom_add(&totalSumIntegerPart[0], tempInput);

  barrier(CLK_LOCAL_MEM_FENCE);

  // Perform sum of each workGroup sum
  if (local_id==(get_local_size(0)-1))
    atom_add(finalSum, totalSumIntegerPart[0]);

}

However, finalSum does not have the expected value. I initialize the input array like this:

 for (i=0; i<nWorkItems; i++)
    input[i] = i+1;

So for nWorkItems = 1024 I expect finalSum = nWorkItems*(nWorkItems+1)/2 = 524800.

Instead, I get finalSum = 16384, using sizeWorkGroup = 16 and nWorkItems = 1024.

Strangely, with sizeWorkGroup = 32 and nWorkItems = 1024, I get a different value: finalSum = 32768.

I don't understand the last instruction, which is supposed to add up the partial sums (one per work-group):

// Perform sum of each workGroup sum
  if (local_id==(get_local_size(0)-1))
    atom_add(finalSum, totalSumIntegerPart[0]);

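Indeed, I would have thought that the instruction atom_add(finalSum, totalSumIntegerPart[0]); should not depend on which local_id the if condition selects.

What matters most is that this instruction is executed "number of work-groups" times (given that finalSum is a value shared between all work-groups, isn't it?).

So I thought I could replace: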

// Perform sum of each workGroup sum
  if (local_id==(get_local_size(0)-1))
    atom_add(finalSum, totalSumIntegerPart[0]);

with:

 // Perform sum of each workGroup sum
      if (local_id==0)
        atom_add(finalSum, totalSumIntegerPart[0]);

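Could anyone help me get the correct value, i.e. finalSum equal to 524800, with my parameters (sizeWorkGroup = 16 and nWorkItems = 1024)?

Or explain to me why this final sum does not behave as expected?

Update:

Here is the kernel code from the following link (it differs slightly from mine because the atomic add here only adds 1 for each work-item):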

kernel void AtomicSum(global int* sum)
{
  local int tmpSum[1];
  if (get_local_id(0) == 0) {
    tmpSum[0] = 0;
  }

  barrier(CLK_LOCAL_MEM_FENCE);
  atomic_add(&tmpSum[0], 1);
  barrier(CLK_LOCAL_MEM_FENCE);

  if (get_local_id(0) == (get_local_size(0) - 1)) {
    atomic_add(sum, tmpSum[0]);
  }
}

I mean, is this valid kernel code? Does it give correct results?

Maybe one solution would be to put this at the beginning of my kernel code:

unsigned int tid = get_local_id(0);
unsigned int gid = get_global_id(0);
unsigned int localSize = get_local_size(0);
// load one tile into local memory
int idx = i * localSize + tid;
localInput[tid] = input[idx];
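
The snippet above leaves `i` undefined. Assuming it is meant to be the work-group index, `get_group_id(0)`, and that the kernel is launched in one dimension with no global offset, the tile index is the same number as `get_global_id(0)`. A minimal host-side C sketch of that arithmetic follows; `tile_index` and `tiles_cover_input` are hypothetical helper names, not OpenCL functions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: the index a work-item would read if `i` in the snippet
 * stands for the work-group index (an assumption, not stated in the post). */
size_t tile_index(size_t group_id, size_t local_size, size_t local_id)
{
    assert(local_id < local_size);
    return group_id * local_size + local_id;
}

/* Returns 1 when the tile indices visit 0 .. n_items-1 exactly once, in order,
 * which matches get_global_id(0) for a 1D launch with zero global offset. */
int tiles_cover_input(size_t n_items, size_t local_size)
{
    assert(local_size > 0 && n_items % local_size == 0);
    size_t expected_global_id = 0;
    for (size_t group = 0; group < n_items / local_size; ++group) {
        for (size_t lid = 0; lid < local_size; ++lid) {
            if (tile_index(group, local_size, lid) != expected_global_id)
                return 0;
            ++expected_global_id;
        }
    }
    return expected_global_id == n_items;
}
```

For example, with sizeWorkGroup = 16, work-item 5 of work-group 3 maps to element 53, so under these assumptions reading `input[get_global_id(0)]` directly would fetch the same element.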

I will test it and keep you posted.

Thanks

Answer 1

This line is wrong:

tempInput = input[local_id];

It should be:

tempInput = input[get_global_id(0)];

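You are always summing the first region of the input, which is consistent with your strange results and with the way they depend on the work-group size: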

16*16*64 = 16384
32*32*32 = 32768
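
One reading that reproduces both numbers: `tempInput` is declared `local`, so all work-items of a group share it, and with `input[local_id]` every group reads the same first sizeWorkGroup elements. If the last work-item's write happens to win (an assumption; the unsynchronized writes are a data race, so the outcome is not guaranteed), every work-item adds `input[sizeWorkGroup - 1]`, which equals sizeWorkGroup. The sequential C sketch below models that scenario next to the corrected indexing; all function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Fills input[i] = i + 1, as in the question. */
void fill_input(long *input, size_t n_items)
{
    for (size_t i = 0; i < n_items; ++i)
        input[i] = (long)i + 1;
}

/* Model of the original kernel, ASSUMING the shared local `tempInput` ends up
 * holding the value written by the last work-item of each group. */
long model_original_kernel(const long *input, size_t n_items, size_t group_size)
{
    assert(group_size > 0 && n_items % group_size == 0);
    long final_sum = 0;
    for (size_t group = 0; group < n_items / group_size; ++group) {
        long temp_input = input[group_size - 1]; /* last write of input[local_id] */
        long group_sum = 0;
        for (size_t lid = 0; lid < group_size; ++lid)
            group_sum += temp_input;             /* each work-item's atom_add */
        final_sum += group_sum;                  /* one atom_add per work-group */
    }
    return final_sum;
}

/* Model of the corrected kernel: each work-item adds input[get_global_id(0)]. */
long model_fixed_kernel(const long *input, size_t n_items, size_t group_size)
{
    assert(group_size > 0 && n_items % group_size == 0);
    long final_sum = 0;
    for (size_t group = 0; group < n_items / group_size; ++group) {
        long group_sum = 0;
        for (size_t lid = 0; lid < group_size; ++lid)
            group_sum += input[group * group_size + lid];
        final_sum += group_sum;
    }
    return final_sum;
}
```

With 1024 items filled by `fill_input`, the first model returns 16384 for a group size of 16 and 32768 for a group size of 32, while the second returns 524800 for both. Under this reading, the per-work-item value should also not live in a shared `local` variable.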

Your code can also be simplified a bit:

  uint local_id = get_local_id(0);

  // Variable for final sum 
  local long totalSumIntegerPart;

  // Initialize sums
  if (local_id==0)
    totalSumIntegerPart = 0;

  // Compute atom_add into each workGroup 
  barrier(CLK_LOCAL_MEM_FENCE);    
  atom_add(&totalSumIntegerPart, input[get_global_id(0)]);    
  barrier(CLK_LOCAL_MEM_FENCE);

  // Perform sum of each workGroup sum
  if (local_id==0)
    atom_add(finalSum, totalSumIntegerPart);

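Also, I would not overuse atomics the way you do, since they are not the most efficient way to perform a reduction. With a proper reduction method you could probably get a speedup of 10x or more. As a PoC, or for learning local memory and CL, it is fine though.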
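
For comparison, a common alternative to per-element atomics is a tree reduction in local memory: each work-item stores one element in a local scratch array, then the active half of the work-items repeatedly adds the other half's values, with a barrier between steps. The plain C sketch below only simulates that access pattern sequentially on the host. It is not kernel code and says nothing about the actual speedup. It assumes a power-of-two group size of at most 256, and all names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_GROUP_SIZE 256 /* assumed upper bound for this sketch */

/* Simulates the tree reduction one work-group would perform in local memory. */
long reduce_one_group(const long *input, size_t group, size_t group_size)
{
    assert(group_size > 0 && group_size <= MAX_GROUP_SIZE);
    assert((group_size & (group_size - 1)) == 0); /* power of two */

    long scratch[MAX_GROUP_SIZE]; /* stands in for a local-memory array */
    for (size_t lid = 0; lid < group_size; ++lid)
        scratch[lid] = input[group * group_size + lid];

    /* Each pass of the outer loop is one barrier-separated step in a kernel;
     * the inner loop covers the work-items that are still active. */
    for (size_t stride = group_size / 2; stride > 0; stride /= 2)
        for (size_t lid = 0; lid < stride; ++lid)
            scratch[lid] += scratch[lid + stride];

    return scratch[0];
}

/* Adds the per-group results; in a kernel this could be one atomic add per
 * work-group or a second reduction pass. */
long tree_reduce(const long *input, size_t n_items, size_t group_size)
{
    assert(group_size > 0 && n_items % group_size == 0);
    long total = 0;
    for (size_t group = 0; group < n_items / group_size; ++group)
        total += reduce_one_group(input, group, group_size);
    return total;
}
```

For a group of 16 work-items this takes 4 halving steps inside the group instead of 16 atomic adds on the same local address, and it still needs only one global update per work-group.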

OpenCL - Method to perform a reduction
