Plus-equal (+=) operator in pycuda

Question

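I would like to implement a variant of convolution in pycuda.

For simplicity, I will show a rectangular (box) kernel used for interpolation. The standard convolution can be applied as follows: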

import pycuda.autoinit
import pycuda.driver as drv
import numpy as np
from pycuda.compiler import SourceModule

mod = SourceModule("""
#include <stdio.h>
__global__ void func(float *dest, float *a)
{
  const int img_size = 64;
  const int kernel_size = 3;
  const int kernel_size_half = kernel_size/2;
  const int tx = blockIdx.x * blockDim.x + threadIdx.x;
  const int ty = blockIdx.y * blockDim.y + threadIdx.y;
  int tx_kernel;

  tx_kernel = tx - kernel_size_half;
  for (int idx=-kernel_size_half; idx <= kernel_size_half; idx++) 
  {
      tx_kernel = tx + idx ;
      if ((tx_kernel < 0) || (tx_kernel > img_size-1))
        continue;
      dest[ty * img_size + tx] +=  a[ty * img_size + tx_kernel] / ((float) kernel_size);
  }
}
""")

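Instead of computing the value at the current position from its neighbours, I would like to do the opposite: add the value of the current pixel to its neighbours.

That is, change the line: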

dest[ty * img_size + tx] +=  a[ty * img_size + tx_kernel] / ((float) kernel_size);

to:

dest[ty * img_size + tx_kernel] +=  a[ty * img_size + tx] / ((float) kernel_size);

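However, the first version works fine but the second does not: it gives wrong results when updating the neighbours. Is there a way around this?

Note: I simplified the problem to focus on what I need. The general problem is to use a different convolution kernel for each pixel, rather than the same kernel everywhere as in the example above.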

Answer 1

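From a performance perspective, the first method is preferred. The second form fails because several threads add to the same element of dest at the same time, and a plain += on global memory is a non-atomic read-modify-write, so some updates can be lost. However, if you do wish to "update the neighbours", it should be possible to recast the second operation as: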

atomicAdd(&(dest[ty * img_size + tx_kernel]), a[ty * img_size + tx] / ((float) kernel_size));
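
To make the difference concrete, here is a small CPU-only sketch in plain Python (no GPU needed). It is not the asker's kernel, only a model of one image row: the function names (neighbours, gather, scatter_atomic, scatter_racy) are made up for illustration, and scatter_racy assumes one deliberately pessimistic interleaving in which every thread reads dest before any thread writes. Real hardware scheduling may lose fewer updates, or different ones.

```python
KERNEL_SIZE = 3
HALF = KERNEL_SIZE // 2


def neighbours(x, width):
    """Yield the in-bounds column indices covered by the kernel centred at x."""
    for offset in range(-HALF, HALF + 1):
        if 0 <= x + offset < width:
            yield x + offset


def gather(row):
    """First form: each output element sums its own neighbours.
    Every 'thread' x writes only dest[x], so there are no conflicts."""
    dest = [0.0] * len(row)
    for x in range(len(row)):
        for n in neighbours(x, len(row)):
            dest[x] += row[n] / KERNEL_SIZE
    return dest


def scatter_atomic(row):
    """Second form with updates applied one at a time.
    The sequential loop plays the role of atomicAdd."""
    dest = [0.0] * len(row)
    for x in range(len(row)):
        for n in neighbours(x, len(row)):
            dest[n] += row[x] / KERNEL_SIZE
    return dest


def scatter_racy(row):
    """Second form with a plain +=, under an assumed worst-case interleaving:
    all 'threads' read dest (still all zeros) before any of them writes back."""
    dest = [0.0] * len(row)
    stale = list(dest)  # the value every thread saw when it read dest
    for x in range(len(row)):
        for n in neighbours(x, len(row)):
            # Writes back stale + contribution, overwriting earlier updates.
            dest[n] = stale[n] + row[x] / KERNEL_SIZE
    return dest


row = [3.0, 6.0, 9.0, 12.0]
print(gather(row))          # [3.0, 6.0, 9.0, 7.0]
print(scatter_atomic(row))  # [3.0, 6.0, 9.0, 7.0]
print(scatter_racy(row))    # [2.0, 3.0, 4.0, 4.0]
```

With the uniform kernel used here, the gather form and the atomic scatter form give the same result, which is why the gather form is the cheaper choice when it is applicable. The scatter form becomes interesting mainly in the general case mentioned in the question, where each pixel has its own kernel.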
