How to relate the kernel input data structure in a CUDA kernel function to the parameter input in PyCUDA

Question

I am writing a CUDA kernel to convert an RGBA image to a greyscale image in PyCUDA. Here is the PyCUDA code:

import numpy as np
import matplotlib.pyplot as plt
import pycuda.autoinit
import pycuda.driver as cuda
from pycuda.compiler import SourceModule
kernel = SourceModule("""
#include <stdio.h>
__global__ void rgba_to_greyscale(const uchar4* const rgbaImage,
                   unsigned char* const greyImage,
                   int numRows, int numCols)
{
  int y = threadIdx.y+ blockIdx.y* blockDim.y;
  int x = threadIdx.x+ blockIdx.x* blockDim.x;
  // x indexes columns and y indexes rows of a row-major image
  if (x < numCols && y < numRows) {
    int index = numCols*y + x;
    uchar4 color = rgbaImage[index];
    unsigned char grey = (unsigned char)(0.299f*color.x+ 0.587f*color.y + 
    0.114f*color.z);
    greyImage[index] = grey;
 }
}
""")

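The question, however, is how to associate uchar4* with a NumPy array. I know I could change my kernel function to accept int* or float* and make it work, but I would like to know how to make the kernel function above work in PyCUDA.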

Here is the host code.

def gpu_rgb2gray(image):
    shape = image.shape
    n_rows, n_cols, _ = np.array(shape, dtype=int)
    image_gray = np.empty((n_rows, n_cols), dtype=int)
    ## HERE is the confusing part: how to rearrange image to match uchar4*?
    image = image.reshape(1, -1, 4)
    # Get kernel function
    rgba2gray = kernel.get_function("rgba_to_greyscale")
    # Define block, grid and compute
    blockDim = (32, 32, 1) # 1024 threads in total
    dx, mx = divmod(shape[1], blockDim[0])
    dy, my = divmod(shape[0], blockDim[1])
    gridDim = ((dx + (mx>0)), (dy + (my>0)), 1)
    # Kernel function
    # HERE doesn't work because of mismatch
    rgba2gray (
        cuda.In(image), cuda.Out(image_gray), n_rows, n_cols,
        block=blockDim, grid=gridDim)
    return image_gray

Does anyone have any ideas? Thanks!

Answer 1

The gpuarray class natively supports CUDA's built-in vector types (including uchar4).

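So you can create a gpuarray instance with the dtype the kernel expects, copy the host image into that gpuarray through the buffer interface, and then pass the gpuarray as the kernel input argument. As an example (if I have understood your code correctly), something like this should work: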

import pycuda.gpuarray as gpuarray

....

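def gpu_rgb2gray(image):
    # Assumes image is a C-contiguous 2D array with one packed 32-bit RGBA value per pixel
    shape = image.shape
    # One uchar4 element per pixel on the device
    image_rgb = gpuarray.empty(shape, dtype=gpuarray.vec.uchar4)
    cuda.memcpy_htod(image_rgb.gpudata, image.data)
    # One unsigned char per pixel for the output
    image_gray = gpuarray.empty(shape, dtype=np.uint8)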

    # Get kernel function
    rgba2gray = kernel.get_function("rgba_to_greyscale")
    # Define block, grid and compute
    blockDim = (32, 32, 1) # 1024 threads in total
    dx, mx = divmod(shape[1], blockDim[0])
    dy, my = divmod(shape[0], blockDim[1])
    gridDim = ((dx + (mx>0)), (dy + (my>0)), 1)
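    rgba2gray(image_rgb, image_gray, np.int32(shape[0]), np.int32(shape[1]), block=blockDim, grid=gridDim)

    # Copy the result back to the host and upcast it to integers there
    # (np.int was removed from recent NumPy releases; the builtin int is equivalent)
    img_gray = np.array(image_gray.get(), dtype=int)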

    return img_gray

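This takes an image of 32-bit unsigned integers (one packed RGBA value per pixel), copies it to a uchar4 array on the GPU, and then, after copying the resulting uchar array back, upcasts it to integers on the host.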
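
If the host image is instead an (H, W, 4) array of uint8 channels, as in the question, it would first need to be viewed as one 4-byte element per pixel. The following is only a rough sketch of that step in plain NumPy, so it runs without a GPU. The `uchar4` dtype defined here is a stand-in assumed to have the same byte layout as `gpuarray.vec.uchar4`, and `pack_rgba` and `grey_reference` are hypothetical helper names, not part of PyCUDA. `grey_reference` applies the same weights as the kernel on the CPU, which may be useful for checking the GPU output (results could differ by one in rare cases because of floating-point rounding).

```python
import numpy as np

# Stand-in for gpuarray.vec.uchar4: four uint8 fields named x, y, z, w.
# This is an assumption made so the sketch runs without PyCUDA or a GPU.
uchar4 = np.dtype([("x", np.uint8), ("y", np.uint8),
                   ("z", np.uint8), ("w", np.uint8)])


def pack_rgba(image):
    """View an (H, W, 4) uint8 image as an (H, W) array of uchar4-like pixels."""
    if image.dtype != np.uint8:
        raise ValueError("expected a uint8 image")
    if image.ndim != 3 or image.shape[2] != 4:
        raise ValueError("expected an (H, W, 4) RGBA image")
    # The view below needs the four channels of a pixel to be adjacent in memory.
    image = np.ascontiguousarray(image)
    # Reinterpret each group of four bytes as one element; the view itself
    # does not copy the pixel data.
    return image.view(uchar4).reshape(image.shape[:2])


def grey_reference(image):
    """CPU reference using the same weights as the kernel (truncating cast)."""
    r, g, b = (image[..., i].astype(np.float32) for i in range(3))
    return (0.299 * r + 0.587 * g + 0.114 * b).astype(np.uint8)
```

The packed array has shape `image.shape[:2]` and four bytes per element, so its buffer has the layout that the `memcpy_htod` call in the answer expects for a uchar4 array of the same shape.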

