如何将CUDA内核函数中的内核输入数据结构与pycuda中的参数输入相关联
How to relate kernel input data structure in CUDA kernel function with parameter input in pycuda
我正在编写一个 cuda 内核来将 rgba 图像转换为 pycuda 中的灰度图像,这是 PyCUDA 代码:
import numpy as np
import matplotlib.pyplot as plt
import pycuda.autoinit
import pycuda.driver as cuda
from pycuda.compiler import SourceModule
kernel = SourceModule("""
#include <stdio.h>
__global__ void rgba_to_greyscale(const uchar4* const rgbaImage,
unsigned char* const greyImage,
int numRows, int numCols)
{
int y = threadIdx.y+ blockIdx.y* blockDim.y;
int x = threadIdx.x+ blockIdx.x* blockDim.x;
if (y < numCols && x < numRows) {
int index = numRows*y +x;
uchar4 color = rgbaImage[index];
unsigned char grey = (unsigned char)(0.299f*color.x+ 0.587f*color.y +
0.114f*color.z);
greyImage[index] = grey;
}
}
""")
但是,问题是如何将 uchar4* 关联到 numpy 数组。我知道可以修改我的内核函数以接受 int* 或 float*,并使其工作。但是我只是想知道如何让上面的内核函数在pycuda中工作。
下面是主机代码。
def gpu_rgb2gray(image):
shape = image.shape
n_rows, n_cols, _ = np.array(shape, dtype=np.int)
image_gray = np.empty((n_rows, n_cols), dtype= np.int)
## HERE is confusing part, how to rearrange image to match unchar4* ??
image = image.reshape(1, -1, 4)
# Get kernel function
rgba2gray = kernel.get_function("rgba_to_greyscale")
# Define block, grid and compute
blockDim = (32, 32, 1) # 1024 threads in total
dx, mx = divmod(shape[1], blockDim[0])
dy, my = divmod(shape[0], blockDim[1])
gridDim = ((dx + (mx>0)), (dy + (my>0)), 1)
# Kernel function
# HERE doesn't work because of mismatch
rgba2gray (
cuda.In(image), cuda.Out(image_gray), n_rows, n_cols,
block=blockDim, grid=gridDim)
return image_gray
有人有什么想法吗?谢谢!
gpuarray
class 原生支持 CUDA 的内置向量类型(包括 uchar4
)。
因此您可以使用正确的内核 dtype 创建 gpuarray 实例,并使用缓冲区将主机映像复制到该 gpuarray,然后使用 gpuarray 作为内核输入参数。作为一个例子(如果我理解你的代码正确的话),像这样的东西应该可以工作:
import pycuda.gpuarray as gpuarray
....
def gpu_rgb2gray(image):
shape = image.shape
image_rgb = gpuarray.empty(shape, dtype=gpuarray.vec.uchar4)
cuda.memcpy_htod(image_rgb.gpudata, image.data)
image_gray = gpuarray.empty(shape, dtype=np.uint8)
# Get kernel function
rgba2gray = kernel.get_function("rgba_to_greyscale")
# Define block, grid and compute
blockDim = (32, 32, 1) # 1024 threads in total
dx, mx = divmod(shape[1], blockDim[0])
dy, my = divmod(shape[0], blockDim[1])
gridDim = ((dx + (mx>0)), (dy + (my>0)), 1)
rgba2gray ( image_rgb, image_gray, np.int32(shape[0]), np.int32(shape[1]), block=blockDim, grid=gridDim)
img_gray = np.array(image_gray.get(), dtype=np.int)
return img_gray
这将获取 32 位无符号整数的图像并将它们复制到 GPU 上的 uchar4
数组,然后将生成的 uchar
数组向上转换回设备上的整数。
我正在编写一个 cuda 内核来将 rgba 图像转换为 pycuda 中的灰度图像,这是 PyCUDA 代码:
import numpy as np
import matplotlib.pyplot as plt
import pycuda.autoinit
import pycuda.driver as cuda
from pycuda.compiler import SourceModule
kernel = SourceModule("""
#include <stdio.h>
__global__ void rgba_to_greyscale(const uchar4* const rgbaImage,
unsigned char* const greyImage,
int numRows, int numCols)
{
int y = threadIdx.y+ blockIdx.y* blockDim.y;
int x = threadIdx.x+ blockIdx.x* blockDim.x;
if (y < numCols && x < numRows) {
int index = numRows*y +x;
uchar4 color = rgbaImage[index];
unsigned char grey = (unsigned char)(0.299f*color.x+ 0.587f*color.y +
0.114f*color.z);
greyImage[index] = grey;
}
}
""")
但是,问题是如何将 uchar4* 关联到 numpy 数组。我知道可以修改我的内核函数以接受 int* 或 float*,并使其工作。但是我只是想知道如何让上面的内核函数在pycuda中工作。
下面是主机代码。
def gpu_rgb2gray(image):
shape = image.shape
n_rows, n_cols, _ = np.array(shape, dtype=np.int)
image_gray = np.empty((n_rows, n_cols), dtype= np.int)
## HERE is confusing part, how to rearrange image to match unchar4* ??
image = image.reshape(1, -1, 4)
# Get kernel function
rgba2gray = kernel.get_function("rgba_to_greyscale")
# Define block, grid and compute
blockDim = (32, 32, 1) # 1024 threads in total
dx, mx = divmod(shape[1], blockDim[0])
dy, my = divmod(shape[0], blockDim[1])
gridDim = ((dx + (mx>0)), (dy + (my>0)), 1)
# Kernel function
# HERE doesn't work because of mismatch
rgba2gray (
cuda.In(image), cuda.Out(image_gray), n_rows, n_cols,
block=blockDim, grid=gridDim)
return image_gray
有人有什么想法吗?谢谢!
gpuarray
class 原生支持 CUDA 的内置向量类型(包括 uchar4
)。
因此您可以使用正确的内核 dtype 创建 gpuarray 实例,并使用缓冲区将主机映像复制到该 gpuarray,然后使用 gpuarray 作为内核输入参数。作为一个例子(如果我理解你的代码正确的话),像这样的东西应该可以工作:
import pycuda.gpuarray as gpuarray
....
def gpu_rgb2gray(image):
shape = image.shape
image_rgb = gpuarray.empty(shape, dtype=gpuarray.vec.uchar4)
cuda.memcpy_htod(image_rgb.gpudata, image.data)
image_gray = gpuarray.empty(shape, dtype=np.uint8)
# Get kernel function
rgba2gray = kernel.get_function("rgba_to_greyscale")
# Define block, grid and compute
blockDim = (32, 32, 1) # 1024 threads in total
dx, mx = divmod(shape[1], blockDim[0])
dy, my = divmod(shape[0], blockDim[1])
gridDim = ((dx + (mx>0)), (dy + (my>0)), 1)
rgba2gray ( image_rgb, image_gray, np.int32(shape[0]), np.int32(shape[1]), block=blockDim, grid=gridDim)
img_gray = np.array(image_gray.get(), dtype=np.int)
return img_gray
这将获取 32 位无符号整数的图像并将它们复制到 GPU 上的 uchar4
数组,然后将生成的 uchar
数组向上转换回设备上的整数。