Getting the imageDenoising CUDA example to work using a MATLAB CUDAKernel

Question

TL;DR

I'm looking for a way to extract part of an existing CUDA Toolkit example and turn it into a CUDAKernel executable in MATLAB.

The story

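While trying to get a short-runtime implementation of a Non-Local Means (NLM) 2D filter, I stumbled upon the imageDenoising example provided with the CUDA Toolkit, which implements two variants of this filter, termed NLM and NLM2 (or "quick NLM").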

Having no prior experience with CUDA coding, I initially tried to follow MATLAB's documentation on the subject, which resulted in several strange errors involving PTX compilation, multiple entry points, and a wrong number of inputs in the C prototype. At that point I realized this was not going to be a "just works" case and that some tinkering was required.

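So I decided to get rid of the multiple entry points by simply deleting parts of the imageDenoising.cu file and merging the relevant .cuh (..._nlm_kernel.cuh or ..._nlm2_kernel.cuh) into the .cu, so that there is a single entry point at any given time. To my surprise, this actually compiled successfully, and I was finally able to create a CUDAKernel without errors (using the command k = parallel.gpu.CUDAKernel('imageDenoising.ptx', 'uint8_T *, int, int, float, float');).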

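This, however, was not enough. I had wrongly assumed that the first argument was the unprocessed image in RGB matrix form (i.e. X*Y*3 uint8), so the returned result was exactly the input, except with zeros in the first 4 elements.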
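For illustration, part of the mismatch is a memory-layout one: MATLAB stores an X*Y*3 uint8 array column-major, one colour plane after another, whereas the sample's kernels appear to work on interleaved 4-byte pixels (an assumption, based on the sample's pixel type being uchar4). A minimal sketch of the repacking in plain C++, where the helper name packRGBA is hypothetical and not part of the sample or of MATLAB:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: repack a MATLAB-style rows x cols x 3 uint8 array
// (column-major, one colour plane after another) into interleaved RGBA
// pixels stored row by row. The alpha byte is left at 0.
std::vector<std::uint8_t> packRGBA(const std::vector<std::uint8_t>& planar,
                                   std::size_t rows, std::size_t cols)
{
    std::vector<std::uint8_t> rgba(rows * cols * 4, 0);
    const std::size_t plane = rows * cols;  // elements per colour plane
    for (std::size_t r = 0; r < rows; ++r) {
        for (std::size_t c = 0; c < cols; ++c) {
            const std::size_t srcIdx = c * rows + r;        // column-major index
            const std::size_t dstIdx = (r * cols + c) * 4;  // row-major, 4 bytes per pixel
            for (std::size_t ch = 0; ch < 3; ++ch)
                rgba[dstIdx + ch] = planar[ch * plane + srcIdx];
        }
    }
    return rgba;
}
```

Calling packRGBA(planar, rows, cols) on the raw bytes of the MATLAB array yields a buffer of rows*cols*4 bytes. Whether the kernel then expects that buffer as an argument or through some other mechanism is a separate matter that depends on how the kernel reads its input.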

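After some further searching, I realized that there were other critical aspects of such a conversion that I was completely unaware of (e.g. the need to initialize __device__ variables), and at this stage I decided to ask for help.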

The problem

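I'm currently wondering how to proceed efficiently from here. I would love to know whether this approach can generally be fruitful (and whether a complete example of the process is available somewhere), what other pitfalls I should look out for, and what alternative courses of action I could take (considering my very limited knowledge of CUDA and the fact that I won't be hiring anybody to do this for me). But I remember that this is SO, so I must have a specific programming question, and here it is: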

How do I modify imageDenoising.cu such that the MATLAB CUDAKernel constructed from it will also accept the unprocessed image as an input?

Note: in my application the input is a 2D grayscale matrix of type double.

Related: How CudaMalloc work?

P.S.

A working piece of code would obviously be welcome, but I would really prefer to "learn to fish".

Answer 1

I ended up taking an approach other than CUDAKernel, using a MEX file, by doing the following:

Setting up the external libraries OpenCV v2.4.10 (not v3!) and mexopencv.
Writing a small wrapper function for OpenCV's fastNlMeansDenoising using the guidelines of mexopencv for unimplemented functions, as follows (documentation excluded):

#include "mexopencv.hpp"
using namespace cv;

void mexFunction(int nlhs, mxArray *plhs[],
                 int nrhs, const mxArray *prhs[])
{
    // Check arguments
    if (nlhs != 1 || nrhs<1 || ((nrhs % 2) != 1) )
        mexErrMsgIdAndTxt("fastNLM:invalidArgs", "Wrong number of arguments");

    // Argument vector  
    vector<MxArray> rhs(prhs, prhs + nrhs);

    // Option processing
      // Defaults:
    double h = 3;
    int templateWindowSize = 7;
    int searchWindowSize = 21;
      // Parsing input name-value pairs:
    for (int i = 1; i<nrhs; i += 2) {
        string key = rhs[i].toString();
        if (key == "h")
            h = rhs[i + 1].toDouble();
        else if (key == "templateWindowSize")
            templateWindowSize = rhs[i + 1].toInt();
        else if (key == "searchWindowSize")
            searchWindowSize = rhs[i + 1].toInt();
        else
            mexErrMsgIdAndTxt("mexopencv:error", "Unrecognized option");
    }

    // Process
    Mat src(rhs[0].toMat()), dst;
    fastNlMeansDenoising(src, dst, h, templateWindowSize, searchWindowSize);

    // Convert cv::Mat back to mxArray*
    plhs[0] = MxArray(dst);
}

Compile it... and voilà: a working CUDA-accelerated NLM filter.

The answer to my question per se can be found by comparing opencv\sources\modules\photo\src\cuda\nlm.cu (this is the opencv2 path) with imageDenoising_nlm2_kernel.cuh.
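As far as I can tell, the difference that matters for this question is how the source image reaches the kernel: as an explicit argument, or through state that the host sets up separately. The following is only a sketch of the "everything is an argument" shape, written as a plain C++ CPU reference rather than CUDA. The name nlmReference, its parameters, and the weighting formula are assumptions made for illustration and are not taken from either file.

```cpp
#include <algorithm>
#include <cmath>

// Read a pixel from a row-major image, clamping coordinates to the border.
static float pixelAt(const float* img, int w, int h, int x, int y)
{
    x = std::min(std::max(x, 0), w - 1);
    y = std::min(std::max(y, 0), h - 1);
    return img[y * w + x];
}

// Hypothetical CPU reference of a basic non-local means filter. Every input,
// including the source image, is passed explicitly; nothing is read from
// global state.
void nlmReference(const float* src, float* dst, int w, int h,
                  int searchRadius, int patchRadius, float hParam)
{
    const int side = 2 * patchRadius + 1;
    const float patchArea = static_cast<float>(side * side);
    for (int y = 0; y < h; ++y) {
        for (int x = 0; x < w; ++x) {
            float weightSum = 0.0f, valueSum = 0.0f;
            for (int dy = -searchRadius; dy <= searchRadius; ++dy) {
                for (int dx = -searchRadius; dx <= searchRadius; ++dx) {
                    // Mean squared difference between the patch around (x, y)
                    // and the patch around (x + dx, y + dy)
                    float dist = 0.0f;
                    for (int py = -patchRadius; py <= patchRadius; ++py) {
                        for (int px = -patchRadius; px <= patchRadius; ++px) {
                            const float d =
                                pixelAt(src, w, h, x + px, y + py) -
                                pixelAt(src, w, h, x + dx + px, y + dy + py);
                            dist += d * d;
                        }
                    }
                    dist /= patchArea;
                    // Similar patches get weights close to 1
                    const float wgt = std::exp(-dist / (hParam * hParam));
                    weightSum += wgt;
                    valueSum += wgt * pixelAt(src, w, h, x + dx, y + dy);
                }
            }
            // weightSum >= 1 because the centre patch matches itself exactly
            dst[y * w + x] = valueSum / weightSum;
        }
    }
}
```

A kernel with this kind of signature maps more directly onto a CUDAKernel prototype, since both the input and the output buffers appear in the argument list.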

This solution worked well for me because getting an NLM filter running mattered more to me than using a CUDAKernel.

The main lesson I took from this (and would like to pass on to others) is:

Running CUDA code in MATLAB can also be done in ways other than CUDAKernel, such as using .mex wrappers as shown above.
