How to make templated compute kernels in Metal

Question

I have been writing some Metal compute kernels, and I wrote one with the following declaration:

kernel void
myKernel(const device uint32_t *inData          [[buffer(MyKernelIn)]],
         device uint32_t *outData               [[buffer(MyKernelOut)]],
         uint2 gid                              [[thread_position_in_grid]],
         uint2 thread_position_in_threadgroup   [[thread_position_in_threadgroup]],
         uint2 threads_per_threadgroup          [[threads_per_threadgroup]],
         uint2 threadgroup_position_in_grid     [[threadgroup_position_in_grid]])
{ }

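Now I want to write variants of this kernel in which inData has element type uint8_t or float. How should I do that?

Approaches I can think of:

Duplicate the kernel under different names. (Not scalable.)
Pass in a flag and add a switch statement to the kernel that is consulted whenever I read or write any memory location in inData or outData. Any temporary data I create would have to go through the same conversion logic. (This adds a lot of indirection to the kernel code, and I am not sure how it would affect performance.)

Is there a better way? I see that Metal Performance Shaders works on MTLTexture, which specifies a pixelFormat, and that MPS can handle a wide range of data types based on that pixelFormat. Any insight into how that is done?

Thanks!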

Answer 1

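One approach that should work:

Declare inData as a void*.
In the body of the kernel function, call a template function and pass the arguments through. The template function is parameterized on the desired element type and receives inData as a pointer to that type.

You could use an input parameter to choose dynamically which variant of the template function to call. A better approach, though, is probably to pick the variant with a function constant. That way, the choice is compiled in.

So, something like: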

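// Selects the element type of inData: 0 = uint32_t, 1 = uint8_t, anything else = float.
// The host supplies the value when it creates the specialized function.
constant int variant [[function_constant(0)]];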

template<typename T> void
work(const device void *inData,
     device uint32_t *outData,
     uint2 gid,
     uint2 thread_position_in_threadgroup,
     uint2 threads_per_threadgroup,
     uint2 threadgroup_position_in_grid) 
{
    const device T *data = static_cast<const device T*>(inData);
    // ...
}

kernel void
myKernel(const device void *inData              [[buffer(MyKernelIn)]],
         device uint32_t *outData               [[buffer(MyKernelOut)]],
         uint2 gid                              [[thread_position_in_grid]],
         uint2 thread_position_in_threadgroup   [[thread_position_in_threadgroup]],
         uint2 threads_per_threadgroup          [[threads_per_threadgroup]],
         uint2 threadgroup_position_in_grid     [[threadgroup_position_in_grid]]) 
{
    if (variant == 0)
        work<uint32_t>(inData, outData, gid, thread_position_in_threadgroup,
                       threads_per_threadgroup, threadgroup_position_in_grid);
    else if (variant == 1)
        work<uint8_t>(inData, outData, gid, thread_position_in_threadgroup,
                      threads_per_threadgroup, threadgroup_position_in_grid);
    else
        work<float>(inData, outData, gid, thread_position_in_threadgroup,
                    threads_per_threadgroup, threadgroup_position_in_grid);
}
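
If you want to sanity-check the dispatch logic without a GPU, the same pattern can be mimicked in ordinary C++ on the CPU. The following is only a rough sketch under several assumptions, not Metal code: the device address-space qualifiers and thread-position arguments are dropped, `variant` is passed as a plain parameter standing in for the function constant, the `Variant` enum and the `count` parameter are hypothetical names added for illustration, and the body of `work` (widening each element to uint32_t) is invented because the original leaves it as `// ...`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical names for the three variants; the values match the
// 0 / 1 / other convention used by `variant` in the Metal sketch.
enum Variant : int { kUInt32 = 0, kUInt8 = 1, kFloat = 2 };

// Typed worker: views the untyped input as an array of T.
// The loop body is an assumption for illustration only; it converts
// each element to uint32_t (floats are truncated toward zero).
template <typename T>
void work(const void *inData, uint32_t *outData, std::size_t count)
{
    assert(inData != nullptr && outData != nullptr);
    const T *data = static_cast<const T *>(inData);
    for (std::size_t i = 0; i < count; ++i) {
        outData[i] = static_cast<uint32_t>(data[i]);
    }
}

// Untyped entry point: picks the template instantiation from `variant`,
// mirroring the if / else-if chain in the kernel above.
void myKernel(const void *inData, uint32_t *outData,
              std::size_t count, int variant)
{
    if (variant == kUInt32)
        work<uint32_t>(inData, outData, count);
    else if (variant == kUInt8)
        work<uint8_t>(inData, outData, count);
    else
        work<float>(inData, outData, count);
}
```

Calling `myKernel(bytes, out, n, kUInt8)` with a `uint8_t` array reads the input one byte at a time, whereas passing `kUInt32` for the same buffer would read four bytes per element, so the caller is responsible for making `variant` agree with what the buffer actually holds. The same responsibility applies on the Metal side when the function constant is set.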
