iOS Metal: The fastest way to read read-only data?

Question

Situation: In a Metal kernel function, every thread in a threadgroup reads exactly the same value at a time. Kernel pseudocode:

kernel void foo(device   int2*   ranges,  
                constant float3& readonlyBuffer,  
                device   float*  results,  
                uint lno [[ threadgroup_position_in_grid ]])  
{  
  float acc = 0.0;  

  for(int i=ranges[lno].x; i<ranges[lno].y; i++) {  
    // each thread in threadgroup processes the same value from the buffer  
    acc += process( readonlyBuffer[i] );  
  }  

  results[...] = acc;  
}

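Problem: To optimize the buffer reads, I changed the address space qualifier of readonlyBuffer from device to constant. This had zero effect on kernel performance, even though the Apple documentation says otherwise: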

The constant address space is optimized for multiple instances executing a graphics or kernel function accessing the same location in the buffer.

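Questions:

How can I improve the memory read time for the constant buffer?
Can I move the buffer (or at least part of it) into on-chip cache, something like Constant Buffer Preloading (page 24)?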

Answer 1

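In your example code, indexing into readonlyBuffer will produce a compiler error, because it is declared as a reference (constant float3&) rather than a pointer.

Assuming readonlyBuffer is declared as a pointer, the compiler does not know its size statically and cannot move the data into the constant memory space.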
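
For reference, here is a minimal sketch of the pointer-based variant that the paragraph above assumes. The kernel name fooPointer, the [[buffer(n)]] indices, the body of process, and the choice to have thread 0 write results[lno] are all illustrative assumptions, since the question elides them. It should compile, but the amount of data behind readonlyBuffer is invisible to the compiler:

```metal
#include <metal_stdlib>
using namespace metal;

// Hypothetical stand-in for the question's process(); the real one is not shown.
static float process(float3 v) {
    return v.x + v.y + v.z;
}

kernel void fooPointer(device const int2* ranges         [[buffer(0)]],
                       constant float3*   readonlyBuffer [[buffer(1)]],
                       device float*      results        [[buffer(2)]],
                       uint lno [[threadgroup_position_in_grid]],
                       uint tid [[thread_index_in_threadgroup]])
{
    float acc = 0.0f;

    // Every thread in the threadgroup walks the same index range, so all of
    // them read the same element at each iteration.
    for (int i = ranges[lno].x; i < ranges[lno].y; i++) {
        acc += process(readonlyBuffer[i]);
    }

    // Assumption: one result per threadgroup, written by thread 0 only.
    if (tid == 0) {
        results[lno] = acc;
    }
}
```

Nothing in this signature states how many float3 elements the kernel may touch, which is the information the struct-based version supplies.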

If readonlyBuffer is small (you only have 4KB of constant memory to work with), put it into a struct as follows:

struct ReadonlyBuffer {
    float3 values[MAX_BUFFER_SIZE];
};

Then do:

kernel void foo(device   int2*   ranges,  
                constant ReadonlyBuffer& readonlyBuffer,  
                device   float*  results,  
                uint lno [[ threadgroup_position_in_grid ]])
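
Putting the struct and the signature together, a complete kernel might look like the sketch below. MAX_BUFFER_SIZE = 256, the [[buffer(n)]] indices, the body of process, and the results[lno] write are assumptions made for illustration. The value 256 follows from float3 occupying 16 bytes in Metal, so 256 elements fill the 4KB budget exactly (256 * 16 = 4096).

```metal
#include <metal_stdlib>
using namespace metal;

// Assumed capacity: 256 float3 elements * 16 bytes each = 4096 bytes.
#define MAX_BUFFER_SIZE 256

struct ReadonlyBuffer {
    float3 values[MAX_BUFFER_SIZE];
};

// Compile-time check that the struct fits the 4KB budget from the answer.
static_assert(sizeof(ReadonlyBuffer) <= 4096, "ReadonlyBuffer exceeds 4KB");

// Hypothetical stand-in for the question's process(); the real one is not shown.
static float process(float3 v) {
    return v.x + v.y + v.z;
}

kernel void foo(device const int2*       ranges         [[buffer(0)]],
                constant ReadonlyBuffer& readonlyBuffer [[buffer(1)]],
                device float*            results        [[buffer(2)]],
                uint lno [[threadgroup_position_in_grid]],
                uint tid [[thread_index_in_threadgroup]])
{
    float acc = 0.0f;

    // The ranges are assumed to stay within [0, MAX_BUFFER_SIZE).
    for (int i = ranges[lno].x; i < ranges[lno].y; i++) {
        acc += process(readonlyBuffer.values[i]);
    }

    // Assumption: one result per threadgroup, written by thread 0 only.
    if (tid == 0) {
        results[lno] = acc;
    }
}
```

Because the argument is a reference to a fixed-size struct, the compiler can see its full extent at compile time. Whether the buffer is actually preloaded still has to be confirmed in a GPU trace.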

Finally, run a GPU trace ("Capture GPU Frame") and make sure you do not get the following error:

The Compiler was not able to Preload your Buffer. Kernel Function, Buffer Index: 1.

For details on buffer preloading, see: https://developer.apple.com/videos/play/wwdc2016/606/?time=408

compute-shader

ios

metal

ios10