How to detect possible L0, L1, L2 cache overflow just by looking at the kernel code?

Question

I have an RX 570, and this is the information I get from clGetDeviceInfo:

MaxComputeUnitPerGPU: 32

MaxWorkGroupSize: 256

MaxWorkItemSize: 256

MaxGlobalMemoryOfDevice: 4294967296

MaxPrivateMemoryBytesPerWorkGroup: 16384

MaxLocalMemoryBytesPerWorkGroup: 32768

If I have 256 work-groups, each with 256 work-items, this means:

64 bytes of private (L1?) memory per work-item (16384 / 256)
32768 bytes of local (L2) memory per work-group

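If I use 17 floats, will it spill to L2?

or

If I use 15 floats and 2 private floats, will it spill to L2?

In other words, is float the same as private float? Answer: yes, they are the same by default, per @doqtor.

or

If I use 16 floats and also call pow, sqrt, and clamp, will the registers (L1?) overflow?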

Answer 1

Variables without an address space qualifier are private by default. From the OpenCL docs:

Variables inside a __kernel function not declared with an address space qualifier, all variables inside non-kernel functions, and all function arguments are in the __private or private address space. Variables declared as pointers are considered to point to the __private address space if an address space qualifier is not specified.

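Private variables are stored in the GPU's registers. If a kernel uses more registers than are available, some variables are stored in global memory instead (register spilling).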
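As a rough illustration of that rule, the following C++ sketch does the back-of-the-envelope arithmetic from the question. It assumes, purely for illustration, that the 64-byte per-work-item figure derived in the question is the register budget and that each private float occupies exactly one 4-byte register. The names `kFloatBytes` and `spilled_bytes` are hypothetical. A real compiler also needs registers for temporaries (for example inside pow or sqrt) and may reuse registers, so the actual allocation can only be confirmed from the compiler output, not from this model.

```cpp
#include <cstddef>

// An OpenCL float is 32 bits wide.
constexpr std::size_t kFloatBytes = 4;

// Hypothetical helper (assumed model, not a real OpenCL API):
// returns how many bytes of private floats exceed a per-work-item
// register budget and would therefore have to spill.
constexpr std::size_t spilled_bytes(std::size_t private_floats,
                                    std::size_t budget_bytes) {
    return private_floats * kFloatBytes > budget_bytes
               ? private_floats * kFloatBytes - budget_bytes
               : 0;
}

// Budget taken from the question: 16384 bytes shared by 256 work-items.
constexpr std::size_t kBudgetFromQuestion = 16384 / 256;  // 64 bytes

// 16 floats exactly fill the assumed budget; the 17th does not fit.
static_assert(spilled_bytes(16, kBudgetFromQuestion) == 0, "16 floats fit");
static_assert(spilled_bytes(17, kBudgetFromQuestion) == 4, "17th float spills");
```

Under this simplified model, 15 or 16 floats stay within the budget and 17 floats exceed it by 4 bytes. Whether the hardware actually behaves this way depends on the real register file size and on what the compiler generates.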

Answer 2

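To add to doqtor's answer: if you are bandwidth-limited, you can detect register spilling with a roofline analysis. You can count the FLOPs and memory transfers from the program binary (string binaries = program.getInfo<CL_PROGRAM_BINARIES>()[0];). If you are very close to the bandwidth limit, no spilling is occurring. If, starting from that point, you increase the number of private variables, for example by doing a matrix multiplication in private memory, and performance drops significantly, then registers are spilling: private variables are suddenly being read from global memory, and because you are already at the bandwidth limit, the extra global memory accesses slow the kernel down.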
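The roofline check described here can be sketched in C++ as follows. This is a minimal sketch under stated assumptions: the struct `KernelRun`, the functions `roofline_flops`, `roofline_efficiency`, and `spill_suspected`, and the 20% default threshold are all hypothetical, and the FLOP count, byte count, runtime, and device peaks must be supplied by you (counted from the binary, measured, and taken from the device specification, respectively).

```cpp
#include <algorithm>

// Hypothetical record of one kernel launch; every field is an assumed input.
struct KernelRun {
    double flops;        // floating-point operations counted for the launch
    double bytes_moved;  // intended global-memory traffic in bytes
    double seconds;      // measured runtime
};

// Roofline ceiling in FLOP/s:
// min(compute peak, memory bandwidth * arithmetic intensity).
inline double roofline_flops(const KernelRun& k, double peak_flops,
                             double peak_bw) {
    const double intensity = k.flops / k.bytes_moved;  // FLOP per byte
    return std::min(peak_flops, peak_bw * intensity);
}

// Fraction of the roofline ceiling that the launch actually achieved.
inline double roofline_efficiency(const KernelRun& k, double peak_flops,
                                  double peak_bw) {
    return (k.flops / k.seconds) / roofline_flops(k, peak_flops, peak_bw);
}

// Heuristic from the answer: if adding private variables makes the
// efficiency fall by more than tolerated_drop (assumed 20%), the extra
// time is likely hidden global-memory traffic from register spilling.
inline bool spill_suspected(const KernelRun& before, const KernelRun& after,
                            double peak_flops, double peak_bw,
                            double tolerated_drop = 0.2) {
    const double e0 = roofline_efficiency(before, peak_flops, peak_bw);
    const double e1 = roofline_efficiency(after, peak_flops, peak_bw);
    return e1 < e0 * (1.0 - tolerated_drop);
}
```

For instance, with placeholder peaks of 5e12 FLOP/s and 2e11 bytes/s (illustrative values, not RX 570 specifications), a launch whose runtime doubles after adding private variables while its counted FLOPs and bytes stay the same would be flagged by `spill_suspected`. The heuristic is only meaningful when the baseline run is already close to the bandwidth limit, as the answer points out.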

memory

optimization

caching

memory-management

opencl