How fast or slow is the constant memory that Numba allows a device to allocate, compared to local and shared memory?

I can't find any clear information about the performance of the so-called constant memory mentioned in the Numba documentation:

https://numba.pydata.org/numba-doc/dev/cuda/memory.html#constant-memory

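I'm curious what the size limit of this memory is, how fast or slow it is compared to other memory types, and whether there are any pitfalls in using it.

Thanks!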

This is a general question about constant memory on CUDA-capable devices. You can find information in the official CUDA Programming Guide and here, which states:

There is a total of 64 KB constant memory on a device. The constant memory space is cached. As a result, a read from constant memory costs one memory read from device memory only on a cache miss; otherwise, it just costs one read from the constant cache. Accesses to different addresses by threads within a warp are serialized, thus the cost scales linearly with the number of unique addresses read by all threads within a warp. As such, the constant cache is best when threads in the same warp accesses only a few distinct locations. If all threads of a warp access the same location, then constant memory can be as fast as a register access.
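The two numbers in that quote can be turned into a rough back-of-the-envelope model. The sketch below is plain Python, not Numba or CUDA code; the function names are made up for illustration, and it only counts serialized reads per warp, ignoring cache misses and everything else the hardware does.

```python
WARP_SIZE = 32                      # threads per warp on current CUDA devices
CONSTANT_MEMORY_BYTES = 64 * 1024   # total constant memory quoted above


def serialized_reads(addresses):
    """Relative cost of one warp-wide constant read: one read per distinct address."""
    if len(addresses) > WARP_SIZE:
        raise ValueError("a warp has at most 32 threads")
    return len(set(addresses))


def max_elements(itemsize):
    """How many elements of the given byte size fit in constant memory."""
    return CONSTANT_MEMORY_BYTES // itemsize


broadcast = [0] * WARP_SIZE          # every thread reads the same element
per_thread = list(range(WARP_SIZE))  # every thread reads a different element

print(serialized_reads(broadcast))   # 1  (best case)
print(serialized_reads(per_thread))  # 32 (worst case, fully serialized)
print(max_elements(4))               # 16384 float32 values at most
```

In other words, a lookup table indexed by `threadIdx.x` is the worst case for constant memory, while a table indexed by a value that is uniform across the warp is the best case.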

As for how this compares to other memory types, here is my short answer. You may want to read this page for more details:

Registers: Thread private on-chip read + write memory which can be considered as the fastest memory space on a GPU.
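Local memory: Thread private off-chip read + write memory. Despite its misleading name, it physically resides in the same place as global memory, so its latency is high.
Global memory: The largest memory space, off-chip and high-latency, with global scope and read + write access.
Constant memory: Off-chip but cached read-only memory limited to 64 KB, which can be as fast as a register access if all threads of a warp access the same location.
Shared memory: On-chip, low-latency, read + write memory with limited space per multiprocessor (48 KB to 164 KB, depending on your device's compute capability).
Texture memory: Read-only memory with an on-chip cache optimized for 2D spatial locality, with support for unique features such as hardware filtering.
Pinned (page-locked) memory: Not device memory as such. It is directly accessible from both CPU and GPU code, and is used to maximize and overlap data transfers between the CPU and GPU.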

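These memories have different scopes, lifetimes, and uses. The Numba page you mention in your question explains the basics, but the official CUDA Programming Guide has far more detail. Ultimately, the answer to when to use each memory depends heavily on the application.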

memory-management

cuda

python-3.x

numba

numba-pro