When should texture memory be preferred over constant memory?
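If data is requested very frequently across threads (every thread picks at least one value from a specific column), does storing the data in constant memory have any advantage over using textures on the Pascal architecture?
Edit: This is a split version of this question, to improve community search.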
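If your usage meets the expectations of constant memory, using constant memory is a good idea in the general case. It lets your code take advantage of an additional cache mechanism provided by the GPU hardware, which puts less pressure on the texture cache used by other parts of your code.
Constant memory and its cache, as well as texture and surface memory and their own cache, are defined by the hardware's Compute Capability, so the target hardware should be taken into account. The choice between constant memory and texture memory therefore depends on the access pattern and on cache usage, including cache availability.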
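Constant memory performance depends on broadcasting data among the threads of a warp, so maximum performance is achieved when all threads request exactly the same address and the data is already in the cache. If threads in the same warp request multiple addresses, the request is split into several requests, because only one address can be retrieved per operation. If the number of split requests caused by reading from multiple addresses is too high, texture and surface memory may outperform constant memory in that particular case. This is described in detail in the CUDA Programming Guide: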
The constant memory space resides in device memory and is cached in the constant cache mentioned in Compute Capability 2.x.
A request is then split into as many separate requests as there are
different memory addresses in the initial request, decreasing
throughput by a factor equal to the number of separate requests.
The resulting requests are then serviced at the throughput of the
constant cache in case of a cache hit, or at the throughput of device
memory otherwise.
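To make the quoted rule concrete, here is a minimal sketch of the two access patterns, not a benchmark. The names `c_table`, `uniform_read`, `divergent_read` and `run_constant_demo` are hypothetical and chosen only for illustration, and error checking is omitted for brevity. Both kernels produce correct results; the difference the guide describes is only in how many separate constant-cache requests each warp generates.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical lookup table: 256 floats (1 KiB), well within constant memory.
constexpr int kTableSize = 256;
__constant__ float c_table[kTableSize];

// All threads of a warp read the SAME address: a single broadcast request.
__global__ void uniform_read(float* out, int n, int idx) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = c_table[idx];
}

// Each thread of a warp reads a DIFFERENT address: per the guide, the request
// is split into as many separate requests as there are distinct addresses.
__global__ void divergent_read(float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = c_table[i % kTableSize];
}

// Host helper (illustrative): fills the table with table[k] = k, launches one
// of the kernels and returns the sum of the output values.
float run_constant_demo(bool uniform, int n) {
    std::vector<float> h_table(kTableSize);
    for (int k = 0; k < kTableSize; ++k) h_table[k] = static_cast<float>(k);
    cudaMemcpyToSymbol(c_table, h_table.data(), kTableSize * sizeof(float));

    float* d_out = nullptr;
    cudaMalloc(&d_out, n * sizeof(float));

    const int threads = 128;
    const int blocks = (n + threads - 1) / threads;
    if (uniform) {
        uniform_read<<<blocks, threads>>>(d_out, n, 7);
    } else {
        divergent_read<<<blocks, threads>>>(d_out, n);
    }

    std::vector<float> h_out(n);
    // This blocking copy also waits for the kernel to finish.
    cudaMemcpy(h_out.data(), d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_out);

    double sum = 0.0;
    for (float v : h_out) sum += v;
    return static_cast<float>(sum);
}
```

Calling `run_constant_demo(true, n)` and `run_constant_demo(false, n)` from a `main` and timing the kernels with a profiler is one way to see whether the serialization matters on your hardware; the actual gap depends on the device and is not something this sketch claims.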
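The texture cache is more flexible than the constant cache. It can exploit reads from the same warp whose addresses are close together in 2D. Although texture memory has some advantages over constant memory, in general it should be used when the data access pattern or data size does not fit the requirements of constant memory, or when you want to make use of the texture cache. More detailed information can be found at: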
The texture and surface memory spaces
reside in device memory and are cached in texture cache, so a texture
fetch or surface read costs one memory read from device memory only on
a cache miss, otherwise it just costs one read from texture cache. The
texture cache is optimized for 2D spatial locality, so threads of the
same warp that read texture or surface addresses that are close
together in 2D will achieve best performance. Also, it is designed for
streaming fetches with a constant latency; a cache hit reduces DRAM
bandwidth demand but not fetch latency.
Reading device memory through texture or surface fetching present some
benefits that can make it an advantageous alternative to reading
device memory from global or constant memory:
- If the memory reads do not follow the access patterns that global or
constant memory reads must follow to get good performance, higher
bandwidth can be achieved providing that there is locality in the
texture fetches or surface reads;
- Addressing calculations are performed outside the kernel by dedicated
units;
- Packed data may be broadcast to separate variables in a single
operation;
- 8-bit and 16-bit integer input data may be optionally converted to
32 bit floating-point values in the range [0.0, 1.0] or [-1.0, 1.0]
(see Texture Memory).
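As an illustration of the last bullet and of 2D-local fetches, the following is a minimal sketch using the texture object API. The helper `run_texture_demo` and the kernel `fetch2d` are hypothetical names, the 16x16 block shape is an arbitrary choice, and error checking is omitted. It assumes 8-bit unsigned input read with `cudaReadModeNormalizedFloat`, so each texel arrives in the kernel as a float in [0.0, 1.0].

```cuda
#include <cuda_runtime.h>
#include <vector>

// Each thread fetches one texel. Threads of a 2D block read texels that are
// close together in 2D, the pattern the texture cache is optimized for.
__global__ void fetch2d(cudaTextureObject_t tex, float* out, int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height) {
        // +0.5f addresses the texel centre with unnormalized coordinates.
        out[y * width + x] = tex2D<float>(tex, x + 0.5f, y + 0.5f);
    }
}

// Host helper (illustrative): uploads 8-bit data to a CUDA array, binds it to
// a texture object and returns the values as seen by the kernel.
std::vector<float> run_texture_demo(const std::vector<unsigned char>& src,
                                    int width, int height) {
    cudaChannelFormatDesc desc = cudaCreateChannelDesc<unsigned char>();
    cudaArray_t arr = nullptr;
    cudaMallocArray(&arr, &desc, width, height);
    cudaMemcpy2DToArray(arr, 0, 0, src.data(), width * sizeof(unsigned char),
                        width * sizeof(unsigned char), height,
                        cudaMemcpyHostToDevice);

    cudaResourceDesc res = {};
    res.resType = cudaResourceTypeArray;
    res.res.array.array = arr;

    cudaTextureDesc td = {};
    td.addressMode[0] = cudaAddressModeClamp;
    td.addressMode[1] = cudaAddressModeClamp;
    td.filterMode = cudaFilterModePoint;        // no interpolation
    td.readMode = cudaReadModeNormalizedFloat;  // 8-bit integer -> [0.0, 1.0]
    td.normalizedCoords = 0;                    // address texels by index

    cudaTextureObject_t tex = 0;
    cudaCreateTextureObject(&tex, &res, &td, nullptr);

    float* d_out = nullptr;
    cudaMalloc(&d_out, width * height * sizeof(float));

    dim3 block(16, 16);
    dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
    fetch2d<<<grid, block>>>(tex, d_out, width, height);

    std::vector<float> out(width * height);
    // This blocking copy also waits for the kernel to finish.
    cudaMemcpy(out.data(), d_out, out.size() * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(d_out);
    cudaDestroyTextureObject(tex);
    cudaFreeArray(arr);
    return out;
}
```

With this setup a source byte of 0 reads back as 0.0 and 255 reads back as 1.0; the integer-to-float conversion and the address clamping are done by the texture unit rather than by arithmetic in the kernel.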
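Developers should keep in mind that using texture memory in combination with constant memory has a real advantage over using texture memory alone, because it exploits the dedicated caches of both, and both caches perform better than any data retrieved outside the cache (i.e. from device memory).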