Cache behaviour in Compute Capability 7.5

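These are my assumptions:

1. There are two types of loads, cached and uncached. For the first, traffic goes through L1 and L2; for the second, traffic goes only through L2.
2. The default behaviour in Compute Capability 6.x and 7.x is cached access.
3. An L1 cache line is 128 bytes and an L2 cache line is 32 bytes, so for every L1 transaction generated there should be four L2 transactions (one per sector).
4. In Nsight, an SM->TEX Request is a warp-level instruction coalesced from 32 threads. L2->TEX Returns and TEX->SM Returns measure how many sectors are transferred between each pair of memory units.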

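Assuming Compute Capability 7.5, these are my questions:

1. The third assumption seems to imply that L2->TEX Returns should always be a multiple of four for global cached loads, but that's not always the case. What is happening here?
2. Is there still a point in marking pointers with const and __restrict__ qualifiers? That used to be a hint to the compiler that the data is read-only and therefore can be cached in L1/texture cache, but now all data is cached there, both read-only and not read-only.
3. From my fourth assumption, I would think that whenever TEX->SM Returns is greater than L2->TEX Returns, the difference comes from cache hits. That's because when there's a cache hit, you get some sectors read from L1, but none from L2. Is this true?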

CC 6.x/7.x

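An L1 cache line is 128 bytes, divided into four 32-byte sectors. On a miss, only the addressed sectors are fetched from L2.
An L2 cache line is 128 bytes, divided into four 32-byte sectors.
- CC 7.0 (HBM): 64B promotion is enabled. If a miss occurs in the lower 64 bytes of the cache line, the lower 64 bytes are fetched from DRAM. If the miss is in the upper 64 bytes, the upper 64 bytes are fetched.
- CC 6.x/7.5: only the accessed 32B sectors are fetched from DRAM.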
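
To make the difference in DRAM fetch granularity concrete, here is a small host-side C++ sketch. It is only a simplified model of the two rules listed above, not vendor code; the function name `dram_bytes_fetched` and the bitmask encoding of missed sectors are my own assumptions for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Simplified model (an assumption, not vendor code): bytes fetched from DRAM
// for one 128-byte L2 line, given which of its four 32-byte sectors missed.
// Bit i of missed_mask set means sector i (bytes 32*i .. 32*i+31) missed.
inline int dram_bytes_fetched(std::uint8_t missed_mask, bool promote_64b) {
    assert((missed_mask & 0xF0) == 0);  // only four sectors per line
    const int kSectorBytes = 32;
    if (!promote_64b) {
        // CC 6.x / 7.5 rule: only the missed sectors are fetched.
        int count = 0;
        for (int i = 0; i < 4; ++i) count += (missed_mask >> i) & 1;
        return count * kSectorBytes;
    }
    // CC 7.0 (HBM) rule: a miss in either 64-byte half fetches that whole half.
    int bytes = 0;
    if (missed_mask & 0x3) bytes += 64;  // sectors 0-1 (lower half)
    if (missed_mask & 0xC) bytes += 64;  // sectors 2-3 (upper half)
    return bytes;
}
```

For a miss on sector 0 only, this model gives 32 bytes without promotion and 64 bytes with it.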
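In terms of L1 cache policy:
- CC 6.0 has load caching enabled by default.
- CC 6.x (other than 6.0) has load caching disabled by default; see the programming guide.
- CC 7.x has load caching enabled by default; see the PTX documentation for details on cache control.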

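In Nsight Compute, the meaning of the term "request" differs between 6.x and 7.x.

For 5.x-6.x, the number of requests per instruction varies with the type of operation and the width of the data. For example, a 32-bit load is 8 threads/request, a 64-bit load is 4 threads/request, and a 128-bit load is 2 threads/request.
For 7.x, requests should be equivalent to instructions, unless the access pattern has address divergence that causes serialization.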
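
As a worked version of the 5.x-6.x numbers, the host-side C++ sketch below turns the threads/request figures into requests per warp-wide load. The function names are hypothetical, it assumes all 32 threads of the warp are active, and it covers only the three widths quoted above.

```cpp
#include <cassert>

// Threads served per request on CC 5.x-6.x, for the three access widths
// quoted above. Other widths are not covered by the text, so they are rejected.
inline int threads_per_request_cc5x_6x(int access_bits) {
    assert(access_bits == 32 || access_bits == 64 || access_bits == 128);
    switch (access_bits) {
        case 32:  return 8;
        case 64:  return 4;
        default:  return 2;  // 128-bit
    }
}

// Requests generated by one warp-wide load with all 32 threads active
// (assumption: a full warp; partially active warps are not modelled here).
inline int requests_per_warp_load_cc5x_6x(int access_bits) {
    const int kWarpSize = 32;
    return kWarpSize / threads_per_request_cc5x_6x(access_bits);
}
```

Under these assumptions a full-warp 32-bit load counts as 4 requests on 5.x-6.x, whereas on 7.x the same instruction would normally count as one request.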

Answering your CC 7.5 questions

The third assumption seems to imply that L2->TEX Returns should always be a multiple of four for global cached loads, but that's not always the case. What is happening here?

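The L1TEX unit will only fetch the missed 32B sectors in a cache line, so the number of sectors returned from L2 does not have to be a multiple of four.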
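
A rough way to see this is to count the distinct 32B sectors a warp touches. The host-side C++ sketch below does that; it assumes a cold L1 (every touched sector misses), naturally aligned accesses, and uses hypothetical helper names (`sectors_touched`, `warp_addresses`) that are not part of any NVIDIA API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>
#include <vector>

// Counts the distinct 32-byte sectors touched by the active threads of one warp.
// `addresses` holds one byte address per active thread; `access_bytes` is the
// per-thread access size. Assumes a power-of-two size and natural alignment,
// so a single access never straddles a sector boundary.
inline std::size_t sectors_touched(const std::vector<std::uint64_t>& addresses,
                                   std::uint64_t access_bytes) {
    const std::uint64_t kSectorBytes = 32;
    assert(access_bytes > 0 && access_bytes <= kSectorBytes);
    assert((access_bytes & (access_bytes - 1)) == 0);
    std::set<std::uint64_t> sectors;
    for (std::uint64_t a : addresses) {
        assert(a % access_bytes == 0);
        sectors.insert(a / kSectorBytes);
    }
    return sectors.size();
}

// Helper: addresses for `active` threads reading 4-byte elements starting at
// byte address `base`, with `stride_elems` elements between neighbouring threads.
inline std::vector<std::uint64_t> warp_addresses(std::uint64_t base, int active,
                                                 std::uint64_t stride_elems) {
    std::vector<std::uint64_t> out;
    for (int t = 0; t < active; ++t) out.push_back(base + 4 * stride_elems * t);
    return out;
}
```

In this model a full warp reading consecutive floats from an aligned base touches 4 sectors, but 24 active threads touch 3, and a base offset of 4 bytes makes the full warp touch 5.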

Is there still a point in marking pointers with const and __restrict__ qualifiers? That used to be a hint to the compiler that the data is read-only and therefore can be cached in L1/texture cache, but now all data is cached there, both read-only and not read-only.

The compiler can perform additional optimizations if the data is known to be read-only.
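
One example of such an optimization, sketched here as plain host C++ so it can be compiled without a GPU: when the input is marked read-only and non-aliasing, the compiler is free to load a value once and keep it in a register. The function `scale_by_first` is a made-up illustration; the same qualifiers can be put on kernel pointer parameters, and whether a particular compiler actually applies the optimization is not guaranteed.

```cpp
#include <cassert>
#include <cstddef>

// `__restrict__` is a compiler extension (GCC, Clang, nvcc), not standard C++.
// The qualifiers promise that `in` is only read and that `in` and `out`
// never overlap, so the compiler may keep in[0] in a register instead of
// reloading it after every store to out[i].
inline void scale_by_first(const float* __restrict__ in,
                           float* __restrict__ out, std::size_t n) {
    assert(in != out);  // overlapping buffers would break the promise above
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = in[i] * in[0];
    }
}
```

Without the qualifiers, a store to out[i] could legally modify in[0], so the compiler would have to assume the value may change on every iteration.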

From my fourth assumption, I would think that whenever TEX->SM Returns is greater than L2->TEX Returns, the difference comes from cache hits. That's because when there's a cache hit, you get some sectors read from L1, but none from L2. Is this true?

Not necessarily, because the two counters use different units. L1TEX to SM return bandwidth is 128B/cycle. L2 to SM return bandwidth is in 32B sectors.

The Nsight Compute Memory Workload Analysis | L1/TEX Cache table shows:

- Sector Misses to L2 (32B sectors)
- Returns to SM (cycles; each cycle carries 1-128B)
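
Because the units differ, the two numbers cannot be subtracted to get a hit count. The host-side C++ sketch below illustrates this with two hypothetical helpers: since a return cycle carries at most 128B, `min_return_cycles` is only a lower bound on the cycles needed, and `aligned_sectors` assumes a contiguous, 32B-aligned region.

```cpp
#include <cassert>

// Lower bound on L1TEX->SM return cycles for `bytes` of data, given that
// one return cycle carries at most 128 bytes. Real hardware may use more.
inline int min_return_cycles(int bytes) {
    assert(bytes >= 0);
    return (bytes + 127) / 128;
}

// Number of 32-byte sectors covering `bytes` of contiguous data that starts
// on a sector boundary (assumption: contiguous and aligned).
inline int aligned_sectors(int bytes) {
    assert(bytes >= 0);
    return (bytes + 31) / 32;
}
```

For a fully coalesced 32-bit load by a full warp (128 bytes) that misses entirely in L1, this gives 4 sectors from L2 but as few as 1 return cycle to the SM, so the sector count can exceed the return count even with no cache hits at all.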

caching

cuda

gpgpu

nsight

compute-capability