How is AMD's micro-tagged L1 data cache accessed?

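I am learning how the L1 cache of AMD processors is accessed. I have read AMD's manual repeatedly, but I still can't understand it.

My understanding of Intel's L1 data cache is:
The L1 cache is virtually indexed and physically tagged. The index bits of the virtual address are therefore used to find the corresponding cache set, and the tag then determines which cache line in that set is the right one.
(Intel makes its L1d caches associative enough and small enough that the index bits come only from the offset within the page, which is the same in the physical address. So they get the speed of VIPT with none of the aliasing problems, behaving like PIPT.)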

But AMD uses a new method. Zen 1 has a 32 KB, 8-way set-associative L1d cache, which (unlike the 64 KB 4-way L1i) is small enough to avoid aliasing problems even without microtags.
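The "small enough" condition is simple arithmetic: a VIPT cache has no aliasing problem when the bytes per way (cache size divided by associativity) do not exceed the page size, because then every index bit lies inside the page offset. A minimal sketch, assuming 4 KiB pages; the function name is made up for illustration:

```python
def index_bits_above_page_offset(size_bytes, ways, page_bytes=4096):
    """Number of set-index bits that lie above the page offset.

    0 means every index bit is identical in the linear and the physical
    address, so the VIPT cache behaves like PIPT.
    """
    bytes_per_way = size_bytes // ways                 # sets * line size
    index_plus_line_offset_bits = bytes_per_way.bit_length() - 1
    page_offset_bits = page_bytes.bit_length() - 1
    return max(0, index_plus_line_offset_bits - page_offset_bits)

l1d = index_bits_above_page_offset(32 * 1024, 8)   # Zen 1 L1d: 32 KiB, 8-way
l1i = index_bits_above_page_offset(64 * 1024, 4)   # Zen 1 L1i: 64 KiB, 4-way
print(l1d, l1i)
```

With these parameters the L1d needs no index bits above the page offset, while the L1i needs two.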
From AMD's 2017 Software Optimization Manual, Section 2.6.2.2, "Microarchitecture of AMD Family 17h Processors" (Zen 1):

The L1 data cache tags contain a linear-address-based microtag (utag) that tags each cacheline with the linear address that was used to access the cacheline initially. Loads use this utag to determine which way of the cache to read using their linear address, which is available before the load's physical address has been determined via the TLB. The utag is a hash of the load's linear address. This linear address based lookup enables a very accurate prediction of in which way the cacheline is located prior to a read of the cache data. This allows a load to read just a single cache way, instead of all 8. This saves power and reduces bank conflicts.

It is possible for the utag to be wrong in both directions: it can predict hit when the access will miss, and it can predict miss when the access could have hit. In either case, a fill request to the L2 cache is initiated and the utag is updated when L2 responds to the fill request.

Linear aliasing occurs when two different linear addresses are mapped to the same physical address. This can cause performance penalties for loads and stores to the aliased cachelines. A load to an address that is valid in the L1 DC but under a different linear alias will see an L1 DC miss, which requires an L2 cache request to be made. The latency will generally be no larger than that of an L2 cache hit. However, if multiple aliased loads or stores are in-flight simultaneously, they each may experience L1 DC misses as they update the utag with a particular linear address and remove another linear address from being able to access the cacheline.

It is also possible for two different linear addresses that are NOT aliased to the same physical address to conflict in the utag, if they have the same linear hash. At a given L1 DC index (11:6), only one cacheline with a given linear hash is accessible at any time; any cachelines with matching linear hashes are marked invalid in the utag and are not accessible.

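It is possible for the utag to be wrong in both directions

What specific scenario does this sentence in the second paragraph describe? Under what circumstances is a hit predicted as a miss, and a miss as a hit? When the CPU brings data from memory into the cache, it calculates a cache way based on the utag. Is the line simply placed there, even if other cache ways are empty?

Linear aliasing occurs when two different linear addresses are mapped to the same physical address.

How can different linear addresses map to the same physical address?

However, if multiple aliased loads or stores are in-flight simultaneously, they each may experience L1 DC misses as they update the utag with a particular linear address and remove another linear address from being able to access the cacheline.

What does this sentence mean? My understanding is that the utag is first calculated from the linear (virtual) address to determine which cache way to use, and then the tag field of the physical address is used to determine whether the access is a cache hit. Is that right? How is the utag updated? Is it recorded in the cache?

any cachelines with matching linear hashes are marked invalid in the utag and are not accessible. What does this sentence mean?

How does AMD determine whether an access is a cache hit or a miss? Why are some hits treated as misses? Can someone explain? Many thanks!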

The L1 data cache tags contain a linear-address-based microtag (utag) that tags each cacheline with the linear address that was used to access the cacheline initially.

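Every cache line in the L1D has an associated utag. This implies that the utag memory structure is organized exactly like the L1D (i.e., 8 ways and 64 sets) and that there is a one-to-one correspondence between the entries. The utag is calculated from the linear address of the request that caused the line to be filled into the L1D.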

Loads use this utag to determine which way of the cache to read using their linear address, which is available before the load's physical address has been determined via the TLB.

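The linear address of a load is sent simultaneously to the way predictor and the TLB (it's better to use the term MMU, since there are multiple TLBs). A particular set in the utag memory is selected using certain bits of the linear address (11:6), and all 8 utags in that set are read at the same time. Meanwhile, a utag is calculated from the linear address of the load request. When both of these operations complete, the computed utag is compared against all the utags stored in the set. The utag memory is maintained such that there can be at most one utag with a given value in each set. In case of a hit in the utag memory, the way predictor predicts that the target cache line is in the corresponding cache entry in the L1D. Up to this point, the physical address is not yet needed.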
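The lookup just described can be modelled in a few lines. This is only a software sketch: the hardware compares all 8 stored utags in parallel rather than in a loop, and the names `set_index`, `predict_way` and `utag_memory` are made up for illustration.

```python
SETS, WAYS = 64, 8

def set_index(linear_addr):
    """Bits 11:6 of the linear address select one of the 64 sets."""
    return (linear_addr >> 6) & (SETS - 1)

def predict_way(utag_memory, linear_addr, load_utag):
    """Return the predicted way, or None for a predicted miss.

    utag_memory[set][way] holds a stored utag, or None for an invalid
    entry. At most one entry per set may hold a given utag value.
    """
    stored = utag_memory[set_index(linear_addr)]
    for way, utag in enumerate(stored):
        if utag is not None and utag == load_utag:
            return way
    return None

utag_memory = [[None] * WAYS for _ in range(SETS)]
# Pretend an earlier fill stored utag 0xA7 in way 5 of the matching set.
utag_memory[set_index(0x7F001240)][5] = 0xA7

hit = predict_way(utag_memory, 0x7F001240, 0xA7)    # matching utag
miss = predict_way(utag_memory, 0x7F001240, 0x11)   # no matching utag
print(hit, miss)
```

Note that no physical address appears anywhere in this step.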

The utag is a hash of the load's linear address.

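The hash function was reverse-engineered for a number of microarchitectures in Section 3 of the paper titled Take A Way: Exploring the Security Implications of AMD's Cache Way Predictors. Basically, certain bits of the linear address at positions 27:12 are XORed with each other to produce an 8-bit value, which is the utag. A good hash function should (1) minimize the number of linear address pairs that map to the same utag, (2) minimize the size of the utag, and (3) have a latency no larger than the utag memory access latency.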
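To make the idea concrete, here is a minimal sketch of such an XOR-folding hash. The specific pairing used here (bit 12+i with bit 20+i) is an assumption chosen for illustration only; the actual pairings are microarchitecture-specific and are given in the paper.

```python
def utag_hash(linear_addr):
    """Fold linear-address bits 27:12 into an 8-bit utag by XOR.

    ASSUMPTION: bit 12+i is XORed with bit 20+i. The real pairing
    differs per microarchitecture; see the paper for the actual functions.
    """
    low = (linear_addr >> 12) & 0xFF     # bits 19:12
    high = (linear_addr >> 20) & 0xFF    # bits 27:20
    return low ^ high

a = 0x0000_1000    # only bit 12 set
b = 0x0010_0000    # only bit 20 set
print(hex(utag_hash(a)), hex(utag_hash(b)))
```

Under this assumed pairing, `a` and `b` are different linear addresses with the same set index (bits 11:6 are zero in both) and the same utag, so they would conflict in the way predictor even if they map to different physical lines.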

This linear address based lookup enables a very accurate prediction of in which way the cacheline is located prior to a read of the cache data. This allows a load to read just a single cache way, instead of all 8. This saves power and reduces bank conflicts.

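In addition to the utag memory and its associated logic, the L1D also includes a tag memory and a data memory, all with the same organization. The tag memory stores physical tags (bit 6 up to the highest bit of the physical address). The data memory stores cache lines. In case of a hit in the utag memory, the way predictor reads only one entry, in the corresponding way, from the tag memory and the data memory. The size of a physical address is more than 35 bits on modern x86 processors, so the size of a physical tag is more than 29 bits. This is more than 3x the size of a utag. Without way prediction, in a cache with more than one way, multiple tags would have to be read and compared in parallel. In an 8-way cache, reading and comparing 1 tag consumes much less energy than reading and comparing 8 tags.

In a cache where each way can be activated separately, each cache entry has its own wordline, which is shorter than a wordline shared across multiple cache ways. Due to signal propagation delays, reading a single way takes less time than reading 8 ways. In a parallel-access cache, on the other hand, there is no way-prediction delay, but linear address translation ends up on the critical path of the load latency. With way prediction, the data from the predicted entry can be speculatively forwarded to dependent uops. This can provide a significant load latency advantage, especially since linear address translation latency can vary due to the multi-level design of the MMU, even in the typical case of an MMU hit. The downside is that it introduces a new reason why replays may occur: in case of a misprediction, tens or even hundreds of uops may need to be replayed. I don't know whether AMD actually forwards the requested data before validating the prediction, but it's possible even though the manual doesn't mention it.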

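Reduction of bank conflicts is another advantage of way prediction mentioned in the manual. This implies that different ways are placed in different banks. Section 2.6.2.1 says that bits 5:2 of the address, the size of the access, and the cache way number determine the banks to be accessed. This suggests there are 16*8 = 128 banks, one bank for each 4-byte chunk in each way. Bits 5:2 are obtained from the linear address of the load, the size of the load is obtained from the load uop, and the way number is obtained from the way predictor. Section 2.6.2 says that the L1D supports two 16-byte loads and one 16-byte store in the same cycle. This suggests that each bank has a single 16-byte read-write port. Each of the 128 bank ports is connected through an interconnect to each of the 3 ports of the data memory of the L1D. One of the 3 ports is connected to the store buffer and the other two are connected to the load buffer, possibly with intermediary logic for efficiently handling cross-line loads (a single load uop but two load requests whose results are merged), overlapping loads (to avoid bank conflicts), and loads that cross bank boundaries.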
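Under that 128-bank reading, the set of banks an access touches follows directly from the way number, bits 5:2 and the access size. The sketch below is one possible model, not AMD's documented layout; the bank numbering `way * 16 + chunk` and the function name are assumptions.

```python
CHUNKS_PER_LINE = 16   # 64-byte line split into 4-byte chunks

def banks_touched(linear_addr, size, way):
    """Banks read by an access that stays inside one cache line.

    ASSUMPTION: bank id = way * 16 + chunk, where chunk is address bits 5:2.
    """
    offset = linear_addr & 0x3F
    if offset + size > 64:
        raise ValueError("a cross-line access is split into two requests")
    first = offset >> 2
    last = (offset + size - 1) >> 2
    return {way * CHUNKS_PER_LINE + chunk for chunk in range(first, last + 1)}

load_a = banks_touched(0x1008, 16, way=3)   # chunks 2..5 of way 3
load_b = banks_touched(0x2010, 8, way=3)    # chunks 4..5 of way 3
load_c = banks_touched(0x3008, 4, way=5)    # chunk 2 of way 5
print(sorted(load_a), bool(load_a & load_b), bool(load_a & load_c))
```

In this model `load_a` and `load_b` target different lines but share banks because they are predicted to the same way and overlap in bits 5:2, whereas `load_c` uses the same chunk as `load_a` in a different way and does not conflict.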

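The fact that way prediction requires accessing only a single way in the tag memory and the data memory of the L1D makes it possible to reduce or completely eliminate the need (depending on how snoops are handled) to make the tag and data memories truly multiported (which is the approach Intel followed in Haswell), while still achieving about the same throughput. Bank conflicts can still occur, though, when there are simultaneous accesses to the same way and identical 5:2 address bits, but different utags. Way prediction does reduce bank conflicts because it doesn't require reading multiple entries for each access (at least in the tag memory, but possibly also in the data memory), but it doesn't completely eliminate them.

That said, the tag memory may require true multiporting to handle fill checks (see later), validation checks (see later), snooping, and "normal path" checks for non-load accesses. I think only load requests use the way predictor. Other types of requests are handled normally.

A highly accurate L1D hit/miss prediction can have other benefits too. If a load is predicted to miss in the L1D, the scheduler wakeup signal for dependent uops can be suppressed to avoid likely replays. In addition, the physical address, as soon as it's available, can be sent early to the L2 cache before the prediction is fully resolved. I don't know whether AMD employs these optimizations.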

It is possible for the utag to be wrong in both directions: it can predict hit when the access will miss, and it can predict miss when the access could have hit. In either case, a fill request to the L2 cache is initiated and the utag is updated when L2 responds to the fill request.

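On an OS that supports multiple linear address spaces or allows synonyms in the same address space, cache lines can only be identified uniquely using physical addresses. As mentioned earlier, when looking up a utag in the utag memory, there can be either one hit or zero hits. Consider first the hit case. This linear-address-based lookup results in a speculative hit that still needs to be verified. Even if paging is disabled, a utag is still not a unique substitute for a full address. As soon as the MMU provides the physical address, the prediction can be validated by comparing the physical tag from the predicted way with the tag from the physical address of the access. One of the following cases can occur: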

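1. The physical tags match and the speculative hit is deemed a true hit. Nothing needs to be done, except possibly triggering a prefetch or updating the replacement state of the line.
2. The physical tags don't match and the target line doesn't exist in any of the other entries of the same set. Note that the target line cannot possibly exist in other sets because all of the L1D memories use the same set indexing function. I'll discuss how this is handled later.
3. The physical tags don't match and the target line does exist in another entry of the same set (associated with a different utag). I'll discuss how this is handled later.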

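If no matching utag is found in the utag memory, there is no physical tag to compare against because no way is predicted. One of the following cases can occur: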

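4. The target line actually doesn't exist in the L1D, so the speculative miss is a true miss. The line has to be fetched from somewhere else.
5. The target line actually exists in the same set but with a different utag. I'll discuss how this is handled later.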

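(I'm making two simplifications here. First, the load request is assumed to be to cacheable memory. Second, on a speculative or true hit in the L1D, no errors are detected in the data. I'm trying to stay focused on Section 2.6.2.2.)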
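The five cases can be summarized as a small decision function. This is a software oracle for illustration only: it looks at every physical tag in the set up front, which the hardware does not do at prediction time. The name `classify` and the `(utag, ptag)` tuple layout are made up.

```python
def classify(set_entries, load_utag, load_ptag):
    """Return the case number (1 to 5) for a load to one L1D set.

    set_entries: one (utag, ptag) tuple per way, or None for an invalid way.
    """
    predicted = next((e for e in set_entries
                      if e is not None and e[0] == load_utag), None)
    line_present = any(e is not None and e[1] == load_ptag
                       for e in set_entries)
    if predicted is not None:            # utag hit: speculative hit
        if predicted[1] == load_ptag:
            return 1                     # true hit
        return 3 if line_present else 2
    return 5 if line_present else 4      # utag miss: speculative miss

ways = [(0xA7, 0x100), (0x11, 0x200), None, None]
cases = [classify(ways, 0xA7, 0x100),    # utag and physical tag both match
         classify(ways, 0xA7, 0x300),    # utag conflict, line absent
         classify(ways, 0xA7, 0x200),    # utag conflict, line in another way
         classify(ways, 0x55, 0x300),    # no utag match, line absent
         classify(ways, 0x55, 0x200)]    # no utag match, line present
print(cases)
```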

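Accessing the L2 is needed only in cases 2 and 4 and not in cases 3 and 5, because in cases 3 and 5 the line is already present in the L1D. The only way to determine which case applies is to compare the physical tag of the load with the physical tags of all lines present in the same set. This can be done either before or after accessing the L2. Either way, it has to be done to avoid the possibility of having multiple copies of the same line in the L1D. Doing the checks before accessing the L2 improves the latency in cases 3 and 5, but hurts it in cases 2 and 4. Doing the checks after accessing the L2 improves the latency in cases 2 and 4, but hurts it in cases 3 and 5. It's possible to both perform the checks and send a request to the L2 at the same time, but this may waste energy and L2 bandwidth in cases 3 and 5. It seems that AMD decided to do the checks after the line is fetched from the L2 (which is inclusive of the L1 caches).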

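When the line arrives from the L2, the L1D doesn't have to wait until the line is filled before responding with the requested data, so a higher fill latency is tolerable. The physical tags are now compared to determine which of the 4 cases has occurred. In case 4, the line is filled into the data memory, tag memory, and utag memory in the way chosen by the replacement policy. In case 2, the requested line replaces the existing line that happened to have the same utag, and the replacement policy is not engaged to choose a way. This happens even if there is a vacant entry in the same set, essentially reducing the effective capacity of the cache. In case 5, the utag can simply be overwritten. Case 3 is a little complicated because it involves an entry with a matching physical tag and a different entry with a matching utag. One of them has to be invalidated and the other has to be replaced. A vacant entry can also exist in this case and go unused.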
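The fill behaviour described above might be sketched as follows. Which entry is invalidated and which is replaced in case 3 is not specified, so the choice made here (drop the entry with the matching physical tag, reuse the way with the matching utag) is purely an assumption, as are the names `fill` and `pick_victim`.

```python
def fill(set_entries, load_utag, load_ptag, pick_victim):
    """Update one set after the line arrives from the L2; return the way used.

    set_entries: mutable list of (utag, ptag) per way, None if invalid.
    pick_victim: hypothetical replacement-policy callback, set -> way.
    """
    utag_way = next((w for w, e in enumerate(set_entries)
                     if e is not None and e[0] == load_utag), None)
    ptag_way = next((w for w, e in enumerate(set_entries)
                     if e is not None and e[1] == load_ptag), None)
    if utag_way is not None and ptag_way is not None and utag_way != ptag_way:
        # Case 3 (ASSUMPTION): invalidate the old copy, reuse the utag's way.
        set_entries[ptag_way] = None
        way = utag_way
    elif utag_way is not None:
        way = utag_way                   # case 2: conflicting line is replaced
    elif ptag_way is not None:
        way = ptag_way                   # case 5: only the utag is rewritten
    else:
        way = pick_victim(set_entries)   # case 4: replacement policy decides
    set_entries[way] = (load_utag, load_ptag)
    return way

def first_free(entries):
    """Toy replacement policy: first invalid way, else way 0."""
    return entries.index(None) if None in entries else 0

ways = [(0xA7, 0x100), (0x11, 0x200), None, None]
used = fill(ways, 0xA7, 0x300, first_free)   # case 2
print(used, ways)
```

In the case 2 call, the way holding the conflicting utag is reused even though two ways of the set are vacant, which is the capacity loss mentioned above.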

Linear aliasing occurs when two different linear addresses are mapped to the same physical address. This can cause performance penalties for loads and stores to the aliased cachelines. A load to an address that is valid in the L1 DC but under a different linear alias will see an L1 DC miss, which requires an L2 cache request to be made. The latency will generally be no larger than that of an L2 cache hit. However, if multiple aliased loads or stores are in-flight simultaneously, they each may experience L1 DC misses as they update the utag with a particular linear address and remove another linear address from being able to access the cacheline.

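This is how case 5 (and case 2 to a lesser extent) can occur. Linear aliasing can occur within the same linear address space and across different address spaces (context switching and hyperthreading effects come into play).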
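As for how two linear addresses come to map to the same physical address in the first place, one common way within a single process is to map the same file page twice. A minimal sketch, assuming an OS where the default `mmap` mapping is shared (as on typical Linux systems):

```python
import mmap
import tempfile

with tempfile.TemporaryFile() as f:
    f.write(b"\0" * mmap.PAGESIZE)
    f.flush()
    # Two independent shared mappings of the same file page. They live at
    # different linear addresses but are backed by the same physical page.
    view_a = mmap.mmap(f.fileno(), mmap.PAGESIZE)
    view_b = mmap.mmap(f.fileno(), mmap.PAGESIZE)
    view_a[0:5] = b"hello"         # store through one linear alias
    seen = bytes(view_b[0:5])      # load through the other alias
    view_a.close()
    view_b.close()

print(seen)
```

Shared memory between processes and shared libraries produce the same kind of aliasing across address spaces.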

It is also possible for two different linear addresses that are NOT aliased to the same physical address to conflict in the utag, if they have the same linear hash. At a given L1 DC index (11:6), only one cacheline with a given linear hash is accessible at any time; any cachelines with matching linear hashes are marked invalid in the utag and are not accessible.

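This is how cases 2 and 3 can occur, and they're handled as discussed earlier. This part also tells us that the L1D uses the simple set indexing function: the set number is bits 11:6.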

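I think large pages make cases 2 and 3 more likely to occur because more than half of the bits used by the utag hash function become part of the page offset rather than the page number. Physical memory shared between multiple OS processes makes case 5 more likely.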
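The "more than half" figure can be checked by counting how many of the hash input bits (27:12) fall inside the page offset for each x86-64 page size. A short sketch; the function name is made up:

```python
HASH_BITS = range(12, 28)    # linear-address bits 27:12 feed the utag hash

def hash_bits_inside_page_offset(page_bytes):
    """Count hash input bits that are page-offset bits for this page size."""
    offset_bits = page_bytes.bit_length() - 1     # e.g. 12 for 4 KiB
    return sum(1 for bit in HASH_BITS if bit < offset_bits)

counts = {name: hash_bits_inside_page_offset(size)
          for name, size in [("4 KiB", 4 << 10),
                             ("2 MiB", 2 << 20),
                             ("1 GiB", 1 << 30)]}
print(counts, "out of", len(HASH_BITS))
```

With 2 MiB pages, 9 of the 16 hash input bits are offset bits, and with 1 GiB pages all 16 are.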

x86

caching

cpu-architecture

cpu-cache

amd-processor