Skylake L2 cache enhanced by reducing associativity?

In section 2.1.3 of Intel's optimization guide, they list a number of enhancements to the caching and memory subsystems in Skylake (emphasis mine):

The cache hierarchy of the Skylake microarchitecture has the following enhancements:

Higher Cache bandwidth compared to previous generations.

Simultaneous handling of more loads and stores enabled by enlarged buffers.

Processor can do two page walks in parallel compared to one in Haswell microarchitecture and earlier generations.

Page split load penalty down from 100 cycles in previous generation to 5 cycles.

L3 write bandwidth increased from 4 cycles per line in previous generation to 2 per line.

Support for the CLFLUSHOPT instruction to flush cache lines and manage memory ordering of flushed data using SFENCE.

Reduced performance penalty for a software prefetch that specifies a NULL pointer.

L2 associativity changed from 8 ways to 4 ways.

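The last one caught my eye. In what way is a reduction in the number of ways an enhancement? Taken by itself, fewer ways seems strictly worse than more ways. Of course, I understand there may be valid engineering reasons why reducing the number of ways could be a tradeoff that enables other enhancements, but here it is positioned, by itself, as an enhancement.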

What am I missing?

It's strictly worse for the performance of the L2 cache.
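As a rough illustration of why fewer ways is worse at equal capacity, here is a toy set-associative cache model. The function name `count_misses`, the 64-byte line size, the true-LRU replacement policy, and the 64 KiB stride are all assumptions chosen for the demo. The real L2 is physically indexed and does not necessarily use true LRU, so treat this as a sketch of conflict misses, not a model of Skylake.

```python
from collections import OrderedDict


def count_misses(addresses, size_bytes, ways, line_bytes=64):
    """Count misses in a toy set-associative cache with true LRU (an assumption)."""
    num_sets = size_bytes // (line_bytes * ways)
    # One OrderedDict per set: keys are tags, order tracks recency (oldest first).
    sets = [OrderedDict() for _ in range(num_sets)]
    misses = 0
    for addr in addresses:
        line = addr // line_bytes
        cache_set = sets[line % num_sets]
        tag = line // num_sets
        if tag in cache_set:
            cache_set.move_to_end(tag)         # hit: mark as most recently used
        else:
            misses += 1
            if len(cache_set) >= ways:
                cache_set.popitem(last=False)  # evict the least recently used tag
            cache_set[tag] = True
    return misses


# Hypothetical access pattern: loop 100 times over 8 lines spaced 64 KiB apart.
# With 256 KiB of capacity, all 8 lines land in the same set in both configurations.
stride = 64 * 1024
trace = [i * stride for i in range(8)] * 100

misses_8way = count_misses(trace, 256 * 1024, ways=8)
misses_4way = count_misses(trace, 256 * 1024, ways=4)
print(misses_8way)  # 8: only the compulsory misses, the 8 lines fit in one 8-way set
print(misses_4way)  # 800: 8 lines cycling through a 4-way set thrash under LRU
```

This pattern is deliberately the worst case for the 4-way cache. Most real workloads see a much smaller difference, which fits the point that the change probably doesn't hurt much in practice.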

According to this AnandTech writeup of SKL-SP (aka skylake-avx512 or SKL-X), Intel has stated that "the main reason [for reducing associativity] was to make the design more modular". Skylake-AVX512 has 1MiB of L2 cache, with 16-way associativity.

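Presumably the drop to 4-way associativity doesn't hurt too badly in the dual and quad-core laptop and desktop parts (SKL-S), since there's a lot of bandwidth to L3 cache. I think if Intel's simulations and testing had found that it hurt a lot, they would have put in the extra design time to keep the 8-way 256k cache on non-AVX512 Skylake.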
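To make the geometry concrete, here is a small sketch that works out the set count for the three configurations mentioned so far. The helper name `cache_geometry` is made up for this example, and the 64-byte line size is an assumption (it is the usual x86 line size, but the text above doesn't state it). The "tags compared" figure assumes a lookup checks every way of the selected set in parallel, which is the textbook design and not something confirmed for these particular chips.

```python
def cache_geometry(size_bytes, ways, line_bytes=64):
    """Derive basic set-associative cache parameters (illustrative helper)."""
    lines = size_bytes // line_bytes
    num_sets = lines // ways
    return {
        "sets": num_sets,
        "index_bits": num_sets.bit_length() - 1,  # log2, valid for power-of-two set counts
        "tags_compared_per_lookup": ways,         # assumes all ways are checked in parallel
    }


KIB = 1024
old_l2 = cache_geometry(256 * KIB, ways=8)    # 8-way 256k, as in the previous generation
new_l2 = cache_geometry(256 * KIB, ways=4)    # 4-way 256k, client Skylake
skx_l2 = cache_geometry(1024 * KIB, ways=16)  # 16-way 1MiB, Skylake-AVX512

print(old_l2)  # {'sets': 512, 'index_bits': 9, 'tags_compared_per_lookup': 8}
print(new_l2)  # {'sets': 1024, 'index_bits': 10, 'tags_compared_per_lookup': 4}
print(skx_l2)  # {'sets': 1024, 'index_bits': 10, 'tags_compared_per_lookup': 16}
```

Under these assumptions the 4-way 256k cache and the 16-way 1MiB cache have the same number of sets and differ only in ways per set. Whether that is what Intel meant by "more modular" is a guess on my part, but the arithmetic at least fits.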

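The upside of lower associativity is power budget. It could indirectly help performance by allowing more turbo headroom, but they mostly did it to improve efficiency, not to increase speed. Freeing up some room in the power budget lets them spend it elsewhere. Or not spend it all, and just use less power.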

Mobile and many-core server CPUs care a lot about power budget, much more than high-end quad-core desktop CPUs do.

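The heading on the list should more accurately read "changes", not "enhancements", but I'm sure the marketing department wouldn't let them write anything that didn't sound positive. :P At least Intel documents things accurately and in detail, including the ways new CPUs are worse than older designs.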

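Anandtech's SKL writeup suggests that dropping the associativity freed up power budget to increase L2 bandwidth, which (in the big picture) compensates for the increased miss rate.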

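IIRC, Intel has a policy that any proposed design change has to have a 2:1 ratio of perf gain to power cost, or something like that. So presumably if they lost 1% performance but saved 3% power with this L2 change, they'd do it. The 2:1 figure might be right, if I'm remembering this correctly, but the 1% and 3% in the example are totally made up.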

David Kanter discussed this change in a podcast interview he did right after the details were released at IDF. IDK if this is the right link.

cpu

x86

intel

cpu-cache