What happens with a non-temporal store if the data is already in cache?
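When you use a non-temporal store, e.g. movntq, and the data is already in cache, will the store update the cache instead of writing out to memory? Or will it update the cache line and write it out, evicting it? Or what?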
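Here's a fun dilemma. Assume thread A is loading the cache line containing x and y. Thread B writes to x using an NT store. Thread A writes to y. There's a data race here if B's store to x can be in transit to memory while A's load is happening. If A sees the old value of x, but the write to x has already happened, then the later write to y and the eventual write-back of the cache line will clobber the unrelated value x. I assume the processor somehow prevents this from happening? If this is allowed behavior, I can't see how anyone could build a reliable system using NT stores.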
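All of the behaviors you describe are sensible implementations of a non-temporal store. In practice, on modern x86 CPUs, the actual semantics are that there is no effect on the L1 cache, but the L2 (and higher-level caches, if any) will not evict a cache line to store the non-temporal fetch results.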
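There is no data race, because the caches are hardware coherent. The decision to evict a cache line does not affect this coherence in any way.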
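To make the coherence point concrete, here is a minimal single-threaded sketch in C. It is not the two-thread scenario from the comment above; it only shows that after an NT store to x and a normal store to y in the same line, both values read back intact. The struct `pair` and the function `nt_store_to_cached_line` are hypothetical names made up for this illustration, and the `__SSE2__` guard with a plain-store fallback is my own assumption so that it also builds on non-x86 targets.

```c
#if defined(__SSE2__)
#include <emmintrin.h>   /* _mm_stream_si32 (movnti) */
#endif

/* Hypothetical layout: x and y share one 64-byte cache line,
 * mirroring the x/y example in the question's comment. */
struct pair {
    _Alignas(64) int x;
    int y;
};

/* Returns 1 if both stores are visible afterwards, 0 otherwise. */
int nt_store_to_cached_line(void)
{
    static struct pair p;
    volatile int *vx = &p.x;
    volatile int *vy = &p.y;

    /* Normal stores first, so the line is very likely in cache already. */
    *vx = 1;
    *vy = 2;

#if defined(__SSE2__)
    _mm_stream_si32(&p.x, 10);   /* NT store to a line that is (probably) cached */
#else
    p.x = 10;                    /* fallback: plain store on non-SSE2 targets */
#endif
    *vy = 20;                    /* normal store to the neighbouring field */

    /* A thread always observes its own stores in program order, so both
     * reads must return the new values whatever the hardware did with the line. */
    return *vx == 10 && *vy == 20;
}
```

Calling `nt_store_to_cached_line()` should return 1 on any conforming implementation; what it cannot show is whether the line was updated in place or evicted, since that is invisible to software.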
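On multi-core CPUs (i.e. anything newer than Pentium M), if the destination cache line is already present in the cache hierarchy, it is evicted by the NT store, before the NT store takes effect.
This can be inefficient if the cache line is modified (and needs to be written back); in that case a normal store + clflush might be better. IDK what the cost is when the line is clean; the NT store itself, moving through the cache hierarchy on its way to the memory controller, could do the eviction to make sure no other core can still hold a stale cached copy after RAM is modified.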
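For reference, the two alternatives mentioned above could look roughly like this with compiler intrinsics. This is only a sketch: `store_then_flush` and `store_nt` are hypothetical helper names, the `__SSE2__` guard and plain-store fallback are my own additions for portability, and nothing here measures which option is actually faster.

```c
#if defined(__SSE2__)
#include <emmintrin.h>   /* _mm_stream_si32, _mm_clflush, _mm_sfence */
#endif

/* Option A: ordinary store, then explicitly flush the line.
 * The store hits in cache; clflush then writes the line back
 * (if modified) and invalidates it. */
void store_then_flush(int *dst, int value)
{
    *dst = value;
#if defined(__SSE2__)
    _mm_clflush(dst);
#endif
}

/* Option B: non-temporal store. On current CPUs a cached copy of the
 * line is evicted and the data goes out with WC semantics. */
void store_nt(int *dst, int value)
{
#if defined(__SSE2__)
    _mm_stream_si32(dst, value);
    _mm_sfence();        /* order the NT store before any later stores */
#else
    *dst = value;        /* fallback: plain store on non-SSE2 targets */
#endif
}
```

Either way the new value is what later loads see; the difference is only in how the cache hierarchy gets there, so picking one would need benchmarking on the target machine.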
From Intel's x86 volume 1 manual, ch 10.4.6.2 Caching of Temporal vs. Non-Temporal Data:
If a program specifies a non-temporal store with one of these instructions and the memory type of the destination region is write back (WB), write through (WT), or write combining (WC), the processor will do the following:
- If the memory location being written to is present in the cache hierarchy, the data in the caches is evicted. [1]
- The non-temporal data is written to memory with WC semantics.
[1] Some older CPU implementations (e.g., Pentium M) allowed addresses being written with a non-temporal store instruction to be updated in-place if the memory type was not WC and line was already in the cache.
See also: Chapter 11, “Memory Cache Control,” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A.
From Intel's optimization manual, 7.4.1.3 Memory Type and Non-temporal Stores. I've collapsed some of the text into summaries in []:
Memory type can take precedence over a non-temporal hint, leading to the following considerations:
- [NT is ignored for UC and WP: strongly-ordered uncacheable memory.]
- If the programmer specifies the weakly-ordered uncacheable memory type of Write-Combining (WC), then the non-temporal store and the region have the same semantics and there is no conflict.
- If the programmer specifies a non-temporal store to cacheable memory (for example, Write-Back (WB) or Write-Through (WT) memory types), two cases may result:
— CASE 1 — If the data is present in the cache hierarchy, the instruction will ensure consistency. A
particular processor may choose different ways to implement this. The following approaches are
probable:
(a) updating data in-place in the cache hierarchy while preserving the memory type semantics assigned to that region or
(b) evicting the data from the caches and writing the new
non-temporal data to memory (with WC semantics).
The approaches (separate or combined) can be different for future processors. Pentium 4, Intel
Core Solo and Intel Core Duo processors implement the latter policy (of evicting data from all
processor caches). The Pentium M processor implements a combination of both approaches.
If the streaming store hits a line that is present in the first-level cache, the store data is combined
in place within the first-level cache. If the streaming store hits a line present in the second-level,
the line and stored data is flushed from the second-level to system memory. [I think this whole paragraph is describing Pentium M's "combined" approach]
— CASE 2 — If the data is not present in the cache hierarchy and the destination region is mapped
as WB or WT; the transaction will be weakly ordered and is subject to all WC memory semantics.
This non-temporal store will not write-allocate. Different implementations may choose to collapse and combine such stores.
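Since the quoted text says these stores are weakly ordered, a producer that fills a buffer with NT stores normally needs an sfence before telling another thread the data is ready. A minimal sketch of that pattern, assuming a hypothetical helper `stream_fill_and_publish` and a C11 atomic flag (neither comes from the manual); on targets without SSE2 it falls back to plain stores:

```c
#include <stddef.h>
#include <stdatomic.h>
#if defined(__SSE2__)
#include <emmintrin.h>   /* _mm_stream_si32, _mm_sfence */
#endif

/* Fill dst[0..n) with value using NT stores, then publish via *ready. */
void stream_fill_and_publish(int *dst, size_t n, int value, atomic_int *ready)
{
    for (size_t i = 0; i < n; i++) {
#if defined(__SSE2__)
        _mm_stream_si32(&dst[i], value);   /* no write-allocate; may be combined */
#else
        dst[i] = value;                    /* fallback: plain store */
#endif
    }
#if defined(__SSE2__)
    /* NT stores are weakly ordered. A release store on x86 compiles to a
     * plain mov, which does not order earlier NT stores, so fence explicitly. */
    _mm_sfence();
#endif
    atomic_store_explicit(ready, 1, memory_order_release);
}
```

A consumer would load `*ready` with acquire ordering and only then read `dst`. Without the sfence, the flag could in principle become visible before some of the streamed data.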