How can the L1, L2, L3 CPU caches be turned off on modern x86/amd64 chips?
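Every modern high-performance CPU of the x86/x86_64 architecture has a hierarchy of data caches: L1, L2, and sometimes L3 (and, in very rare cases, L4), and data loaded from or stored to main RAM is cached in some of them.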
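Sometimes the programmer may want some data not to be cached in some or all cache levels (for example, when wanting to memset 16 GB of RAM while keeping some other data in the cache): there are non-temporal (NT) instructions for this, such as MOVNTDQA ( http://lwn.net/Articles/255364/ ).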
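As an illustration of the non-temporal idea, here is a minimal C sketch of a cache-bypassing fill. It is only an assumption about what such a memset could look like: the helper name `fill_nontemporal` is hypothetical, and it uses the SSE2 non-temporal store (MOVNTDQ, through the `_mm_stream_si128` intrinsic) rather than MOVNTDQA itself, which is a load. On targets without SSE2 it falls back to an ordinary `memset`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>  /* _mm_set1_epi8, _mm_stream_si128, _mm_sfence */
#endif

/* Hypothetical helper: fill n bytes at dst with value, using non-temporal
   stores where available. dst must be 16-byte aligned and n a multiple of 16. */
void fill_nontemporal(void *dst, uint8_t value, size_t n)
{
    assert(((uintptr_t)dst % 16) == 0 && (n % 16) == 0);
#if defined(__SSE2__)
    const __m128i v = _mm_set1_epi8((char)value);
    __m128i *p = (__m128i *)dst;
    for (size_t i = 0; i < n / 16; i++)
        _mm_stream_si128(&p[i], v);  /* MOVNTDQ: store with a non-temporal hint */
    _mm_sfence();                    /* order the NT stores before later stores */
#else
    memset(dst, value, n);           /* fallback: ordinary cached stores */
#endif
}
```

The non-temporal hint only affects the stores issued by this loop; every other access in the program keeps using the caches as usual, which is why it does not answer the "turn the caches off" part of the question.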
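But is there a programmatic way (for some AMD or Intel CPU families such as P3, P4, Core, Core i*, ...) to completely (but temporarily) turn off some or all cache levels, changing how every memory access instruction (globally, or for some applications or RAM regions) uses the memory hierarchy? For example: turn off L1, or turn off L1 and L2? Or change every memory access type to "uncached" UC (the CD+NW bits of CR0??? SDM vol. 3A, pages 423, 424, 425, and "Third-Level Cache Disable flag, bit 6 of the IA32_MISC_ENABLE MSR (available only in processors based on Intel NetBurst microarchitecture): allows the L3 cache to be disabled and enabled, independently of the L1 and L2 caches.").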
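I think such an action would help protect data from cache side-channel attacks and leaks, such as stealing AES keys, covert cache channels, and Meltdown/Spectre, although disabling the caches would carry an enormous performance cost.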
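PS: I remember such a program being posted many years ago on some technical news website, but I cannot find it now. It was just a Windows exe that wrote some magic values into an MSR and made every Windows program run after it very slow. The caches stayed off until reboot or until the program was started with the "undo" option.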
Intel's manual 3A, Section 11.5.3, provides an algorithm to globally disable the caches:
11.5.3 Preventing Caching
To disable the L1, L2, and L3 caches after they have been enabled and have received cache fills, perform the following steps:
1. Enter the no-fill cache mode. (Set the CD flag in control register CR0 to 1 and the NW flag to 0.)
2. Flush all caches using the WBINVD instruction.
3. Disable the MTRRs and set the default memory type to uncached or set all MTRRs for the uncached memory
type (see the discussion of the TYPE field and the E flag in Section 11.11.2.1,
“IA32_MTRR_DEF_TYPE MSR”).
The caches must be flushed (step 2) after the CD flag is set to ensure system memory coherency. If the caches are
not flushed, cache hits on reads will still occur and data will be read from valid cache lines.
The intent of the three separate steps listed above addresses three distinct requirements: (i) discontinue new data
replacing existing data in the cache (ii) ensure data already in the cache are evicted to memory, (iii) ensure subsequent memory references observe UC memory type semantics. Different processor implementation of caching
control hardware may allow some variation of software implementation of these three requirements. See note below.
NOTES
Setting the CD flag in control register CR0 modifies the processor’s caching behaviour as indicated
in Table 11-5, but setting the CD flag alone may not be sufficient across all processor families to
force the effective memory type for all physical memory to be UC nor does it force strict memory
ordering, due to hardware implementation variations across different processor families. To force
the UC memory type and strict memory ordering on all of physical memory, it is sufficient to either
program the MTRRs for all physical memory to be UC memory type or disable all MTRRs.
For the Pentium 4 and Intel Xeon processors, after the sequence of steps given above has been
executed, the cache lines containing the code between the end of the WBINVD instruction and
before the MTRRS have actually been disabled may be retained in the cache hierarchy. Here, to remove code from the cache completely, a second WBINVD instruction must be executed after the
MTRRs have been disabled.
The quote is long, but it boils down to this code:
;Ring 0 only. 32-bit form shown; in 64-bit mode use rax for the CR0 moves.
;Step 1 - Enter no-fill mode
mov eax, cr0
or eax, 1<<30 ; Set bit CD
and eax, ~(1<<29) ; Clear bit NW
mov cr0, eax
;Step 2 - Invalidate all the caches
wbinvd
;All memory accesses now go to/from memory, but UC memory ordering may still not be enforced.
;On Atom processors we are done: UC semantics are enforced automatically.
;Step 3 - Disable the MTRRs and set the default memory type to UC
xor eax, eax
xor edx, edx
mov ecx, IA32_MTRR_DEF_TYPE ;MSR number is 2FFH
wrmsr
;P4 only, remove this code from the L1I
wbinvd
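The CR0 arithmetic from step 1 can be checked in user mode without touching the register itself. The following is a minimal sketch under that assumption: the helper names `cr0_no_fill` and `cr0_caching_enabled` are hypothetical, and the functions only compute the value that ring-0 code would then write with `mov cr0`.

```c
#include <assert.h>
#include <stdint.h>

#define CR0_NW (UINT64_C(1) << 29)  /* Not Write-through */
#define CR0_CD (UINT64_C(1) << 30)  /* Cache Disable */

/* Hypothetical helper: CR0 value for no-fill mode (CD = 1, NW = 0).
   All other bits are left unchanged. */
uint64_t cr0_no_fill(uint64_t cr0)
{
    uint64_t v = (cr0 | CR0_CD) & ~CR0_NW;
    assert((v & CR0_CD) != 0 && (v & CR0_NW) == 0);
    return v;
}

/* Hypothetical helper: CR0 value for normal caching (CD = 0, NW = 0),
   i.e. the "undo" direction. */
uint64_t cr0_caching_enabled(uint64_t cr0)
{
    return cr0 & ~(CR0_CD | CR0_NW);
}
```

For example, `cr0_no_fill(0x80050033)` yields `0xC0050033`: only bit 30 changes. Undoing the full sequence would also require restoring the previous IA32_MTRR_DEF_TYPE value, which this sketch does not cover.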
Most of this cannot be executed from user mode.
AMD's manual 2 provides a similar algorithm in Section 7.6.2:
7.6.2 Cache Control Mechanisms
The AMD64 architecture provides a number of mechanisms for controlling the cacheability of memory. These are described in the following sections.
Cache Disable. Bit 30 of the CR0 register is the cache-disable bit, CR0.CD. Caching is enabled
when CR0.CD is cleared to 0, and caching is disabled when CR0.CD is set to 1. When caching is
disabled, reads and writes access main memory.
Software can disable the cache while the cache still holds valid data (or instructions). If a read or write
hits the L1 data cache or the L2 cache when CR0.CD=1, the processor does the following:
1. Writes the cache line back if it is in the modified or owned state.
2. Invalidates the cache line.
3. Performs a non-cacheable main-memory access to read or write the data.
If an instruction fetch hits the L1 instruction cache when CR0.CD=1, some processor models may read
the cached instructions rather than access main memory. When CR0.CD=1, the exact behavior of L2
and L3 caches is model-dependent, and may vary for different types of memory accesses.
The processor also responds to cache probes when CR0.CD=1. Probes that hit the cache cause the
processor to perform Step 1. Step 2 (cache-line invalidation) is performed only if the probe is
performed on behalf of a memory write or an exclusive read.
Writethrough Disable. Bit 29 of the CR0 register is the not writethrough disable bit, CR0.NW. In
early x86 processors, CR0.NW is used to control cache writethrough behavior, and the combination of
CR0.NW and CR0.CD determines the cache operating mode.
[...]
In implementations of the AMD64 architecture, CR0.NW is not used to qualify the cache operating
mode established by CR0.CD.
This translates to the following code (very similar to Intel's):
;Step 1 - Disable the caches
mov eax, cr0
or eax, 1<<30
mov cr0, eax
;For some models we need to invalidate the L1I
wbinvd
;Step 2 - Disable speculative accesses
xor eax, eax
xor edx, edx
mov ecx, MTRRdefType ;MSR number is 2FFH
wrmsr
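Both listings write zero to MSR 2FFH. A small user-mode sketch of why that value works follows, assuming the documented field layout of IA32_MTRR_DEF_TYPE / MTRRdefType (default type in bits 7:0, fixed-range enable FE in bit 10, MTRR enable E in bit 11). The struct and helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define MTRR_DEF_TYPE_MASK UINT64_C(0xFF)       /* bits 7:0, default memory type */
#define MTRR_DEF_FE        (UINT64_C(1) << 10)  /* fixed-range MTRRs enable */
#define MTRR_DEF_E         (UINT64_C(1) << 11)  /* MTRRs enable */
#define MTRR_TYPE_UC 0u
#define MTRR_TYPE_WB 6u

/* Hypothetical decoded view of the MSR. */
struct mtrr_def_type {
    unsigned type;      /* default memory type encoding */
    int fixed_enabled;  /* FE flag */
    int enabled;        /* E flag */
};

struct mtrr_def_type mtrr_def_type_decode(uint64_t msr)
{
    struct mtrr_def_type d;
    d.type = (unsigned)(msr & MTRR_DEF_TYPE_MASK);
    d.fixed_enabled = (msr & MTRR_DEF_FE) != 0;
    d.enabled = (msr & MTRR_DEF_E) != 0;
    return d;
}

uint64_t mtrr_def_type_encode(unsigned type, int fixed_enabled, int enabled)
{
    assert(type <= 0xFF);
    return (uint64_t)type
         | (fixed_enabled ? MTRR_DEF_FE : 0)
         | (enabled ? MTRR_DEF_E : 0);
}
```

Decoding 0 gives E = 0 and type = UC: the MTRRs are disabled and the default type is uncached, which is what the `xor eax, eax` / `xor edx, edx` / `wrmsr` sequence relies on. A typical "everything enabled, default WB" value would encode as `0xC06`.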
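Caches can also be selectively disabled:
- At the page level, with the attribute bit PCD (Page Cache Disable) [only for Pentium Pro and Pentium II].
When both PCD and PWT are clear, the relevant MTRR is used; if PCD is set, caching is disabled.
- At the page level, with the PAT (Page Attribute Table) mechanism.
By filling IA32_PAT with caching types and using the bits PAT, PCD, PWT as a 3-bit index, it is possible to select one of the six caching types (UC-, UC, WC, WT, WP, WB).
- Using the MTRRs (fixed or variable).
By setting the caching type to UC for specific physical regions (UC- is available only through the PAT).
Of these options, only the page attributes can be exposed to user-mode programs (see for example this).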
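The PAT lookup described above can be sketched in C as follows. This assumes the layout Intel documents for IA32_PAT (eight one-byte entries, with the memory type in the low 3 bits of each) and its documented power-on value 0007040600070406H; the enum and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Memory type encodings used in PAT entries. */
enum mem_type {
    MT_UC = 0, MT_WC = 1, MT_WT = 4, MT_WP = 5, MT_WB = 6, MT_UC_MINUS = 7
};

/* Power-on default of IA32_PAT: WB, WT, UC-, UC, repeated twice. */
#define PAT_RESET_VALUE UINT64_C(0x0007040600070406)

/* Build the 3-bit PAT index from the page-table bits PAT, PCD, PWT. */
unsigned pat_index(int pat, int pcd, int pwt)
{
    return (pat ? 4u : 0u) | (pcd ? 2u : 0u) | (pwt ? 1u : 0u);
}

/* Select entry `index` (0..7) of the IA32_PAT value and return its type. */
enum mem_type pat_lookup(uint64_t ia32_pat, unsigned index)
{
    assert(index < 8);
    return (enum mem_type)((ia32_pat >> (8 * index)) & 0x7);
}
```

With the default table, a page with PCD = 1 and PWT = 1 selects entry 3, which is UC, while PCD = 1 and PWT = 0 selects entry 2, which is UC-. An operating system is free to reprogram the table, so the mapping on a running system may differ from this default.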