How can I create a spectre gadget in practice?

Question

I'm developing (in NASM + GCC, targeting ELF64) a PoC that uses a spectre gadget which measures the time to access a set of cache lines (FLUSH+RELOAD).

How can I make a reliable spectre gadget?

I believe I understand the theory behind the FLUSH+RELOAD technique, but in practice, even allowing for some noise, I'm unable to produce a working PoC.

Since I'm using the timestamp counter and the loads are very regular, I use this script to disable the prefetchers and turbo boost, and to fix/stabilize the CPU frequency:

#!/bin/bash

sudo modprobe msr

#Disable turbo
sudo wrmsr -a 0x1a0 0x4000850089

#Disable prefetchers
sudo wrmsr -a 0x1a4 0xf

#Set performance governor
sudo cpupower frequency-set -g performance

#Minimum freq
sudo cpupower frequency-set -d 2.2GHz

#Maximum freq
sudo cpupower frequency-set -u 2.2GHz
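
As a sanity check on the two MSR values written above, here is a small Python sketch that decodes them. It assumes the bit layout Intel documents for many Core-family processors (bit 38 of IA32_MISC_ENABLE, MSR 0x1A0, disables turbo; bits 0 to 3 of MSR 0x1A4 each disable one prefetcher). The helper names are made up for illustration, and the layout should be checked against the manual for the specific CPU model.

```python
# Bit positions are assumptions taken from Intel's public documentation for
# many Core-family CPUs; verify them for your own model before relying on them.
TURBO_DISABLE_BIT = 38  # IA32_MISC_ENABLE (MSR 0x1A0)

PREFETCHER_BITS = {     # MSR 0x1A4, a set bit means "disabled"
    0: "L2 hardware prefetcher",
    1: "L2 adjacent cache line prefetcher",
    2: "DCU (L1D) streamer prefetcher",
    3: "DCU (L1D) IP prefetcher",
}


def turbo_disabled(misc_enable: int) -> bool:
    """Return True if the turbo-disable bit is set in the given MSR value."""
    return bool((misc_enable >> TURBO_DISABLE_BIT) & 1)


def disabled_prefetchers(value: int) -> list:
    """Return the names of the prefetchers whose disable bit is set."""
    return [name for bit, name in PREFETCHER_BITS.items() if (value >> bit) & 1]


if __name__ == "__main__":
    print("turbo disabled:", turbo_disabled(0x4000850089))
    for name in disabled_prefetchers(0xF):
        print("disabled:", name)
```

Reading the MSRs back with rdmsr and feeding the values to these helpers is one way to confirm that the script took effect on every core.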

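I have a contiguous buffer, aligned on 4 KiB, large enough to span 256 cache lines that are separated from each other by an integral number, GAP, of lines.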

SECTION .bss ALIGN=4096

 buffer:    resb 256 * (1 + GAP) * 64

I use this function to flush the 256 lines.

flush_all:
 lea rdi, [buffer]              ;Start pointer
 mov esi, 256                   ;How many lines to flush

.flush_loop:
  lfence                        ;Prevent the previous clflush to be reordered after the load
  mov eax, [rdi]                ;Touch the page
  lfence                        ;Prevent the current clflush to be reordered before the load

  clflush  [rdi]                ;Flush a line
  add rdi, (1 + GAP)*64         ;Move to the next line

  dec esi
 jnz .flush_loop                ;Repeat

 lfence                         ;clflush are ordered with respect of fences ..
                                ;.. and lfence is ordered (locally) with respect of all instructions
 ret

The function loops through all the lines, touching every page along the way (each page more than once) and flushing each line.

Then I use this function to profile the accesses.

profile:
 lea rdi, [buffer]           ;Pointer to the buffer
 mov esi, 256                ;How many lines to test
 lea r8, [timings_data]      ;Pointer to timings results

 mfence                      ;I'm pretty sure this is useless, but I included it to rule out ..
                             ;.. silly, hard to debug, scenarios

.profile: 
  mfence
  rdtscp
  lfence                     ;Read the TSC in-order (ignoring stores global visibility)

  mov ebp, eax               ;Read the low DWORD only (this is a short delay)

  ;PERFORM THE LOADING
  mov eax, DWORD [rdi]

  rdtscp
  lfence                     ;Again, read the TSC in-order

  sub eax, ebp               ;Compute the delta

  mov DWORD [r8], eax        ;Save it

  ;Advance the loop

  add r8, 4                  ;Move the results pointer
  add rdi, (1 + GAP)*64      ;Move to the next line

  dec esi                    ;Advance the loop
 jnz .profile

 ret

An MCVE is given in the appendix, and a repository is available to clone.

When the program is assembled with GAP set to 0, linked, and executed with taskset -c 0, the cycles needed to fetch each line are shown below.

Only 64 lines are loaded from memory.

The output is stable across different runs. If I set GAP to 1, only 32 lines are fetched from memory. Of course, 64 * (1+0) * 64 = 32 * (1+1) * 64 = 4096, so this may be related to paging?

If a store to one of the first 64 lines is executed before the profiling (but after the flush), the output changes to this

Any store to the other lines gives the first type of output.

I suspect the math in there is broken, but I need another pair of eyes to find out where.

EDIT

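I was misusing a volatile register; after fixing that, the output is now inconsistent.
I mostly see runs where the timings are low (~50 cycles) and sometimes runs where the timings are higher (~130 cycles).
I don't know where the 130-cycle figure comes from (too low for memory, too high for the cache?).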

The code is fixed in the MCVE (and in the repository).

If a store to any of the first lines is executed before the profiling, no change is reflected in the output.

Appendix - MCVE

BITS 64
DEFAULT REL

GLOBAL main

EXTERN printf
EXTERN exit

;Space between lines in the buffer
%define GAP 0

SECTION .bss ALIGN=4096

 buffer:    resb 256 * (1 + GAP) * 64

SECTION .data

 timings_data:  TIMES 256 dd 0

 strNewLine     db `\n0x%02x: `, 0
 strHalfLine    db " ", 0
 strTiming      db `\e[48;5;16`,
  .importance   db "0",
                db `m\e[38;5;15m%03u\e[0m `, 0

 strEnd         db `\n\n`, 0

SECTION .text

;'._ .''._ .''._ .''._ .''._ .''._ .''._ .''._ .''._ .''._ .''._ .'
; ' ' ' ' ' ' ' ' ' ' '
; _' \ _' \ _' \ _' \ _' \ _' \ _' \ _' \ _' \ _' \ _' \
;/ \/ \/ \/ \/ \/ \/ \/ \/ \/ \/ \
;
;
;FLUSH ALL THE LINES OF A BUFFER FROM THE CACHES
;
;

flush_all:
 lea rdi, [buffer]              ;Start pointer
 mov esi, 256                   ;How many lines to flush

.flush_loop:
  lfence                        ;Prevent the previous clflush to be reordered after the load
  mov eax, [rdi]                ;Touch the page
  lfence                        ;Prevent the current clflush to be reordered before the load

  clflush  [rdi]                ;Flush a line
  add rdi, (1 + GAP)*64         ;Move to the next line

  dec esi
 jnz .flush_loop                ;Repeat

 lfence                         ;clflush are ordered with respect of fences ..
                                ;.. and lfence is ordered (locally) with respect of all instructions
 ret

;'._ .''._ .''._ .''._ .''._ .''._ .''._ .''._ .''._ .''._ .''._ .'
; ' ' ' ' ' ' ' ' ' ' '
; _' \ _' \ _' \ _' \ _' \ _' \ _' \ _' \ _' \ _' \ _' \
;/ \/ \/ \/ \/ \/ \/ \/ \/ \/ \/ \
;
;
;PROFILE THE ACCESS TO EVERY LINE OF THE BUFFER
;
;

profile:
 lea rdi, [buffer]           ;Pointer to the buffer
 mov esi, 256                ;How many lines to test
 lea r8, [timings_data]      ;Pointer to timings results

 mfence                      ;I'm pretty sure this is useless, but I included it to rule out ..
                             ;.. silly, hard to debug, scenarios

.profile:
  mfence
  rdtscp
  lfence                     ;Read the TSC in-order (ignoring stores global visibility)

  mov ebp, eax               ;Read the low DWORD only (this is a short delay)

  ;PERFORM THE LOADING
  mov eax, DWORD [rdi]

  rdtscp
  lfence                     ;Again, read the TSC in-order

  sub eax, ebp               ;Compute the delta

  mov DWORD [r8], eax        ;Save it

  ;Advance the loop

  add r8, 4                  ;Move the results pointer
  add rdi, (1 + GAP)*64      ;Move to the next line

  dec esi                    ;Advance the loop
 jnz .profile

 ret

;'._ .''._ .''._ .''._ .''._ .''._ .''._ .''._ .''._ .''._ .''._ .'
; ' ' ' ' ' ' ' ' ' ' '
; _' \ _' \ _' \ _' \ _' \ _' \ _' \ _' \ _' \ _' \ _' \
;/ \/ \/ \/ \/ \/ \/ \/ \/ \/ \/ \
;
;
;SHOW THE RESULTS
;
;

show_results:
 lea rbx, [timings_data]     ;Pointer to the timings
 xor r12, r12                ;Counter (up to 256)

.print_line:

 ;Format the output

 xor eax, eax
 mov esi, r12d
 lea rdi, [strNewLine]       ;Setup for a call to printf

 test r12d, 0fh
 jz .print                   ;Test if counter is a multiple of 16

 lea rdi, [strHalfLine]      ;Setup for a call to printf

 test r12d, 07h              ;Test if counter is a multiple of 8
 jz .print

.print_timing:

  ;Print
  mov esi, DWORD [rbx]       ;Timing value

  ;Compute the color
  mov r10d, 60               ;Used to compute the color
  mov eax, esi
  xor edx, edx
  div r10d                   ;eax = Timing value / 60

  ;Update the color

  add al, '0'
  mov edx, '5'
  cmp eax, edx
  cmova eax, edx
  mov BYTE [strTiming.importance], al

  xor eax, eax
  lea rdi, [strTiming]
  call printf WRT ..plt      ;Print a 3-digits number

  ;Advance the loop

  inc r12d                   ;Increment the counter
  add rbx, 4                 ;Move to the next timing
  cmp r12d, 256
 jb .print_line              ;Advance the loop

 xor eax, eax
 lea rdi, [strEnd]
 call printf WRT ..plt       ;Print a new line

 ret

.print:

 call printf WRT ..plt       ;Print a string

jmp .print_timing

;'._ .''._ .''._ .''._ .''._ .''._ .''._ .''._ .''._ .''._ .''._ .'
; ' ' ' ' ' ' ' ' ' ' '
; _' \ _' \ _' \ _' \ _' \ _' \ _' \ _' \ _' \ _' \ _' \
;/ \/ \/ \/ \/ \/ \/ \/ \/ \/ \/ \
;
;
;E N T R Y P O I N T
;
;
;'._ .''._ .''._ .''._ .''._ .''._ .''._ .''._ .''._ .''._ .''._ .'
; ' ' ' ' ' ' ' ' ' ' '
; _' \ _' \ _' \ _' \ _' \ _' \ _' \ _' \ _' \ _' \ _' \
;/ \/ \/ \/ \/ \/ \/ \/ \/ \/ \/ \

main:

 ;Flush all the lines of the buffer
 call flush_all

 ;Test the access times
 call profile

 ;Show the results
 call show_results

 ;Exit
 xor edi, edi
 call exit WRT ..plt

Answer 1

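The buffer is allocated from the bss section, so when the program is loaded, the OS maps all of the buffer's cache lines to the same CoW physical page. After all of the lines are flushed, only the accesses to the first 64 lines in the virtual address space miss in all cache levels¹, because all² later accesses are to the same 4K page. That's why, when GAP is zero, the latencies of the first 64 accesses fall in the range of the main memory latency, while the latencies of all later accesses are equal to the L1 hit latency³.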

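When GAP is 1, every other line of the same physical page is accessed, so the number of main memory accesses (L3 misses) is 32 (half of 64). That is, the first 32 latencies will be in the range of the main memory latency and all later accesses will be L1 hits. Similarly, when GAP is 63, the stride is a full 4 KiB page, so all accesses are to the same line. Therefore, only the first access misses all caches.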
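
The arithmetic behind these counts can be checked with a short Python sketch. It assumes exactly what the paragraphs above describe: 4 KiB pages, 64-byte lines, and every virtual page of the buffer backed by the same physical page, so that only the page offset of an access selects the physical line. The function name is illustrative and not part of the original code.

```python
LINE = 64       # cache line size in bytes
PAGE = 4096     # page size in bytes
ACCESSES = 256  # number of lines profiled by the question's loop


def distinct_physical_lines(gap: int) -> int:
    """Count the physical lines touched when all virtual pages alias one physical page."""
    stride = (1 + gap) * LINE
    # With a single shared physical page, only the offset within the page matters.
    offsets = {(i * stride) % PAGE for i in range(ACCESSES)}
    return len({offset // LINE for offset in offsets})


if __name__ == "__main__":
    for gap in (0, 1, 63):
        print(f"GAP={gap}: {distinct_physical_lines(gap)} distinct physical lines")
```

For GAP values of 0, 1, and 63 this yields 64, 32, and 1 distinct lines, matching the number of slow accesses described above.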

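The solution is to change mov eax, [rdi] in flush_all to mov dword [rdi], 0, which ensures that the buffer is backed by unique physical pages. (The lfence instructions in flush_all can be removed, because the Intel manual states that clflush cannot be reordered with writes⁴.) This guarantees that, after initializing and flushing all lines, all accesses will miss all cache levels (but not the TLB, see: Does clflush also remove TLB entries?).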

There is another example elsewhere of how CoW pages can be deceiving.

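In a previous version of this answer, I suggested removing the call to flush_all and using a GAP value of 63. With these changes, all of the access latencies appeared to be very high, and I incorrectly concluded that all of the accesses were missing all cache levels. As I said above, with a GAP value of 63 all of the accesses go to the same cache line, which is actually resident in the L1 cache. The reason all of the latencies were high is that every access was to a different virtual page, and the TLB didn't hold a mapping for any of these virtual pages (to the same physical page), because with the call to flush_all removed, none of the virtual pages had been touched before. So the measured latencies represent the TLB miss latency, even though the line being accessed is in the L1 cache.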
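
To make the TLB argument concrete, the sketch below counts how many distinct virtual pages the 256 accesses touch for a given GAP. It assumes 4 KiB pages and a page-aligned buffer, as in the question; the function name and default parameters are illustrative only.

```python
def distinct_virtual_pages(gap: int, accesses: int = 256,
                           line: int = 64, page: int = 4096) -> int:
    """Count the virtual pages touched by a strided walk over a page-aligned buffer."""
    stride = (1 + gap) * line
    # Offsets are relative to the (page-aligned) start of the buffer.
    return len({(i * stride) // page for i in range(accesses)})


if __name__ == "__main__":
    for gap in (0, 1, 63):
        print(f"GAP={gap}: {distinct_virtual_pages(gap)} virtual pages")
```

With GAP set to 63 each of the 256 accesses lands on its own virtual page, so every access needs a translation that has never been used if nothing touched the pages beforehand, whereas GAP values of 0 and 1 touch only 4 and 8 pages.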

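I also incorrectly claimed in a previous version of this answer that there is L3 prefetching logic that cannot be disabled through MSR 0x1A4. If a particular prefetcher is turned off by setting its flag in MSR 0x1A4, then it does get fully switched off. Also, there are no data prefetchers other than the ones documented by Intel.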

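Footnotes:

(1) If you don't disable the DCU IP prefetcher, it will actually prefetch all the lines back into the L1 after they are flushed, so all accesses will still hit in the L1.

(2) In rare cases, the execution of interrupt handlers or the scheduling of other threads on the same core may cause some of the lines to be evicted from the L1 and potentially from other levels of the cache hierarchy.

(3) Remember that you need to subtract the overhead of the rdtscp instructions. Note that the measurement method you are using doesn't actually let you reliably distinguish between an L1 hit and an L2 hit. See: .

(4) The Intel manual doesn't seem to specify whether clflush is ordered with respect to reads, but it appears to me that it is.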
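
The overhead subtraction mentioned in footnote (3) can be applied as a post-processing step on the recorded deltas. The sketch below assumes the overhead is estimated by timing an empty region (two rdtscp reads with nothing in between) several times and taking the minimum. That estimation strategy, the function names, and the sample numbers are all assumptions for illustration and are not part of the original code.

```python
def estimate_overhead(empty_samples):
    """Use the smallest delta of an empty timed region as the fixed timing cost."""
    return min(empty_samples)


def subtract_overhead(raw_deltas, overhead):
    """Remove the fixed timing cost from each sample, clamping at zero."""
    return [max(delta - overhead, 0) for delta in raw_deltas]


if __name__ == "__main__":
    # All numbers below are made up for illustration; they are not measurements.
    empty = [34, 32, 36, 32]
    raw = [50, 130, 48, 31]
    overhead = estimate_overhead(empty)
    print("overhead:", overhead)
    print("corrected:", subtract_overhead(raw, overhead))
```

Even after this correction, the footnote's caveat still applies: the remaining differences between an L1 hit and an L2 hit are small compared to the measurement noise.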


x86

assembly

caching

spectre