Big difference in overhead caused by instructions in straight-line code

Question

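I am trying to understand the overhead of [blk_account_io_completion][1] in the Linux block layer. Using perf annotate I get the following (abridged) snippet. Can someone shed light on why the add and test instructions show such a high overhead compared with the adjacent instructions they are executed with?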

         :                      part_stat_add(cpu, part, sectors[rw], bytes >> 9);
    0.13 :        ffffffff813336eb:       movsxd r8,r8d
    0.00 :        ffffffff813336ee:       lea    rdx,[rax*8+0x0]
    0.00 :        ffffffff813336f6:       mov    rcx,QWORD PTR [rdi+0x210]
   72.04 :        ffffffff813336fd:       add    rcx,QWORD PTR [r8*8-0x7e2df6a0]
    0.22 :        ffffffff81333705:       add    QWORD PTR [rcx+rdx*1],rsi
    0.61 :        ffffffff81333709:       mov    eax,DWORD PTR [rdi+0x1f4]
   26.52 :        ffffffff8133370f:       test   eax,eax
    0.00 :        ffffffff81333711:       je     ffffffff81333733 <blk_account_io_completion+0x83>

Answer 1

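One possible reason is that these instructions just happen to be the ones the instruction pointer is pointing at when a sample is taken. A typical x86 CPU can retire up to 4 instructions per cycle, but when it does so and a sample is taken, the program counter points at only one instruction, not at all four.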
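The attribution effect described above can be illustrated with a deliberately simplified toy model. It assumes a fixed retire width of 4 and that each sample is charged to the oldest instruction not yet retired; both are assumptions made for illustration, and the function name `attribute_samples` is hypothetical. Real hardware is considerably more complicated.

```c
#include <stddef.h>

#define RETIRE_WIDTH 4 /* assumption: up to 4 instructions retire per cycle */

/*
 * Toy model, not a CPU simulator: walk a straight-line block of n_insns
 * instructions `passes` times, retiring RETIRE_WIDTH instructions per
 * cycle, and charge one sample per cycle to the first instruction of the
 * group that is about to retire. hits[] must have n_insns entries.
 */
void attribute_samples(unsigned long *hits, size_t n_insns, unsigned long passes)
{
    for (size_t i = 0; i < n_insns; i++)
        hits[i] = 0;

    for (unsigned long p = 0; p < passes; p++) {
        /* One loop step models one cycle; only the group head is charged. */
        for (size_t head = 0; head < n_insns; head += RETIRE_WIDTH)
            hits[head]++;
    }
}
```

In this model every fourth instruction collects all the samples and the three after it collect none, even though each instruction costs the same.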

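Here is an example, shown in the perf annotate listing below: a simple, plain loop with a bunch of nop instructions. Note how the clock ticks are distributed over the profile with gaps of three instructions in between. This could be similar to the effect you are seeing.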

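Alternatively, it could be that mov rcx,QWORD PTR [rdi+0x210] and mov eax,DWORD PTR [rdi+0x1f4] often miss the cache: the cycles spent waiting on a cache miss are attributed to the next instruction, as described for example here.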

       │    Disassembly of section .text:
       │
       │    00000000004004ed :
       │      push   %rbp
       │      mov    %rsp,%rbp
       │      movl   $0x0,-0x4(%rbp)
       │    ↓ jmp    25
 14.59 │ d:   nop
       │      nop
       │      nop
  0.03 │      nop
 14.58 │      nop
       │      nop
       │      nop
  0.08 │      nop
 13.89 │      nop
       │      nop
  0.01 │      nop
  0.08 │      nop
 13.99 │      nop
       │      nop
  0.01 │      nop
  0.05 │      nop
 13.92 │      nop
       │      nop
  0.01 │      nop
  0.07 │      nop
 14.44 │      addl   $0x1,-0x4(%rbp)
  0.33 │25:   cmpl   $0x3fffffff,-0x4(%rbp)
 13.90 │    ↑ jbe    d
       │      pop    %rbp
       │    ← retq
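The source for this listing is not given. A loop of roughly the following shape would plausibly compile to similar code. This is a reconstruction under assumptions: the function name `run_nop_loop` is hypothetical, the nops are emitted with GCC/Clang inline assembly, and the loop bound is passed as a parameter instead of being hard-coded.

```c
/* Four nops as one string, so five copies give the 20 nops in the listing. */
#define NOP4 "nop\n\tnop\n\tnop\n\tnop\n\t"

/*
 * Run a counted loop whose body is 20 nop instructions.
 * The counter is unsigned to match the unsigned compare (jbe) in the
 * listing. Returns the number of iterations executed, i.e. last + 1.
 * Do not pass the maximum unsigned value: i <= last would never be false.
 */
unsigned int run_nop_loop(unsigned int last)
{
    unsigned int i;

    for (i = 0; i <= last; i++) {
        /* volatile stops the compiler from dropping or merging the nops */
        __asm__ volatile(NOP4 NOP4 NOP4 NOP4 NOP4);
    }
    return i;
}
```

Calling `run_nop_loop(0x3fffffffu)` from `main`, building without optimization so the counter stays on the stack, and profiling with `perf record` followed by `perf annotate` should give a comparable profile. The exact percentages will depend on the CPU.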

linux

profiling

linux-kernel

perf