How to decrease the time spent on one instruction?

I am trying to optimize a code in C, and one instruction seems to take about 22% of the time.

The code is compiled with gcc 8.2.0. The flags are -O3 -DNDEBUG -g -Wall -Wextra -Weffc++ -pthread -lrt.

    509529.517218      task-clock (msec)         #    0.999 CPUs utilized
            6,234      context-switches          #    0.012 K/sec
               10      cpu-migrations            #    0.000 K/sec
        1,305,885      page-faults               #    0.003 M/sec
1,985,640,853,831      cycles                    #    3.897 GHz                      (30.76%)
1,897,574,410,921      instructions              #    0.96  insn per cycle           (38.46%)
  229,365,727,020      branches                  #  450.152 M/sec                    (38.46%)
   13,027,677,754      branch-misses             #    5.68% of all branches          (38.46%)
  604,340,619,317      L1-dcache-loads           # 1186.076 M/sec                    (38.46%)
   47,749,307,910      L1-dcache-load-misses     #    7.90% of all L1-dcache hits    (38.47%)
   19,724,956,845      LLC-loads                 #   38.712 M/sec                    (30.78%)
    3,349,412,068      LLC-load-misses           #   16.98% of all LL-cache hits     (30.77%)
  <not supported>      L1-icache-loads                                          
      129,878,634      L1-icache-load-misses                                         (30.77%)
  604,482,046,140      dTLB-loads                # 1186.353 M/sec                    (30.77%)
    4,596,384,416      dTLB-load-misses          #    0.76% of all dTLB cache hits   (30.77%)
        2,493,696      iTLB-loads                #    0.005 M/sec                    (30.77%)
       21,356,368      iTLB-load-misses          #  856.41% of all iTLB cache hits   (30.76%)
  <not supported>      L1-dcache-prefetches                                     
  <not supported>      L1-dcache-prefetch-misses                                

    509.843595752 seconds time elapsed

    507.706093000 seconds user
      1.839848000 seconds sys

The VTune Amplifier gives me a hint of a function: https://pasteboard.co/IagrLaF.png

The instruction cmpq seems to take 22% of the whole time, while the other instructions take a negligible amount of time.

perf gives me a somewhat different picture, yet I think that the results are consistent:

 Percent│       bool mapFound = false;
   0.00 │       movb   $0x0,0x7(%rsp)
        │     goDownBwt():
        │       bwt_2occ(bwt, getStateInterval(previousState)->k-1, getStateInterval(previousState)->l, nucleotide, &newState->interval.k, &newState->interval.l);
   0.00 │       lea    0x20(%rsp),%r12
        │         newState->preprocessedInterval = previousState->preprocessedInterval->firstChild + nucleotide;
   0.00 │       lea    (%rax,%rax,2),%rax
   0.00 │       shl    $0x3,%rax
   0.00 │       mov    %rax,0x18(%rsp)
   0.01 │       movzwl %dx,%eax
   0.00 │       mov    %eax,(%rsp)
   0.00 │     ↓ jmp    d6
        │       nop
        │       if ((previousState->trace & PREPROCESSED) && (previousState->preprocessedInterval->firstChild != NULL)) {
   0.30 │ 88:   mov    (%rax),%rsi
   8.38 │       mov    0x10(%rsi),%rcx
   0.62 │       test   %rcx,%rcx
   0.15 │     ↓ je     1b0
        │         newState->preprocessedInterval = previousState->preprocessedInterval->firstChild + nucleotide;
   2.05 │       add    0x18(%rsp),%rcx
        │         ++stats->nDownPreprocessed;
   0.25 │       addq   $0x1,0x18(%rdx)
        │         newState->trace                = PREPROCESSED;
   0.98 │       movb   $0x10,0x30(%rsp)
        │         return (newState->preprocessedInterval->interval.k <= newState->preprocessedInterval->interval.l);
  43.36 │       mov    0x8(%rcx),%rax
   2.61 │       cmp    %rax,(%rcx)
        │         newState->preprocessedInterval = previousState->preprocessedInterval->firstChild + nucleotide;
   0.05 │       mov    %rcx,0x20(%rsp)
        │         return (newState->preprocessedInterval->interval.k <= newState->preprocessedInterval->interval.l);
   3.47 │       setbe  %dl

The function is

inline bool goDownBwt (state_t *previousState, unsigned short nucleotide, state_t *newState) {
  ++stats->nDown;
  if ((previousState->trace & PREPROCESSED) && (previousState->preprocessedInterval->firstChild != NULL)) {
    ++stats->nDownPreprocessed;
    newState->preprocessedInterval = previousState->preprocessedInterval->firstChild + nucleotide;
    newState->trace                = PREPROCESSED;
    return (newState->preprocessedInterval->interval.k <= newState->preprocessedInterval->interval.l);
  }
  bwt_2occ(bwt, getStateInterval(previousState)->k-1, getStateInterval(previousState)->l, nucleotide, &newState->interval.k, &newState->interval.l);
  newState->interval.k = bwt->L2[nucleotide] + newState->interval.k + 1;
  newState->interval.l = bwt->L2[nucleotide] + newState->interval.l;
  newState->trace = 0;
  return (newState->interval.k <= newState->interval.l);
}

where state_t is defined as

struct state_t {
  union {
    bwtinterval_t interval;
    preprocessedInterval_t *preprocessedInterval;
  };
  unsigned char trace;
  struct state_t *previousState;
};

and preprocessedInterval_t is:

struct preprocessedInterval_t {
  bwtinterval_t interval;
  preprocessedInterval_t *firstChild;
};

There are few (~1000) state_t structures. However, there are many (350k) preprocessedInterval_t objects, allocated somewhere else.

The first if is true 15 billion times out of 19 billion.

Finding the mispredicted branches on the function with perf record -e branches,branch-misses mytool gives me:

Available samples
2M branches                                                                                                                                                                                                       
1M branch-misses  

Can I assume that branch misprediction is responsible for this slow down? What would be the next step to optimize my code?

The code is available on GitHub.


Edit 1

valgrind --tool=cachegrind gives me:

I   refs:      1,893,716,274,393
I1  misses:            4,702,494
LLi misses:              137,142
I1  miss rate:              0.00%
LLi miss rate:              0.00%

D   refs:        756,774,557,235  (602,597,601,611 rd   + 154,176,955,624 wr)
D1  misses:       39,489,866,187  ( 33,583,272,379 rd   +   5,906,593,808 wr)
LLd misses:        3,483,920,786  (  3,379,118,877 rd   +     104,801,909 wr)
D1  miss rate:               5.2% (            5.6%     +             3.8%  )
LLd miss rate:               0.5% (            0.6%     +             0.1%  )

LL refs:          39,494,568,681  ( 33,587,974,873 rd   +   5,906,593,808 wr)
LL misses:         3,484,057,928  (  3,379,256,019 rd   +     104,801,909 wr)
LL miss rate:                0.1% (            0.1%     +             0.1%  )

Edit 2

I compiled with -O3 -DNDEBUG -march=native -fprofile-use, and used the command perf stat -e task-clock,context-switches,cpu-migrations,page-faults,cycles,branches,branch-misses,instructions,uops_issued.any,uops_executed.thread,mem_load_uops_retired.l3_miss,mem_load_uops_retired.l2_miss,mem_load_uops_retired.l1_miss ./a.out

    508322.348021      task-clock (msec)         #    0.998 CPUs utilized
           21,592      context-switches          #    0.042 K/sec
               33      cpu-migrations            #    0.000 K/sec
        1,305,885      page-faults               #    0.003 M/sec
1,978,382,746,597      cycles                    #    3.892 GHz                      (44.44%)
  228,898,532,311      branches                  #  450.302 M/sec                    (44.45%)
   12,816,920,039      branch-misses             #    5.60% of all branches          (44.45%)
1,867,947,557,739      instructions              #    0.94  insn per cycle           (55.56%)
2,957,085,686,275      uops_issued.any           # 5817.343 M/sec                    (55.56%)
2,864,257,274,102      uops_executed.thread      # 5634.726 M/sec                    (55.56%)
    2,490,571,629      mem_load_uops_retired.l3_miss #    4.900 M/sec                    (55.55%)
   12,482,683,638      mem_load_uops_retired.l2_miss #   24.557 M/sec                    (55.55%)
   18,634,558,602      mem_load_uops_retired.l1_miss #   36.659 M/sec                    (44.44%)

    509.210162391 seconds time elapsed

    506.213075000 seconds user
      2.147749000 seconds sys

Edit 3

I selected the results of perf record -e task-clock,context-switches,cpu-migrations,page-faults,cycles,branches,branch-misses,instructions,uops_issued.any,uops_executed.thread,mem_load_uops_retired.l3_miss,mem_load_uops_retired.l2_miss,mem_load_uops_retired.l1_miss a.out that mention my function:

Samples: 2M of event 'task-clock', Event count (approx.): 517526250000
Overhead  Command     Shared Object       Symbol
  49.76%  srnaMapper  srnaMapper          [.] mapWithoutError

Samples: 917K of event 'cycles', Event count (approx.): 891499601652
Overhead  Command     Shared Object       Symbol
  49.36%  srnaMapper  srnaMapper          [.] mapWithoutError

Samples: 911K of event 'branches', Event count (approx.): 101918042567
Overhead  Command     Shared Object       Symbol
  43.01%  srnaMapper  srnaMapper          [.] mapWithoutError

Samples: 877K of event 'branch-misses', Event count (approx.): 5689088740
Overhead  Command     Shared Object       Symbol
  50.32%  srnaMapper  srnaMapper          [.] mapWithoutError

Samples: 1M of event 'instructions', Event count (approx.): 1036429973874
Overhead  Command     Shared Object       Symbol
  34.85%  srnaMapper  srnaMapper          [.] mapWithoutError

Samples: 824K of event 'uops_issued.any', Event count (approx.): 1649042473560
Overhead  Command     Shared Object       Symbol
  42.19%  srnaMapper  srnaMapper          [.] mapWithoutError

Samples: 802K of event 'uops_executed.thread', Event count (approx.): 1604052406075
Overhead  Command     Shared Object       Symbol
  48.14%  srnaMapper  srnaMapper          [.] mapWithoutError

Samples: 13K of event 'mem_load_uops_retired.l3_miss', Event count (approx.): 1350194507
Overhead  Command     Shared Object      Symbol
  33.24%  srnaMapper  srnaMapper         [.] addState
  31.00%  srnaMapper  srnaMapper         [.] mapWithoutError

Samples: 142K of event 'mem_load_uops_retired.l2_miss', Event count (approx.): 7143448989
Overhead  Command     Shared Object       Symbol
  40.79%  srnaMapper  srnaMapper          [.] mapWithoutError

Samples: 84K of event 'mem_load_uops_retired.l1_miss', Event count (approx.): 8451553539
Overhead  Command     Shared Object       Symbol
  39.11%  srnaMapper  srnaMapper          [.] mapWithoutError

(Using perf record --period 10000 triggers Workload failed: No such file or directory.)

Are the sample rates the same for branches and branch-misses? A 50% misprediction rate would be extremely bad.

https://perf.wiki.kernel.org/index.php/Tutorial#Period_and_rate explains that the kernel dynamically adjusts the period for each counter, so events fire often enough to get enough samples even for rare events. But you can set the period (how many raw counts trigger a sample); I think that's what perf record --period 10000 does, but I haven't used it.

Use perf stat to get hard numbers. Update: yes, your perf stat results confirm that your branch-misprediction rate is "only" 5%, not 50%, at least for the program as a whole. That's still higher than you'd like (branches are usually frequent and mispredicts are expensive) but not insane.


Also look at the L1d cache miss rate, and maybe at mem_load_retired.l3_miss (and/or l2_miss and l1_miss), to see whether it's really loads that are missing. e.g.

perf stat -etask-clock,context-switches,cpu-migrations,page-faults,cycles,branches,branch-misses,instructions,\
uops_issued.any,uops_executed.thread,\
mem_load_retired.l3_miss,mem_load_retired.l2_miss,mem_load_retired.l1_miss  ./a.out

You can use any of these events with perf record to get some statistical samples of which instructions cause cache misses. Those are precise events (using PEBS), so they should accurately map to the right instruction (unlike "cycles", where counts get attributed to some nearby instruction, often the one that stalls waiting for an input while the ROB is full, rather than the one that was slow to produce it.)

And without any skew for non-PEBS events, which should "blame" a single instruction but don't always interrupt at exactly the right place.


If you're optimizing for your local machine and don't need it to run anywhere else, you could use -O3 -march=native. Not that this will help with cache misses, though.

GCC profile-guided optimization can help it choose branchy vs. branchless. (gcc -O3 -march=native -fprofile-generate / run it with some realistic input data to generate profile outputs / gcc -O3 -march=native -fprofile-use)


Can I assume that branch misprediction is responsible for this slow down?

No, cache misses are more likely. You have a significant number of L3 misses, and going all the way out to DRAM costs hundreds of core clock cycles. Branch prediction can hide some of that if it predicts correctly.

What would be the next step to optimize my code?

Make your data structures more compact so more of them fit in cache, e.g. 32-bit pointers (the Linux x32 ABI: gcc -mx32) if you don't need more than 4 GiB of virtual address space. Or try using 32-bit unsigned indices into a large array instead of raw pointers, although that gives slightly worse load-use latency (a couple of extra cycles on Sandybridge-family.)

And/or improve your access pattern so that you mostly access them in sequential order, so hardware prefetch can bring them into cache before you need to read them.

I'm not familiar enough with https://en.wikipedia.org/wiki/Burrows%E2%80%93Wheeler_transform or its application in sequence alignment to know whether it's possible to make it more efficient, but data compression is inherently problematic because you very often need data-dependent branches and you access scattered data. Still, that's often a worthwhile tradeoff versus even more cache misses.