How to decrease the time spent on one instruction?
I am trying to optimize code in C, and one instruction seems to take about 22% of the time.
The code is compiled with gcc 8.2.0. The flags are -O3 -DNDEBUG -g, plus -Wall -Wextra -Weffc++ -pthread -lrt.
509529.517218 task-clock (msec) # 0.999 CPUs utilized
6,234 context-switches # 0.012 K/sec
10 cpu-migrations # 0.000 K/sec
1,305,885 page-faults # 0.003 M/sec
1,985,640,853,831 cycles # 3.897 GHz (30.76%)
1,897,574,410,921 instructions # 0.96 insn per cycle (38.46%)
229,365,727,020 branches # 450.152 M/sec (38.46%)
13,027,677,754 branch-misses # 5.68% of all branches (38.46%)
604,340,619,317 L1-dcache-loads # 1186.076 M/sec (38.46%)
47,749,307,910 L1-dcache-load-misses # 7.90% of all L1-dcache hits (38.47%)
19,724,956,845 LLC-loads # 38.712 M/sec (30.78%)
3,349,412,068 LLC-load-misses # 16.98% of all LL-cache hits (30.77%)
<not supported> L1-icache-loads
129,878,634 L1-icache-load-misses (30.77%)
604,482,046,140 dTLB-loads # 1186.353 M/sec (30.77%)
4,596,384,416 dTLB-load-misses # 0.76% of all dTLB cache hits (30.77%)
2,493,696 iTLB-loads # 0.005 M/sec (30.77%)
21,356,368 iTLB-load-misses # 856.41% of all iTLB cache hits (30.76%)
<not supported> L1-dcache-prefetches
<not supported> L1-dcache-prefetch-misses
509.843595752 seconds time elapsed
507.706093000 seconds user
1.839848000 seconds sys
VTune Amplifier gives me a hint about one function: https://pasteboard.co/IagrLaF.png
The instruction cmpq seems to take 22% of the whole time, while the other instructions take a negligible amount of time.
perf gives me a slightly different picture, yet I think the results are consistent:
Percent│ bool mapFound = false;
0.00 │ movb $0x0,0x7(%rsp)
│ goDownBwt():
│ bwt_2occ(bwt, getStateInterval(previousState)->k-1, getStateInterval(previousState)->l, nucleotide, &newState->interval.k, &newState->interval.l);
0.00 │ lea 0x20(%rsp),%r12
│ newState->preprocessedInterval = previousState->preprocessedInterval->firstChild + nucleotide;
0.00 │ lea (%rax,%rax,2),%rax
0.00 │ shl $0x3,%rax
0.00 │ mov %rax,0x18(%rsp)
0.01 │ movzwl %dx,%eax
0.00 │ mov %eax,(%rsp)
0.00 │ ↓ jmp d6
│ nop
│ if ((previousState->trace & PREPROCESSED) && (previousState->preprocessedInterval->firstChild != NULL)) {
0.30 │ 88: mov (%rax),%rsi
8.38 │ mov 0x10(%rsi),%rcx
0.62 │ test %rcx,%rcx
0.15 │ ↓ je 1b0
│ newState->preprocessedInterval = previousState->preprocessedInterval->firstChild + nucleotide;
2.05 │ add 0x18(%rsp),%rcx
│ ++stats->nDownPreprocessed;
0.25 │ addq $0x1,0x18(%rdx)
│ newState->trace = PREPROCESSED;
0.98 │ movb $0x10,0x30(%rsp)
│ return (newState->preprocessedInterval->interval.k <= newState->preprocessedInterval->interval.l);
43.36 │ mov 0x8(%rcx),%rax
2.61 │ cmp %rax,(%rcx)
│ newState->preprocessedInterval = previousState->preprocessedInterval->firstChild + nucleotide;
0.05 │ mov %rcx,0x20(%rsp)
│ return (newState->preprocessedInterval->interval.k <= newState->preprocessedInterval->interval.l);
3.47 │ setbe %dl
The function is
inline bool goDownBwt (state_t *previousState, unsigned short nucleotide, state_t *newState) {
++stats->nDown;
if ((previousState->trace & PREPROCESSED) && (previousState->preprocessedInterval->firstChild != NULL)) {
++stats->nDownPreprocessed;
newState->preprocessedInterval = previousState->preprocessedInterval->firstChild + nucleotide;
newState->trace = PREPROCESSED;
return (newState->preprocessedInterval->interval.k <= newState->preprocessedInterval->interval.l);
}
bwt_2occ(bwt, getStateInterval(previousState)->k-1, getStateInterval(previousState)->l, nucleotide, &newState->interval.k, &newState->interval.l);
newState->interval.k = bwt->L2[nucleotide] + newState->interval.k + 1;
newState->interval.l = bwt->L2[nucleotide] + newState->interval.l;
newState->trace = 0;
return (newState->interval.k <= newState->interval.l);
}
state_t is defined as
struct state_t {
union {
bwtinterval_t interval;
preprocessedInterval_t *preprocessedInterval;
};
unsigned char trace;
struct state_t *previousState;
};
preprocessedInterval_t is:
struct preprocessedInterval_t {
bwtinterval_t interval;
preprocessedInterval_t *firstChild;
};
There are few (~1000) state_t structures, but there are many (350k) preprocessedInterval_t objects, allocated elsewhere.
The first if is true 15 billion times out of 19 billion.
Looking for the mispredicted branches in the function with perf record -e branches,branch-misses mytool gives me:
Available samples
2M branches
1M branch-misses
Can I assume that branch misprediction is responsible for this slowdown?
What would be the next step to optimize my code?
The code is available on GitHub.
Edit 1
valgrind --tool=cachegrind gives me:
I refs: 1,893,716,274,393
I1 misses: 4,702,494
LLi misses: 137,142
I1 miss rate: 0.00%
LLi miss rate: 0.00%
D refs: 756,774,557,235 (602,597,601,611 rd + 154,176,955,624 wr)
D1 misses: 39,489,866,187 ( 33,583,272,379 rd + 5,906,593,808 wr)
LLd misses: 3,483,920,786 ( 3,379,118,877 rd + 104,801,909 wr)
D1 miss rate: 5.2% ( 5.6% + 3.8% )
LLd miss rate: 0.5% ( 0.6% + 0.1% )
LL refs: 39,494,568,681 ( 33,587,974,873 rd + 5,906,593,808 wr)
LL misses: 3,484,057,928 ( 3,379,256,019 rd + 104,801,909 wr)
LL miss rate: 0.1% ( 0.1% + 0.1% )
Edit 2
I compiled with -O3 -DNDEBUG -march=native -fprofile-use and ran the command perf stat -etask-clock,context-switches,cpu-migrations,page-faults,cycles,branches,branch-misses,instructions,uops_issued.any,uops_executed.thread,mem_load_uops_retired.l3_miss,mem_load_uops_retired.l2_miss,mem_load_uops_retired.l1_miss ./a.out
508322.348021 task-clock (msec) # 0.998 CPUs utilized
21,592 context-switches # 0.042 K/sec
33 cpu-migrations # 0.000 K/sec
1,305,885 page-faults # 0.003 M/sec
1,978,382,746,597 cycles # 3.892 GHz (44.44%)
228,898,532,311 branches # 450.302 M/sec (44.45%)
12,816,920,039 branch-misses # 5.60% of all branches (44.45%)
1,867,947,557,739 instructions # 0.94 insn per cycle (55.56%)
2,957,085,686,275 uops_issued.any # 5817.343 M/sec (55.56%)
2,864,257,274,102 uops_executed.thread # 5634.726 M/sec (55.56%)
2,490,571,629 mem_load_uops_retired.l3_miss # 4.900 M/sec (55.55%)
12,482,683,638 mem_load_uops_retired.l2_miss # 24.557 M/sec (55.55%)
18,634,558,602 mem_load_uops_retired.l1_miss # 36.659 M/sec (44.44%)
509.210162391 seconds time elapsed
506.213075000 seconds user
2.147749000 seconds sys
Edit 3
I selected the results of perf record -etask-clock,context-switches,cpu-migrations,page-faults,cycles,branches,branch-misses,instructions,uops_issued.any,uops_executed.thread,mem_load_uops_retired.l3_miss,mem_load_uops_retired.l2_miss,mem_load_uops_retired.l1_miss a.out that mention my function:
Samples: 2M of event 'task-clock', Event count (approx.): 517526250000
Overhead Command Shared Object Symbol
49.76% srnaMapper srnaMapper [.] mapWithoutError
Samples: 917K of event 'cycles', Event count (approx.): 891499601652
Overhead Command Shared Object Symbol
49.36% srnaMapper srnaMapper [.] mapWithoutError
Samples: 911K of event 'branches', Event count (approx.): 101918042567
Overhead Command Shared Object Symbol
43.01% srnaMapper srnaMapper [.] mapWithoutError
Samples: 877K of event 'branch-misses', Event count (approx.): 5689088740
Overhead Command Shared Object Symbol
50.32% srnaMapper srnaMapper [.] mapWithoutError
Samples: 1M of event 'instructions', Event count (approx.): 1036429973874
Overhead Command Shared Object Symbol
34.85% srnaMapper srnaMapper [.] mapWithoutError
Samples: 824K of event 'uops_issued.any', Event count (approx.): 1649042473560
Overhead Command Shared Object Symbol
42.19% srnaMapper srnaMapper [.] mapWithoutError
Samples: 802K of event 'uops_executed.thread', Event count (approx.): 1604052406075
Overhead Command Shared Object Symbol
48.14% srnaMapper srnaMapper [.] mapWithoutError
Samples: 13K of event 'mem_load_uops_retired.l3_miss', Event count (approx.): 1350194507
Overhead Command Shared Object Symbol
33.24% srnaMapper srnaMapper [.] addState
31.00% srnaMapper srnaMapper [.] mapWithoutError
Samples: 142K of event 'mem_load_uops_retired.l2_miss', Event count (approx.): 7143448989
Overhead Command Shared Object Symbol
40.79% srnaMapper srnaMapper [.] mapWithoutError
Samples: 84K of event 'mem_load_uops_retired.l1_miss', Event count (approx.): 8451553539
Overhead Command Shared Object Symbol
39.11% srnaMapper srnaMapper [.] mapWithoutError
(Using perf record --period 10000 triggers Workload failed: No such file or directory.)
Do branches and branch-misses have the same sample rate? A 50% mispredict rate would be extremely bad.
https://perf.wiki.kernel.org/index.php/Tutorial#Period_and_rate explains that the kernel dynamically adjusts the period for each counter so events fire often enough to give sufficient samples even for rare events, but you can set the period (how many raw counts trigger a sample). I think that's what perf record --period 10000 does, but I haven't used it.
Use perf stat to get hard numbers. Update: yes, your perf stat results confirm that your branch mispredict rate is "only" 5%, not 50%, at least for the program as a whole. That's still higher than you'd like (branches are usually frequent and mispredicts are expensive), but not insane.
Also check the L1d cache miss rate, and maybe mem_load_retired.l3_miss (and/or l2_miss and l1_miss), to see whether it really is loads that are missing. e.g.
perf stat -etask-clock,context-switches,cpu-migrations,page-faults,cycles,branches,branch-misses,instructions,\
uops_issued.any,uops_executed.thread,\
mem_load_retired.l3_miss,mem_load_retired.l2_miss,mem_load_retired.l1_miss ./a.out
You can use any of these events with perf record to get some statistical samples of which instructions cause cache misses. Those are precise events (using PEBS), so they should map accurately to the right instruction (unlike "cycles", where counts get attributed to some nearby instruction, often the one that stalls waiting for an input while the ROB is full, rather than the one that was slow to produce the input). And there is none of the skew of non-PEBS events, which have to "blame" a single instruction but don't always interrupt at exactly the right place.
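For example (the exact event name is CPU-dependent; the perf stat output above suggests this machine has the mem_load_uops_retired.* spellings; :pp requests precise PEBS attribution):

```shell
# Sample the loads that miss all the way to L3 and attribute them precisely
perf record -e mem_load_uops_retired.l3_miss:pp ./a.out
perf report        # then annotate the hot symbol to see which loads miss
```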
If you're optimizing for your local machine and don't need the binary to run anywhere else, you can use -O3 -march=native. Not that this will help with cache misses, though.
GCC profile-guided optimization can help it choose branchy vs. branchless. (gcc -O3 -march=native -fprofile-generate / run it with some realistic input data to generate profile output / gcc -O3 -march=native -fprofile-use)
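The PGO workflow is a build / train / rebuild cycle; a sketch (file and input names are illustrative, not the project's actual layout):

```shell
gcc -O3 -march=native -fprofile-generate srnaMapper.c -o srnaMapper  # instrumented build
./srnaMapper typical_reads.fastq    # training run: writes *.gcda profile data
gcc -O3 -march=native -fprofile-use srnaMapper.c -o srnaMapper       # optimized rebuild
```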
Can I assume that branch misprediction is responsible for this slow down?
No, cache misses are more likely. You have a significant number of L3 misses, and going all the way to DRAM costs hundreds of core clock cycles. Branch prediction can hide some of that if it predicts correctly.
What would be the next step to optimize my code?
Compact your data structures if possible so that more of them fit in cache, e.g. 32-bit pointers (the Linux x32 ABI: gcc -mx32) if you don't need more than 4 GiB of virtual address space. Or try using 32-bit unsigned indices into a large array instead of raw pointers, at the cost of slightly worse load-use latency (a couple of extra cycles on Sandybridge-family).
And/or improve your access pattern so that you mostly access the objects in sequential order, letting hardware prefetch bring them into cache before you need to read them.
I'm not familiar enough with the Burrows–Wheeler transform (https://en.wikipedia.org/wiki/Burrows%E2%80%93Wheeler_transform) or its application in sequence alignment to know whether it can be made more efficient, but data compression is inherently problematic here: very often you need data-dependent branching and access to scattered data. Even so, the tradeoff is often worth it compared to suffering even more cache misses.