Faulty benchmark, puzzling assembly
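Assembly novice here. I've written a benchmark to measure the floating-point performance of a machine when computing a transposed matrix-tensor product.
Given my machine with 32GiB RAM (bandwidth ~37GiB/s) and an Intel(R) Core(TM) i5-8400 CPU @ 2.80GHz (Turbo 4.0GHz), I estimate the maximum performance (with pipelining and data in registers) to be 6 cores x 4.0GHz = 24GFLOP/s. However, when I run my benchmark I measure 127GFLOP/s, which is obviously a wrong measurement.
Note: to measure the FP performance, I compute the op-count n*n*n*n*6 (n^3 for a matrix-matrix multiplication, performed on n slices of complex data points, i.e. assuming 6 FLOPs for 1 complex-complex multiplication) and divide it by the average time taken for each run.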
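A minimal sketch of that calculation, for reference. The helper names flop_count and gflops are hypothetical and not part of the benchmark; the only things taken from the post are the 6*n^4 op-count and the fact that durations are in microseconds (omp_get_wtime() scaled by 1E+6).

```cpp
#include <cassert>
#include <cstddef>

// Floating-point operations counted by the benchmark: 6 * n^4.
// Computed in double to avoid integer overflow for large n.
inline double flop_count(std::size_t n)
{
    const auto x = static_cast<double>(n);
    return 6.0 * x * x * x * x;
}

// GFLOP/s for one run of size n, given the average duration in microseconds.
// (Hypothetical helper; the unit follows from omp_get_wtime() * 1E+6.)
inline double gflops(std::size_t n, double avg_dur_us)
{
    assert(avg_dur_us > 0.0);
    const double seconds = avg_dur_us * 1e-6;
    return flop_count(n) / seconds / 1e9;
}
```

For example, n = 10 gives 60000 counted operations, so an average duration of 60 microseconds corresponds to about 1 GFLOP/s.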
Code snippet in the main function:
// benchmark runs
auto avg_dur = 0.0;
for (auto counter = std::size_t{}; counter < experiment_count; ++counter)
{
    #pragma noinline
    do_timed_run(n, avg_dur);
}
avg_dur /= static_cast<double>(experiment_count);
Code snippet: do_timed_run:
void do_timed_run(const std::size_t& n, double& avg_dur)
{
    // create the data and lay first touch
    auto operand0 = matrix<double>(n, n);
    auto operand1 = tensor<double>(n, n, n);
    auto result = tensor<double>(n, n, n);
    // first touch
    #pragma omp parallel
    {
        set_first_touch(operand1);
        set_first_touch(result);
    }
    // do the experiment
    const auto dur1 = omp_get_wtime() * 1E+6;
    #pragma omp parallel firstprivate(operand0)
    {
        #pragma noinline
        transp_matrix_tensor_mult(operand0, operand1, result);
    }
    const auto dur2 = omp_get_wtime() * 1E+6;
    avg_dur += dur2 - dur1;
}
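Notes:
- At this point, I am not providing the code for the function transp_matrix_tensor_mult because I don't think it is relevant.
- The #pragma noinline is a debugging construct I use to better understand the disassembler output.
And now the disassembly of the function do_timed_run: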
0000000000403a20 <_Z12do_timed_runRKmRd>:
403a20: 48 81 ec d8 00 00 00 sub $0xd8,%rsp
403a27: 48 89 ac 24 c8 00 00 mov %rbp,0xc8(%rsp)
403a2e: 00
403a2f: 48 89 fd mov %rdi,%rbp
403a32: 48 89 9c 24 c0 00 00 mov %rbx,0xc0(%rsp)
403a39: 00
403a3a: 48 89 f3 mov %rsi,%rbx
403a3d: 48 89 ee mov %rbp,%rsi
403a40: 48 8d 7c 24 78 lea 0x78(%rsp),%rdi
403a45: 48 89 ea mov %rbp,%rdx
403a48: 4c 89 bc 24 a0 00 00 mov %r15,0xa0(%rsp)
403a4f: 00
403a50: 4c 89 b4 24 a8 00 00 mov %r14,0xa8(%rsp)
403a57: 00
403a58: 4c 89 ac 24 b0 00 00 mov %r13,0xb0(%rsp)
403a5f: 00
403a60: 4c 89 a4 24 b8 00 00 mov %r12,0xb8(%rsp)
403a67: 00
403a68: e8 03 f8 ff ff callq 403270 <_ZN5s3dft6matrixIdEC1ERKmS3_@plt>
403a6d: 48 89 ee mov %rbp,%rsi
403a70: 48 8d 7c 24 08 lea 0x8(%rsp),%rdi
403a75: 48 89 ea mov %rbp,%rdx
403a78: 48 89 e9 mov %rbp,%rcx
403a7b: e8 80 f8 ff ff callq 403300 <_ZN5s3dft6tensorIdEC1ERKmS3_S3_@plt>
403a80: 48 89 ee mov %rbp,%rsi
403a83: 48 8d 7c 24 40 lea 0x40(%rsp),%rdi
403a88: 48 89 ea mov %rbp,%rdx
403a8b: 48 89 e9 mov %rbp,%rcx
403a8e: e8 6d f8 ff ff callq 403300 <_ZN5s3dft6tensorIdEC1ERKmS3_S3_@plt>
403a93: bf 88 f3 44 00 mov $0x44f388,%edi
403a98: e8 53 f7 ff ff callq 4031f0 <__kmpc_global_thread_num@plt>
403a9d: 89 84 24 d0 00 00 00 mov %eax,0xd0(%rsp)
403aa4: bf c0 f3 44 00 mov $0x44f3c0,%edi
403aa9: 33 c0 xor %eax,%eax
403aab: e8 20 f6 ff ff callq 4030d0 <__kmpc_ok_to_fork@plt>
403ab0: 85 c0 test %eax,%eax
403ab2: 74 21 je 403ad5 <_Z12do_timed_runRKmRd+0xb5>
403ab4: ba a5 3c 40 00 mov $0x403ca5,%edx
403ab9: bf c0 f3 44 00 mov $0x44f3c0,%edi
403abe: be 02 00 00 00 mov $0x2,%esi
403ac3: 48 8d 4c 24 08 lea 0x8(%rsp),%rcx
403ac8: 33 c0 xor %eax,%eax
403aca: 4c 8d 41 38 lea 0x38(%rcx),%r8
403ace: e8 cd f5 ff ff callq 4030a0 <__kmpc_fork_call@plt>
403ad3: eb 41 jmp 403b16 <_Z12do_timed_runRKmRd+0xf6>
403ad5: bf c0 f3 44 00 mov $0x44f3c0,%edi
403ada: 33 c0 xor %eax,%eax
403adc: 8b b4 24 d0 00 00 00 mov 0xd0(%rsp),%esi
403ae3: e8 58 f7 ff ff callq 403240 <__kmpc_serialized_parallel@plt>
403ae8: be 9c 13 47 00 mov $0x47139c,%esi
403aed: 48 8d bc 24 d0 00 00 lea 0xd0(%rsp),%rdi
403af4: 00
403af5: 48 8d 54 24 08 lea 0x8(%rsp),%rdx
403afa: 48 8d 4a 38 lea 0x38(%rdx),%rcx
403afe: e8 a2 01 00 00 callq 403ca5 <_Z12do_timed_runRKmRd+0x285>
403b03: bf c0 f3 44 00 mov $0x44f3c0,%edi
403b08: 33 c0 xor %eax,%eax
403b0a: 8b b4 24 d0 00 00 00 mov 0xd0(%rsp),%esi
403b11: e8 aa f7 ff ff callq 4032c0 <__kmpc_end_serialized_parallel@plt>
403b16: e8 85 f6 ff ff callq 4031a0 <omp_get_wtime@plt>
403b1b: c5 fb 11 04 24 vmovsd %xmm0,(%rsp)
403b20: bf f8 f3 44 00 mov $0x44f3f8,%edi
403b25: 33 c0 xor %eax,%eax
403b27: e8 a4 f5 ff ff callq 4030d0 <__kmpc_ok_to_fork@plt>
403b2c: 85 c0 test %eax,%eax
403b2e: 74 25 je 403b55 <_Z12do_timed_runRKmRd+0x135>
403b30: ba 0b 3c 40 00 mov $0x403c0b,%edx
403b35: bf f8 f3 44 00 mov $0x44f3f8,%edi
403b3a: be 03 00 00 00 mov $0x3,%esi
403b3f: 48 8d 4c 24 08 lea 0x8(%rsp),%rcx
403b44: 33 c0 xor %eax,%eax
403b46: 4c 8d 41 38 lea 0x38(%rcx),%r8
403b4a: 4c 8d 49 70 lea 0x70(%rcx),%r9
403b4e: e8 4d f5 ff ff callq 4030a0 <__kmpc_fork_call@plt>
403b53: eb 45 jmp 403b9a <_Z12do_timed_runRKmRd+0x17a>
403b55: bf f8 f3 44 00 mov $0x44f3f8,%edi
403b5a: 33 c0 xor %eax,%eax
403b5c: 8b b4 24 d0 00 00 00 mov 0xd0(%rsp),%esi
403b63: e8 d8 f6 ff ff callq 403240 <__kmpc_serialized_parallel@plt>
403b68: be a0 13 47 00 mov $0x4713a0,%esi
403b6d: 48 8d bc 24 d0 00 00 lea 0xd0(%rsp),%rdi
403b74: 00
403b75: 48 8d 54 24 08 lea 0x8(%rsp),%rdx
403b7a: 48 8d 4a 38 lea 0x38(%rdx),%rcx
403b7e: 4c 8d 42 70 lea 0x70(%rdx),%r8
403b82: e8 84 00 00 00 callq 403c0b <_Z12do_timed_runRKmRd+0x1eb>
403b87: bf f8 f3 44 00 mov $0x44f3f8,%edi
403b8c: 33 c0 xor %eax,%eax
403b8e: 8b b4 24 d0 00 00 00 mov 0xd0(%rsp),%esi
403b95: e8 26 f7 ff ff callq 4032c0 <__kmpc_end_serialized_parallel@plt>
403b9a: e8 01 f6 ff ff callq 4031a0 <omp_get_wtime@plt>
403b9f: c5 fb 5c 0c 24 vsubsd (%rsp),%xmm0,%xmm1
403ba4: c5 fb 10 05 cc c4 01 vmovsd 0x1c4cc(%rip),%xmm0 # 420078 <alpha_beta.61562.0.0.28+0x28>
403bab: 00
403bac: 48 8d 7c 24 40 lea 0x40(%rsp),%rdi
403bb1: c4 e2 f9 a9 0b vfmadd213sd (%rbx),%xmm0,%xmm1
403bb6: c5 fb 11 0b vmovsd %xmm1,(%rbx)
403bba: e8 71 f5 ff ff callq 403130 <_ZN5s3dft9data_packIdED1Ev@plt>
403bbf: 48 8d 7c 24 08 lea 0x8(%rsp),%rdi
403bc4: e8 67 f5 ff ff callq 403130 <_ZN5s3dft9data_packIdED1Ev@plt>
403bc9: 48 8d 7c 24 78 lea 0x78(%rsp),%rdi
403bce: e8 5d f5 ff ff callq 403130 <_ZN5s3dft9data_packIdED1Ev@plt>
403bd3: 4c 8b bc 24 a0 00 00 mov 0xa0(%rsp),%r15
403bda: 00
403bdb: 4c 8b b4 24 a8 00 00 mov 0xa8(%rsp),%r14
403be2: 00
403be3: 4c 8b ac 24 b0 00 00 mov 0xb0(%rsp),%r13
403bea: 00
403beb: 4c 8b a4 24 b8 00 00 mov 0xb8(%rsp),%r12
403bf2: 00
403bf3: 48 8b 9c 24 c0 00 00 mov 0xc0(%rsp),%rbx
403bfa: 00
403bfb: 48 8b ac 24 c8 00 00 mov 0xc8(%rsp),%rbp
403c02: 00
403c03: 48 81 c4 d8 00 00 00 add $0xd8,%rsp
403c0a: c3 retq
403c0b: 48 81 ec d8 00 00 00 sub $0xd8,%rsp
403c12: 4c 89 c6 mov %r8,%rsi
403c15: 4c 89 a4 24 b8 00 00 mov %r12,0xb8(%rsp)
403c1c: 00
403c1d: 4c 8d 24 24 lea (%rsp),%r12
403c21: 4c 89 e7 mov %r12,%rdi
403c24: 48 89 ac 24 c8 00 00 mov %rbp,0xc8(%rsp)
403c2b: 00
403c2c: 48 89 cd mov %rcx,%rbp
403c2f: 48 89 9c 24 c0 00 00 mov %rbx,0xc0(%rsp)
403c36: 00
403c37: 48 89 d3 mov %rdx,%rbx
403c3a: 4c 89 bc 24 a0 00 00 mov %r15,0xa0(%rsp)
403c41: 00
403c42: 4c 89 b4 24 a8 00 00 mov %r14,0xa8(%rsp)
403c49: 00
403c4a: 4c 89 ac 24 b0 00 00 mov %r13,0xb0(%rsp)
403c51: 00
403c52: e8 49 03 00 00 callq 403fa0 <_ZN5s3dft6matrixIdEC1ERKS1_> # <--- Here starts the part with the function call...
403c57: 4c 89 e7 mov %r12,%rdi
403c5a: 48 89 de mov %rbx,%rsi
403c5d: 48 89 ea mov %rbp,%rdx
403c60: e8 8b 01 00 00 callq 403df0 <_Z25transp_matrix_tensor_multIdEvRKN5s3dft6matrixIT_EERKNS0_6tensorIS2_EERS7_>
403c65: 4c 89 e7 mov %r12,%rdi
403c68: e8 63 01 00 00 callq 403dd0 <_ZN5s3dft6matrixIdED1Ev> # <--- ...and here it ends
403c6d: 4c 8b bc 24 a0 00 00 mov 0xa0(%rsp),%r15
403c74: 00
403c75: 4c 8b b4 24 a8 00 00 mov 0xa8(%rsp),%r14
403c7c: 00
403c7d: 4c 8b ac 24 b0 00 00 mov 0xb0(%rsp),%r13
403c84: 00
403c85: 4c 8b a4 24 b8 00 00 mov 0xb8(%rsp),%r12
403c8c: 00
403c8d: 48 8b 9c 24 c0 00 00 mov 0xc0(%rsp),%rbx
403c94: 00
403c95: 48 8b ac 24 c8 00 00 mov 0xc8(%rsp),%rbp
403c9c: 00
403c9d: 48 81 c4 d8 00 00 00 add $0xd8,%rsp
403ca4: c3 retq
403ca5: 48 81 ec d8 00 00 00 sub $0xd8,%rsp
403cac: 48 89 d7 mov %rdx,%rdi
403caf: 48 89 ac 24 c8 00 00 mov %rbp,0xc8(%rsp)
403cb6: 00
403cb7: 48 89 9c 24 c0 00 00 mov %rbx,0xc0(%rsp)
403cbe: 00
403cbf: 48 89 cb mov %rcx,%rbx
403cc2: 4c 89 bc 24 a0 00 00 mov %r15,0xa0(%rsp)
403cc9: 00
403cca: 4c 89 b4 24 a8 00 00 mov %r14,0xa8(%rsp)
403cd1: 00
403cd2: 4c 89 ac 24 b0 00 00 mov %r13,0xb0(%rsp)
403cd9: 00
403cda: 4c 89 a4 24 b8 00 00 mov %r12,0xb8(%rsp)
403ce1: 00
403ce2: e8 99 f4 ff ff callq 403180 <_Z15set_first_touchIdEvRN5s3dft6tensorIT_EE@plt> # <--- here are the calls to set-first-touch
403ce7: 48 89 df mov %rbx,%rdi
403cea: e8 91 f4 ff ff callq 403180 <_Z15set_first_touchIdEvRN5s3dft6tensorIT_EE@plt>
403cef: 4c 8b bc 24 a0 00 00 mov 0xa0(%rsp),%r15
403cf6: 00
403cf7: 4c 8b b4 24 a8 00 00 mov 0xa8(%rsp),%r14
403cfe: 00
403cff: 4c 8b ac 24 b0 00 00 mov 0xb0(%rsp),%r13
403d06: 00
403d07: 4c 8b a4 24 b8 00 00 mov 0xb8(%rsp),%r12
403d0e: 00
403d0f: 48 8b 9c 24 c0 00 00 mov 0xc0(%rsp),%rbx
403d16: 00
403d17: 48 8b ac 24 c8 00 00 mov 0xc8(%rsp),%rbp
403d1e: 00
403d1f: 48 81 c4 d8 00 00 00 add $0xd8,%rsp
403d26: c3 retq
403d27: 48 89 04 24 mov %rax,(%rsp)
403d2b: bf 30 f4 44 00 mov $0x44f430,%edi
403d30: e8 bb f4 ff ff callq 4031f0 <__kmpc_global_thread_num@plt>
403d35: 89 84 24 d0 00 00 00 mov %eax,0xd0(%rsp)
403d3c: 48 8d 7c 24 40 lea 0x40(%rsp),%rdi
403d41: e8 9a 00 00 00 callq 403de0 <_ZN5s3dft6tensorIdED1Ev>
403d46: 48 8d 7c 24 08 lea 0x8(%rsp),%rdi
403d4b: e8 90 00 00 00 callq 403de0 <_ZN5s3dft6tensorIdED1Ev>
403d50: 48 8d 7c 24 78 lea 0x78(%rsp),%rdi
403d55: e8 76 00 00 00 callq 403dd0 <_ZN5s3dft6matrixIdED1Ev>
403d5a: 48 8b 3c 24 mov (%rsp),%rdi
403d5e: e8 5d f3 ff ff callq 4030c0 <_Unwind_Resume@plt>
403d63: 48 89 04 24 mov %rax,(%rsp)
403d67: bf 68 f4 44 00 mov $0x44f468,%edi
403d6c: e8 7f f4 ff ff callq 4031f0 <__kmpc_global_thread_num@plt>
403d71: 89 84 24 d0 00 00 00 mov %eax,0xd0(%rsp)
403d78: eb cc jmp 403d46 <_Z12do_timed_runRKmRd+0x326>
403d7a: 48 89 04 24 mov %rax,(%rsp)
403d7e: bf a0 f4 44 00 mov $0x44f4a0,%edi
403d83: e8 68 f4 ff ff callq 4031f0 <__kmpc_global_thread_num@plt>
403d88: 89 84 24 d0 00 00 00 mov %eax,0xd0(%rsp)
403d8f: eb bf jmp 403d50 <_Z12do_timed_runRKmRd+0x330>
403d91: 0f 1f 84 00 00 00 00 nopl 0x0(%rax,%rax,1)
403d98: 00
403d99: 0f 1f 80 00 00 00 00 nopl 0x0(%rax)
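Major questions:
- I assume the function is being called outside the timed region. Is that assumption correct?
- If so, why does that happen?
- If not, how can I find out what is wrong with my benchmark?
Minor questions:
- Why are there non-conditional jumps in code (at 403ad3, 403b53, 403d78 and 403d8f)?
- Why are there 3 retq instances in the same function with only one return path (at 403c0a, 403ca4 and 403d26)?
Please note that I have only posted the information I deem relevant. I will gladly provide more if needed. Thanks in advance for your time.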
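Edit:
@PeterCordes I did build with debugging symbols enabled. The assembly posted above was obtained using objdump, which somehow did not retrieve the desired symbols. Here is (a snippet of) the assembly obtained using icpc: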
# omp_get_wtime()
call omp_get_wtime #122.23
..___tag_value__Z12do_timed_runRKmRd.267:
..LN419:
# LOE rbx xmm0
..B4.12: # Preds ..B4.11
# Execution count [1.00e+00]
..LN420:
vmovsd %xmm0, (%rsp) #122.23[spill]
..LN421:
# LOE rbx
..B4.13: # Preds ..B4.12
# Execution count [1.00e+00]
..LN422:
.loc 1 123 is_stmt 1
movl $.2.40_2_kmpc_loc_struct_pack.65, %edi #123.5
..LN423:
xorl %eax, %eax #123.5
..___tag_value__Z12do_timed_runRKmRd.269:
..LN424:
call __kmpc_ok_to_fork #123.5
..___tag_value__Z12do_timed_runRKmRd.270:
..LN425:
# LOE rbx eax
..B4.14: # Preds ..B4.13
# Execution count [1.00e+00]
..LN426:
testl %eax, %eax #123.5
..LN427:
je ..B4.17 # Prob 50% #123.5
..LN428:
# LOE rbx
..B4.15: # Preds ..B4.14
# Execution count [0.00e+00]
..LN429:
movl $.2.40_2_kmpc_loc_struct_pack.65, %edi #123.5
..LN430:
xorl %edx, %edx #123.5
..LN431:
incq %rdx #123.5
..LN432:
xorl %eax, %eax #123.5
..LN433:
movl 208(%rsp), %esi #123.5
..___tag_value__Z12do_timed_runRKmRd.271:
..LN434:
call __kmpc_push_num_threads #123.5
..___tag_value__Z12do_timed_runRKmRd.272:
..LN435:
# LOE rbx
..B4.16: # Preds ..B4.15
# Execution count [0.00e+00]
..LN436:
movl $L__Z12do_timed_runRKmRd_123__par_region1_2.5, %edx #123.5
..LN437:
movl $.2.40_2_kmpc_loc_struct_pack.65, %edi #123.5
..LN438:
movl $3, %esi #123.5
..LN439:
lea 8(%rsp), %rcx #123.5
..LN440:
xorl %eax, %eax #123.5
..LN441:
lea 56(%rcx), %r8 #123.5
..LN442:
lea 112(%rcx), %r9 #123.5
..___tag_value__Z12do_timed_runRKmRd.273:
..LN443:
call __kmpc_fork_call #123.5
..___tag_value__Z12do_timed_runRKmRd.274:
..LN444:
jmp ..B4.20 # Prob 100% #123.5
..LN445:
# LOE rbx
..B4.17: # Preds ..B4.14
# Execution count [0.00e+00]
..LN446:
movl $.2.40_2_kmpc_loc_struct_pack.65, %edi #123.5
..LN447:
xorl %eax, %eax #123.5
..LN448:
movl 208(%rsp), %esi #123.5
..___tag_value__Z12do_timed_runRKmRd.275:
..LN449:
call __kmpc_serialized_parallel #123.5
..___tag_value__Z12do_timed_runRKmRd.276:
..LN450:
# LOE rbx
..B4.18: # Preds ..B4.17
# Execution count [0.00e+00]
..LN451:
movl $___kmpv_zero_Z12do_timed_runRKmRd_1, %esi #123.5
..LN452:
lea 208(%rsp), %rdi #123.5
..LN453:
lea 8(%rsp), %rdx #123.5
..LN454:
lea 56(%rdx), %rcx #123.5
..LN455:
lea 112(%rdx), %r8 #123.5
..___tag_value__Z12do_timed_runRKmRd.277:
..LN456:
call L__Z12do_timed_runRKmRd_123__par_region1_2.5 #123.5
..___tag_value__Z12do_timed_runRKmRd.278:
..LN457:
# LOE rbx
..B4.19: # Preds ..B4.18
# Execution count [0.00e+00]
..LN458:
movl $.2.40_2_kmpc_loc_struct_pack.65, %edi #123.5
..LN459:
xorl %eax, %eax #123.5
..LN460:
movl 208(%rsp), %esi #123.5
..___tag_value__Z12do_timed_runRKmRd.279:
..LN461:
call __kmpc_end_serialized_parallel #123.5
..___tag_value__Z12do_timed_runRKmRd.280:
..LN462:
# LOE rbx
..B4.20: # Preds ..B4.16 ..B4.19
# Execution count [1.00e+00]
..___tag_value__Z12do_timed_runRKmRd.281:
..LN463:
.loc 1 128 is_stmt 1
# omp_get_wtime()
call omp_get_wtime #128.23
As you can see, the output is very verbose and hard to read.
1 FP operation per core clock cycle would be pathetic for a modern superscalar CPU. Your Skylake-derived CPU can actually do 2x 4-wide SIMD double-precision FMA operations per core per clock, and each FMA counts as two FLOPs, so the theoretical max is 16 double-precision FLOPs per core clock, i.e. 24 * 16 = 384 GFLOP/s. (That uses vectors of 4 doubles, i.e. 256-bit wide AVX.) See FLOPS per cycle for sandy-bridge and haswell SSE2/AVX/AVX2
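The peak figure is just a product of a few factors, which a short sketch makes explicit. The function name peak_gflops is hypothetical, and the inputs are the ones assumed in this thread (6 cores all sustaining the 4.0 GHz turbo clock from the question, 2 FMA units per core, 4 doubles per 256-bit vector), not measured values.

```cpp
#include <cassert>

// Theoretical peak double-precision throughput in GFLOP/s.
// Every argument is an assumption supplied by the caller.
inline double peak_gflops(int cores, double clock_ghz, int fma_units,
                          int simd_lanes, int flops_per_fma = 2)
{
    assert(cores > 0 && clock_ghz > 0.0 && fma_units > 0 &&
           simd_lanes > 0 && flops_per_fma > 0);
    // cores * GHz gives giga-cycles per second; the remaining factors
    // give floating-point operations per cycle per core.
    return cores * clock_ghz * fma_units * simd_lanes * flops_per_fma;
}
```

With these assumptions, peak_gflops(6, 4.0, 2, 4) evaluates to 384, while the question's estimate of one scalar non-FMA operation per cycle corresponds to peak_gflops(6, 4.0, 1, 1, 1), i.e. 24. A measurement of 127 GFLOP/s sits between the two, so it is not impossible on its own.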
There is a function call inside the timed region, callq 403c0b <_Z12do_timed_runRKmRd+0x1eb> (as well as the __kmpc_end_serialized_parallel stuff).
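There's no symbol associated with that call target, so I guess you didn't compile with debug info enabled. (That's separate from the optimization level: e.g. gcc -g -O3 -march=native -fopenmp should produce the same asm, just with more debug metadata.) Even a function invented by OpenMP should have a symbol name associated with it at some point.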
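As far as benchmark validity goes, a good litmus test is whether it scales reasonably with problem size. Unless a smaller or larger problem moves you across the L3 cache size boundary, the time should change in some reasonable way. If it doesn't, then you'd worry about the work being optimized away, or about clock-speed warm-up effects (and more, like page-faults).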
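One way to apply that litmus test is to time the kernel at two sizes and estimate the exponent p in time ~ n^p. The sketch below is only an illustration: time_once and scaling_exponent are hypothetical helper names, and the kernel passed in stands for whatever is being benchmarked.

```cpp
#include <cassert>
#include <chrono>
#include <cmath>
#include <cstddef>

// Wall-clock seconds taken by a single call of kernel(n).
template <typename Kernel>
double time_once(Kernel&& kernel, std::size_t n)
{
    const auto start = std::chrono::steady_clock::now();
    kernel(n);
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}

// Empirical exponent p in time ~ n^p, estimated from two (size, time) samples.
inline double scaling_exponent(std::size_t n1, double t1, std::size_t n2, double t2)
{
    assert(n1 > 0 && n2 > n1 && t1 > 0.0 && t2 > 0.0);
    return std::log(t2 / t1) /
           std::log(static_cast<double>(n2) / static_cast<double>(n1));
}
```

For an op-count of 6*n^4, doubling n should multiply the time by roughly 16, which gives an exponent near 4 as long as both sizes stay in the same cache regime. An exponent near 0 would suggest the timed work is not actually being executed.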
- Why are there non-conditional jumps in code (at 403ad3, 403b53, 403d78 and 403d8f)?
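Once you're already in an if block, you unconditionally know the else block should not run, so you jmp over it instead of using a jcc (even if FLAGS were still set, so you wouldn't have to test the condition again). Or you put one or the other block out-of-line (like at the end of the function, or before the entry point) and jcc to it, and then it jmps back to just after the other side. That allows the fast path to be contiguous with no taken branches.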
- Why are there 3 retq instances in the same function with only one return path (at 403c0a, 403ca4 and 403d26)?
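Duplicate ret comes from the "tail duplication" optimization, where multiple paths of execution that all return can each get their own copy of ret instead of jumping to a shared ret. (And copies of any necessary cleanup, like restoring registers and the stack pointer.)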