错误的基准,令人费解的组装

Faulty benchmark, puzzling assembly

这里是汇编新手。我编写了一个基准测试来衡量机器在计算转置矩阵张量积时的浮点性能。

鉴于我的机器配备 32GiB RAM(带宽 ~37GiB/s)和 Intel(R) Core(TM) i5-8400 CPU @ 2.80GHz(Turbo 4.0GHz)处理器,我估计最大性能(使用流水线和寄存器中的数据)为 6 核 x 4.0GHz = 24GFLOP/s。但是,当我 运行 我的基准测试时,我测量的是 127GFLOP/s,这显然是一个错误的测量值。

注意:为了测量 FP 性能,我正在测量操作数:n*n*n*n*6n^3 用于矩阵-矩阵乘法,在 n 个复杂的切片上执行数据点,即假设 6 个 FLOPs 用于 1 个复杂的复数乘法)并将其除以每个 运行.

所花费的平均时间

主函数中的代码片段:

// benchmark runs
auto avg_dur = 0.0;
for (auto counter = std::size_t{}; counter < experiment_count; ++counter)
{
    #pragma noinline
    do_timed_run(n, avg_dur);
}
avg_dur /= static_cast<double>(experiment_count);

代码片段:do_timed_run:

void do_timed_run(const std::size_t& n, double& avg_dur)
{
    // create the data and lay first touch
    auto operand0 = matrix<double>(n, n);
    auto operand1 = tensor<double>(n, n, n);
    auto result = tensor<double>(n, n, n);
    
    // first touch
    #pragma omp parallel
    {
        set_first_touch(operand1);
        set_first_touch(result);
    }
    
    // do the experiment
    const auto dur1 = omp_get_wtime() * 1E+6;
    #pragma omp parallel firstprivate(operand0)
    {
        #pragma noinline
        transp_matrix_tensor_mult(operand0, operand1, result);
    }
    const auto dur2 = omp_get_wtime() * 1E+6;
    avg_dur += dur2 - dur1;
}

备注:

  1. 此时,我不提供函数 transp_matrix_tensor_mult 的代码,因为我认为它不相关。
  2. #pragma noinline 是我用来更好地理解反汇编程序输出的调试装置。

现在开始反汇编函数 do_timed_run:

0000000000403a20 <_Z12do_timed_runRKmRd>:
  403a20:   48 81 ec d8 00 00 00    sub    [=12=]xd8,%rsp
  403a27:   48 89 ac 24 c8 00 00    mov    %rbp,0xc8(%rsp)
  403a2e:   00 
  403a2f:   48 89 fd                mov    %rdi,%rbp
  403a32:   48 89 9c 24 c0 00 00    mov    %rbx,0xc0(%rsp)
  403a39:   00 
  403a3a:   48 89 f3                mov    %rsi,%rbx
  403a3d:   48 89 ee                mov    %rbp,%rsi
  403a40:   48 8d 7c 24 78          lea    0x78(%rsp),%rdi
  403a45:   48 89 ea                mov    %rbp,%rdx
  403a48:   4c 89 bc 24 a0 00 00    mov    %r15,0xa0(%rsp)
  403a4f:   00 
  403a50:   4c 89 b4 24 a8 00 00    mov    %r14,0xa8(%rsp)
  403a57:   00 
  403a58:   4c 89 ac 24 b0 00 00    mov    %r13,0xb0(%rsp)
  403a5f:   00 
  403a60:   4c 89 a4 24 b8 00 00    mov    %r12,0xb8(%rsp)
  403a67:   00 
  403a68:   e8 03 f8 ff ff          callq  403270 <_ZN5s3dft6matrixIdEC1ERKmS3_@plt>
  403a6d:   48 89 ee                mov    %rbp,%rsi
  403a70:   48 8d 7c 24 08          lea    0x8(%rsp),%rdi
  403a75:   48 89 ea                mov    %rbp,%rdx
  403a78:   48 89 e9                mov    %rbp,%rcx
  403a7b:   e8 80 f8 ff ff          callq  403300 <_ZN5s3dft6tensorIdEC1ERKmS3_S3_@plt>
  403a80:   48 89 ee                mov    %rbp,%rsi
  403a83:   48 8d 7c 24 40          lea    0x40(%rsp),%rdi
  403a88:   48 89 ea                mov    %rbp,%rdx
  403a8b:   48 89 e9                mov    %rbp,%rcx
  403a8e:   e8 6d f8 ff ff          callq  403300 <_ZN5s3dft6tensorIdEC1ERKmS3_S3_@plt>
  403a93:   bf 88 f3 44 00          mov    [=12=]x44f388,%edi
  403a98:   e8 53 f7 ff ff          callq  4031f0 <__kmpc_global_thread_num@plt>
  403a9d:   89 84 24 d0 00 00 00    mov    %eax,0xd0(%rsp)
  403aa4:   bf c0 f3 44 00          mov    [=12=]x44f3c0,%edi
  403aa9:   33 c0                   xor    %eax,%eax
  403aab:   e8 20 f6 ff ff          callq  4030d0 <__kmpc_ok_to_fork@plt>
  403ab0:   85 c0                   test   %eax,%eax
  403ab2:   74 21                   je     403ad5 <_Z12do_timed_runRKmRd+0xb5>
  403ab4:   ba a5 3c 40 00          mov    [=12=]x403ca5,%edx
  403ab9:   bf c0 f3 44 00          mov    [=12=]x44f3c0,%edi
  403abe:   be 02 00 00 00          mov    [=12=]x2,%esi
  403ac3:   48 8d 4c 24 08          lea    0x8(%rsp),%rcx
  403ac8:   33 c0                   xor    %eax,%eax
  403aca:   4c 8d 41 38             lea    0x38(%rcx),%r8
  403ace:   e8 cd f5 ff ff          callq  4030a0 <__kmpc_fork_call@plt>
  403ad3:   eb 41                   jmp    403b16 <_Z12do_timed_runRKmRd+0xf6>
  403ad5:   bf c0 f3 44 00          mov    [=12=]x44f3c0,%edi
  403ada:   33 c0                   xor    %eax,%eax
  403adc:   8b b4 24 d0 00 00 00    mov    0xd0(%rsp),%esi
  403ae3:   e8 58 f7 ff ff          callq  403240 <__kmpc_serialized_parallel@plt>
  403ae8:   be 9c 13 47 00          mov    [=12=]x47139c,%esi
  403aed:   48 8d bc 24 d0 00 00    lea    0xd0(%rsp),%rdi
  403af4:   00 
  403af5:   48 8d 54 24 08          lea    0x8(%rsp),%rdx
  403afa:   48 8d 4a 38             lea    0x38(%rdx),%rcx
  403afe:   e8 a2 01 00 00          callq  403ca5 <_Z12do_timed_runRKmRd+0x285>
  403b03:   bf c0 f3 44 00          mov    [=12=]x44f3c0,%edi
  403b08:   33 c0                   xor    %eax,%eax
  403b0a:   8b b4 24 d0 00 00 00    mov    0xd0(%rsp),%esi
  403b11:   e8 aa f7 ff ff          callq  4032c0 <__kmpc_end_serialized_parallel@plt>
  403b16:   e8 85 f6 ff ff          callq  4031a0 <omp_get_wtime@plt>
  403b1b:   c5 fb 11 04 24          vmovsd %xmm0,(%rsp)
  403b20:   bf f8 f3 44 00          mov    [=12=]x44f3f8,%edi
  403b25:   33 c0                   xor    %eax,%eax
  403b27:   e8 a4 f5 ff ff          callq  4030d0 <__kmpc_ok_to_fork@plt>
  403b2c:   85 c0                   test   %eax,%eax
  403b2e:   74 25                   je     403b55 <_Z12do_timed_runRKmRd+0x135>
  403b30:   ba 0b 3c 40 00          mov    [=12=]x403c0b,%edx
  403b35:   bf f8 f3 44 00          mov    [=12=]x44f3f8,%edi
  403b3a:   be 03 00 00 00          mov    [=12=]x3,%esi
  403b3f:   48 8d 4c 24 08          lea    0x8(%rsp),%rcx
  403b44:   33 c0                   xor    %eax,%eax
  403b46:   4c 8d 41 38             lea    0x38(%rcx),%r8
  403b4a:   4c 8d 49 70             lea    0x70(%rcx),%r9
  403b4e:   e8 4d f5 ff ff          callq  4030a0 <__kmpc_fork_call@plt>
  403b53:   eb 45                   jmp    403b9a <_Z12do_timed_runRKmRd+0x17a>
  403b55:   bf f8 f3 44 00          mov    [=12=]x44f3f8,%edi
  403b5a:   33 c0                   xor    %eax,%eax
  403b5c:   8b b4 24 d0 00 00 00    mov    0xd0(%rsp),%esi
  403b63:   e8 d8 f6 ff ff          callq  403240 <__kmpc_serialized_parallel@plt>
  403b68:   be a0 13 47 00          mov    [=12=]x4713a0,%esi
  403b6d:   48 8d bc 24 d0 00 00    lea    0xd0(%rsp),%rdi
  403b74:   00 
  403b75:   48 8d 54 24 08          lea    0x8(%rsp),%rdx
  403b7a:   48 8d 4a 38             lea    0x38(%rdx),%rcx
  403b7e:   4c 8d 42 70             lea    0x70(%rdx),%r8
  403b82:   e8 84 00 00 00          callq  403c0b <_Z12do_timed_runRKmRd+0x1eb>
  403b87:   bf f8 f3 44 00          mov    [=12=]x44f3f8,%edi
  403b8c:   33 c0                   xor    %eax,%eax
  403b8e:   8b b4 24 d0 00 00 00    mov    0xd0(%rsp),%esi
  403b95:   e8 26 f7 ff ff          callq  4032c0 <__kmpc_end_serialized_parallel@plt>
  403b9a:   e8 01 f6 ff ff          callq  4031a0 <omp_get_wtime@plt>
  403b9f:   c5 fb 5c 0c 24          vsubsd (%rsp),%xmm0,%xmm1
  403ba4:   c5 fb 10 05 cc c4 01    vmovsd 0x1c4cc(%rip),%xmm0        # 420078 <alpha_beta.61562.0.0.28+0x28>
  403bab:   00 
  403bac:   48 8d 7c 24 40          lea    0x40(%rsp),%rdi
  403bb1:   c4 e2 f9 a9 0b          vfmadd213sd (%rbx),%xmm0,%xmm1
  403bb6:   c5 fb 11 0b             vmovsd %xmm1,(%rbx)
  403bba:   e8 71 f5 ff ff          callq  403130 <_ZN5s3dft9data_packIdED1Ev@plt>
  403bbf:   48 8d 7c 24 08          lea    0x8(%rsp),%rdi
  403bc4:   e8 67 f5 ff ff          callq  403130 <_ZN5s3dft9data_packIdED1Ev@plt>
  403bc9:   48 8d 7c 24 78          lea    0x78(%rsp),%rdi
  403bce:   e8 5d f5 ff ff          callq  403130 <_ZN5s3dft9data_packIdED1Ev@plt>
  403bd3:   4c 8b bc 24 a0 00 00    mov    0xa0(%rsp),%r15
  403bda:   00 
  403bdb:   4c 8b b4 24 a8 00 00    mov    0xa8(%rsp),%r14
  403be2:   00 
  403be3:   4c 8b ac 24 b0 00 00    mov    0xb0(%rsp),%r13
  403bea:   00 
  403beb:   4c 8b a4 24 b8 00 00    mov    0xb8(%rsp),%r12
  403bf2:   00 
  403bf3:   48 8b 9c 24 c0 00 00    mov    0xc0(%rsp),%rbx
  403bfa:   00 
  403bfb:   48 8b ac 24 c8 00 00    mov    0xc8(%rsp),%rbp
  403c02:   00 
  403c03:   48 81 c4 d8 00 00 00    add    [=12=]xd8,%rsp
  403c0a:   c3                      retq   
  403c0b:   48 81 ec d8 00 00 00    sub    [=12=]xd8,%rsp
  403c12:   4c 89 c6                mov    %r8,%rsi
  403c15:   4c 89 a4 24 b8 00 00    mov    %r12,0xb8(%rsp)
  403c1c:   00 
  403c1d:   4c 8d 24 24             lea    (%rsp),%r12
  403c21:   4c 89 e7                mov    %r12,%rdi
  403c24:   48 89 ac 24 c8 00 00    mov    %rbp,0xc8(%rsp)
  403c2b:   00 
  403c2c:   48 89 cd                mov    %rcx,%rbp
  403c2f:   48 89 9c 24 c0 00 00    mov    %rbx,0xc0(%rsp)
  403c36:   00 
  403c37:   48 89 d3                mov    %rdx,%rbx
  403c3a:   4c 89 bc 24 a0 00 00    mov    %r15,0xa0(%rsp)
  403c41:   00 
  403c42:   4c 89 b4 24 a8 00 00    mov    %r14,0xa8(%rsp)
  403c49:   00 
  403c4a:   4c 89 ac 24 b0 00 00    mov    %r13,0xb0(%rsp)
  403c51:   00 
  403c52:   e8 49 03 00 00          callq  403fa0 <_ZN5s3dft6matrixIdEC1ERKS1_> # <--- Here starts the part with the function call...
  403c57:   4c 89 e7                mov    %r12,%rdi
  403c5a:   48 89 de                mov    %rbx,%rsi
  403c5d:   48 89 ea                mov    %rbp,%rdx
  403c60:   e8 8b 01 00 00          callq  403df0 <_Z25transp_matrix_tensor_multIdEvRKN5s3dft6matrixIT_EERKNS0_6tensorIS2_EERS7_>
  403c65:   4c 89 e7                mov    %r12,%rdi
  403c68:   e8 63 01 00 00          callq  403dd0 <_ZN5s3dft6matrixIdED1Ev>     # <--- ...and here it ends
  403c6d:   4c 8b bc 24 a0 00 00    mov    0xa0(%rsp),%r15
  403c74:   00 
  403c75:   4c 8b b4 24 a8 00 00    mov    0xa8(%rsp),%r14
  403c7c:   00 
  403c7d:   4c 8b ac 24 b0 00 00    mov    0xb0(%rsp),%r13
  403c84:   00 
  403c85:   4c 8b a4 24 b8 00 00    mov    0xb8(%rsp),%r12
  403c8c:   00 
  403c8d:   48 8b 9c 24 c0 00 00    mov    0xc0(%rsp),%rbx
  403c94:   00 
  403c95:   48 8b ac 24 c8 00 00    mov    0xc8(%rsp),%rbp
  403c9c:   00 
  403c9d:   48 81 c4 d8 00 00 00    add    [=12=]xd8,%rsp
  403ca4:   c3                      retq   
  403ca5:   48 81 ec d8 00 00 00    sub    [=12=]xd8,%rsp
  403cac:   48 89 d7                mov    %rdx,%rdi
  403caf:   48 89 ac 24 c8 00 00    mov    %rbp,0xc8(%rsp)
  403cb6:   00 
  403cb7:   48 89 9c 24 c0 00 00    mov    %rbx,0xc0(%rsp)
  403cbe:   00 
  403cbf:   48 89 cb                mov    %rcx,%rbx
  403cc2:   4c 89 bc 24 a0 00 00    mov    %r15,0xa0(%rsp)
  403cc9:   00 
  403cca:   4c 89 b4 24 a8 00 00    mov    %r14,0xa8(%rsp)
  403cd1:   00 
  403cd2:   4c 89 ac 24 b0 00 00    mov    %r13,0xb0(%rsp)
  403cd9:   00 
  403cda:   4c 89 a4 24 b8 00 00    mov    %r12,0xb8(%rsp)
  403ce1:   00 
  403ce2:   e8 99 f4 ff ff          callq  403180 <_Z15set_first_touchIdEvRN5s3dft6tensorIT_EE@plt> # <--- here are the calls to set-first-touch
  403ce7:   48 89 df                mov    %rbx,%rdi
  403cea:   e8 91 f4 ff ff          callq  403180 <_Z15set_first_touchIdEvRN5s3dft6tensorIT_EE@plt>
  403cef:   4c 8b bc 24 a0 00 00    mov    0xa0(%rsp),%r15
  403cf6:   00 
  403cf7:   4c 8b b4 24 a8 00 00    mov    0xa8(%rsp),%r14
  403cfe:   00 
  403cff:   4c 8b ac 24 b0 00 00    mov    0xb0(%rsp),%r13
  403d06:   00 
  403d07:   4c 8b a4 24 b8 00 00    mov    0xb8(%rsp),%r12
  403d0e:   00 
  403d0f:   48 8b 9c 24 c0 00 00    mov    0xc0(%rsp),%rbx
  403d16:   00 
  403d17:   48 8b ac 24 c8 00 00    mov    0xc8(%rsp),%rbp
  403d1e:   00 
  403d1f:   48 81 c4 d8 00 00 00    add    [=12=]xd8,%rsp
  403d26:   c3                      retq   
  403d27:   48 89 04 24             mov    %rax,(%rsp)
  403d2b:   bf 30 f4 44 00          mov    [=12=]x44f430,%edi
  403d30:   e8 bb f4 ff ff          callq  4031f0 <__kmpc_global_thread_num@plt>
  403d35:   89 84 24 d0 00 00 00    mov    %eax,0xd0(%rsp)
  403d3c:   48 8d 7c 24 40          lea    0x40(%rsp),%rdi
  403d41:   e8 9a 00 00 00          callq  403de0 <_ZN5s3dft6tensorIdED1Ev>
  403d46:   48 8d 7c 24 08          lea    0x8(%rsp),%rdi
  403d4b:   e8 90 00 00 00          callq  403de0 <_ZN5s3dft6tensorIdED1Ev>
  403d50:   48 8d 7c 24 78          lea    0x78(%rsp),%rdi
  403d55:   e8 76 00 00 00          callq  403dd0 <_ZN5s3dft6matrixIdED1Ev>
  403d5a:   48 8b 3c 24             mov    (%rsp),%rdi
  403d5e:   e8 5d f3 ff ff          callq  4030c0 <_Unwind_Resume@plt>
  403d63:   48 89 04 24             mov    %rax,(%rsp)
  403d67:   bf 68 f4 44 00          mov    [=12=]x44f468,%edi
  403d6c:   e8 7f f4 ff ff          callq  4031f0 <__kmpc_global_thread_num@plt>
  403d71:   89 84 24 d0 00 00 00    mov    %eax,0xd0(%rsp)
  403d78:   eb cc                   jmp    403d46 <_Z12do_timed_runRKmRd+0x326>
  403d7a:   48 89 04 24             mov    %rax,(%rsp)
  403d7e:   bf a0 f4 44 00          mov    [=12=]x44f4a0,%edi
  403d83:   e8 68 f4 ff ff          callq  4031f0 <__kmpc_global_thread_num@plt>
  403d88:   89 84 24 d0 00 00 00    mov    %eax,0xd0(%rsp)
  403d8f:   eb bf                   jmp    403d50 <_Z12do_timed_runRKmRd+0x330>
  403d91:   0f 1f 84 00 00 00 00    nopl   0x0(%rax,%rax,1)
  403d98:   00 
  403d99:   0f 1f 80 00 00 00 00    nopl   0x0(%rax)

主要问题:

  1. 我假设函数是在 定时区域之外调用的,我的假设是否正确?
  2. 如果以上是真的,为什么会这样?
  3. 如果上述情况不正确,我如何找出我的基准测试出错的原因?

次要问题:

  1. 为什么代码中有无条件跳转(在 403ad3、403b53、403d78 和 403d8f)?
  2. 为什么同一个函数中有 3 个 retq 个实例,只有一个 return 路径(在 403c0a、403ca4 和 403d26)?

请注意,我只提供了我认为相关的信息。如有需要,我们将很乐意提供更多信息。预先感谢您的宝贵时间。

编辑:

@PeterCordes 我 did 在启用调试符号的情况下构建。上面发布的程序集是使用 objdump 获得的,它以某种方式没有检索到所需的符号。这是使用 icpc:

获得的程序集的(片段)
#       omp_get_wtime()
        call      omp_get_wtime                                 #122.23
..___tag_value__Z12do_timed_runRKmRd.267:
..LN419:
                                # LOE rbx xmm0
..B4.12:                        # Preds ..B4.11
                                # Execution count [1.00e+00]
..LN420:
        vmovsd    %xmm0, (%rsp)                                 #122.23[spill]
..LN421:
                                # LOE rbx
..B4.13:                        # Preds ..B4.12
                                # Execution count [1.00e+00]
..LN422:
    .loc    1  123  is_stmt 1
        movl      $.2.40_2_kmpc_loc_struct_pack.65, %edi        #123.5
..LN423:
        xorl      %eax, %eax                                    #123.5
..___tag_value__Z12do_timed_runRKmRd.269:
..LN424:
        call      __kmpc_ok_to_fork                             #123.5
..___tag_value__Z12do_timed_runRKmRd.270:
..LN425:
                                # LOE rbx eax
..B4.14:                        # Preds ..B4.13
                                # Execution count [1.00e+00]
..LN426:
        testl     %eax, %eax                                    #123.5
..LN427:
        je        ..B4.17       # Prob 50%                      #123.5
..LN428:
                                # LOE rbx
..B4.15:                        # Preds ..B4.14
                                # Execution count [0.00e+00]
..LN429:
        movl      $.2.40_2_kmpc_loc_struct_pack.65, %edi        #123.5
..LN430:
        xorl      %edx, %edx                                    #123.5
..LN431:
        incq      %rdx                                          #123.5
..LN432:
        xorl      %eax, %eax                                    #123.5
..LN433:
        movl      208(%rsp), %esi                               #123.5
..___tag_value__Z12do_timed_runRKmRd.271:
..LN434:
        call      __kmpc_push_num_threads                       #123.5
..___tag_value__Z12do_timed_runRKmRd.272:
..LN435:
                                # LOE rbx
..B4.16:                        # Preds ..B4.15
                                # Execution count [0.00e+00]
..LN436:
        movl      $L__Z12do_timed_runRKmRd_123__par_region1_2.5, %edx #123.5
..LN437:
        movl      $.2.40_2_kmpc_loc_struct_pack.65, %edi        #123.5
..LN438:
        movl      , %esi                                      #123.5
..LN439:
        lea       8(%rsp), %rcx                                 #123.5
..LN440:
        xorl      %eax, %eax                                    #123.5
..LN441:
        lea       56(%rcx), %r8                                 #123.5
..LN442:
        lea       112(%rcx), %r9                                #123.5
..___tag_value__Z12do_timed_runRKmRd.273:
..LN443:
        call      __kmpc_fork_call                              #123.5
..___tag_value__Z12do_timed_runRKmRd.274:
..LN444:
        jmp       ..B4.20       # Prob 100%                     #123.5
..LN445:
                                # LOE rbx
..B4.17:                        # Preds ..B4.14
                                # Execution count [0.00e+00]
..LN446:
        movl      $.2.40_2_kmpc_loc_struct_pack.65, %edi        #123.5
..LN447:
        xorl      %eax, %eax                                    #123.5
..LN448:
        movl      208(%rsp), %esi                               #123.5
..___tag_value__Z12do_timed_runRKmRd.275:
..LN449:
        call      __kmpc_serialized_parallel                    #123.5
..___tag_value__Z12do_timed_runRKmRd.276:
..LN450:
                                # LOE rbx
..B4.18:                        # Preds ..B4.17
                                # Execution count [0.00e+00]
..LN451:
        movl      $___kmpv_zero_Z12do_timed_runRKmRd_1, %esi    #123.5
..LN452:
        lea       208(%rsp), %rdi                               #123.5
..LN453:
        lea       8(%rsp), %rdx                                 #123.5
..LN454:
        lea       56(%rdx), %rcx                                #123.5
..LN455:
        lea       112(%rdx), %r8                                #123.5
..___tag_value__Z12do_timed_runRKmRd.277:
..LN456:
        call      L__Z12do_timed_runRKmRd_123__par_region1_2.5  #123.5
..___tag_value__Z12do_timed_runRKmRd.278:
..LN457:
                                # LOE rbx
..B4.19:                        # Preds ..B4.18
                                # Execution count [0.00e+00]
..LN458:
        movl      $.2.40_2_kmpc_loc_struct_pack.65, %edi        #123.5
..LN459:
        xorl      %eax, %eax                                    #123.5
..LN460:
        movl      208(%rsp), %esi                               #123.5
..___tag_value__Z12do_timed_runRKmRd.279:
..LN461:
        call      __kmpc_end_serialized_parallel                #123.5
..___tag_value__Z12do_timed_runRKmRd.280:
..LN462:
                                # LOE rbx
..B4.20:                        # Preds ..B4.16 ..B4.19
                                # Execution count [1.00e+00]
..___tag_value__Z12do_timed_runRKmRd.281:
..LN463:
    .loc    1  128  is_stmt 1
#       omp_get_wtime()
        call      omp_get_wtime                                 #128.23

如您所见,输出非常冗长且难以阅读。

每个核心时钟周期 1 个 FP 操作对于现代超标量 CPU 来说是可悲的。您的 Skylake-derived CPU 实际上每个内核每个时钟可以执行 2x 4-wide SIMD double-precision FMA 操作,并且每个 FMA 算作两个 FLOP,因此理论上最大值 = 16 double-precision FLOP每个核心时钟,所以 24 * 16 = 384 GFLOP/S。 (使用 4 doubles 的矢量,即 256 位宽的 AVX)。参见 FLOPS per cycle for sandy-bridge and haswell SSE2/AVX/AVX2

定时区域内有一个a函数调用,callq 403c0b <_Z12do_timed_runRKmRd+0x1eb>(以及__kmpc_end_serialized_parallel东西)。

没有与该调用目标关联的符号,所以我猜你没有在启用调试信息的情况下进行编译。 (这与优化级别是分开的,例如 gcc -g -O3 -march=native -fopenmp 应该 运行 相同的 asm,只是有更多的调试元数据。)即使是 OpenMP 发明的函数也应该在某些时候关联一个符号名称。

就基准有效性而言,一个好的试金石是它是否能根据问题的大小合理扩展。除非您超过 L3 缓存大小或者没有出现更小或更大的问题,否则时间应该以某种合理的方式更改。如果没有,那么您会担心它会优化掉,或者时钟速度 warm-up 影响( 等等,比如 page-faults。)


  1. Why are there non-conditional jumps in code (at 403ad3, 403b53, 403d78 and 403d8f)?

一旦你已经在 if 块中,你无条件地知道 else 块不应该 运行,所以你 jmp 在它上面而不是 jcc(即使 FLAGS 仍然设置,所以您不必再次测试条件)。或者你把一个或另一个块 out-of-line (比如在函数的末尾,或在入口点之前)和 jcc 到它,然后它 jmps 回到另一个之后边。这允许快速路径在没有分支的情况下是连续的。

  1. Why are there 3 retq instances in the same function with only one return path (at 403c0a, 403ca4 and 403d26)?

Duplicate ret 来自“tail duplication”优化,其中多个执行路径,所有 return 都可以获得自己的 ret 而不是跳转到 ret. (以及任何必要清理的副本,例如恢复 regs 和堆栈指针。)