测量函数的 rdtsc 时序

Question

我想用 rdtsc 为函数调用计时。所以我通过以下两种方式进行了测量。

循环调用。聚合循环内的每个 rdtsc 差异并除以调用次数。（假设这是 N）
循环调用。得到循环本身的rdtsc差值除以N.

但我看到一些不一致的行为。

当我增加 N 时，方法 1 和方法 2 中的时间都在相当单调地减少。对于方法 2，这是可以理解的，因为它会分摊循环控制开销。但是我不确定方法1是怎么回事。
实际上，对于方法 2，每次当我增加 N 时，我得到的 N=1 的值似乎每次都只除以新的 N。检查 gdb 反汇编让我意识到这是在 -O2 处进行了一些编译器优化，其中在第二种情况下跳过了循环。所以我用 -O0 重试，其中 gdb 反汇编显示了第二种情况的实际循环。

代码如下。

    #include <stdio.h>
    #include <inttypes.h>
    #include <stdlib.h>

    typedef unsigned long long ticks;

    static __inline__ ticks getticks(void) {
      unsigned a, d; 
      asm volatile("rdtsc" : "=a" (a), "=d" (d)); 
      return ((ticks)a) | (((ticks)d) << 32); 
    }

    __attribute__ ((noinline))
    void bar() {

    }

    int main(int argc, char** argv) {

       long long N = 1000000; 
       N = atoi(argv[1]);
       int i;
       long long bar_total = 0;

       ticks start = 0, end = 0;

       for (i = 0; i < N; i++) {
         start = getticks();
         bar();
         end = getticks();
         bar_total += (end - start);
       } 

       fprintf(stdout, "Total invocations : %lld\n", N);
       fprintf(stdout, "[regular] bar overhead : %lf\n", ((double)bar_total/  N));

      start = getticks();
      for (i = 0; i < N; i++) {
        bar();
      } 
      end = getticks();

      bar_total = (end - start);

      fprintf(stdout, "[Loop] bar overhead : %lf\n", ((double)bar_total/ N));

      return 0;

     }

知道这里发生了什么吗？如果需要，我也可以进行 gdb 反汇编。我使用了 http://dasher.wustl.edu/tinker/distribution/fftw/kernel/cycle.h

中的 rdtsc 实现

编辑： 我将不得不收回我的第二个陈述，即在第二种情况下，在 -O0 处，时间与 N 成正比地下降。我想这是我在构建过程中犯的一些错误，导致一些旧版本持续存在。无论如何它仍然随着方法 1 的数字而有所下降。这里有一些不同 N 值的数字。

taskset -c 2 ./example.exe 1
Total invocations : 1
[regular] bar overhead : 108.000000
[Loop] bar overhead : 138.000000

taskset -c 2 ./example.exe 10
Total invocations : 10
[regular] bar overhead : 52.900000
[Loop] bar overhead : 40.700000

taskset -c 2 ./example.exe 100
Total invocations : 100
[regular] bar overhead : 46.780000
[Loop] bar overhead : 15.570000

taskset -c 2 ./example.exe 1000
Total invocations : 1000
[regular] bar overhead : 46.069000
[Loop] bar overhead : 13.669000

taskset -c 2 ./example.exe 100000
Total invocations : 10000
[regular] bar overhead : 46.010100
[Loop] bar overhead : 13.444900

taskset -c 2 ./example.exe 100000000
Total invocations : 100000000
[regular] bar overhead : 26.970272
[Loop] bar overhead : 5.201252

taskset -c 2 ./example.exe 1000000000
Total invocations : 1000000000
[regular] bar overhead : 18.853279
[Loop] bar overhead : 5.218234

taskset -c 2 ./example.exe 10000000000
Total invocations : 1410065408
[regular] bar overhead : 18.540719
[Loop] bar overhead : 5.216395

我现在看到两个新行为。

方法一收敛速度比方法二慢。但我还是很纳闷为什么不同的N设置值会有如此大的差异。也许我在这里犯了一些我目前没有看到的基本错误。
方法 1 的值实际上比方法 2 大一些。我预计它与方法 2 的值相同或略小，因为它不包含循环控制开销。

问题

总而言之，我的问题是

为什么两种方法给出的值在增加N时变化如此之大？特别适用于不考虑循环控制开销的方法 1。
为什么第一种方法在计算中排除了循环控制开销后，第二种方法结果小于第一种方法？

编辑 2

关于建议的 rdtscp 解决方案。

由于对内联汇编一窍不通，我做了以下操作。

static __inline__ ticks getstart(void) {
  unsigned cycles_high = 0, cycles_low = 0; 
  asm volatile ("CPUID\n\t"
             "RDTSC\n\t"
             "mov %%edx, %0\n\t"
             "mov %%eax, %1\n\t": "=r" (cycles_high), "=r" (cycles_low)::
             "%rax", "%rbx", "%rcx", "%rdx");
  return ((ticks)cycles_high) | (((ticks)cycles_low) << 32); 
}

static __inline__ ticks getend(void) {
  unsigned cycles_high = 0, cycles_low = 0; 
  asm volatile("RDTSCP\n\t"
         "mov %%edx, %0\n\t"
          "mov %%eax, %1\n\t"
           "CPUID\n\t": "=r" (cycles_high), "=r" (cycles_low)::
           "%rax", "%rbx", "%rcx", "%rdx");
  return ((ticks)cycles_high) | (((ticks)cycles_low) << 32); 
}

并在函数调用前后使用了上述方法。但是现在我得到了如下不合理的结果。

Total invocations : 1000000
[regular] bar overhead : 304743228324.708374
[Loop] bar overhead : 33145641307.734016

有什么收获？我想将它们分解为内联方法，因为我在多个地方看到它的使用。

一个。解决办法在评论。

Answer 1

您使用普通的 rdtsc 指令，这可能无法在乱序 CPU 上正常工作，例如 Xeons 和 Cores。您应该添加一些序列化指令或切换到 rdtscp instruction:

http://en.wikipedia.org/wiki/Time_Stamp_Counter

Starting with the Pentium Pro, Intel processors have supported out-of-order execution, where instructions are not necessarily performed in the order they appear in the executable. This can cause RDTSC to be executed later than expected, producing a misleading cycle count.[3] This problem can be solved by executing a serializing instruction, such as CPUID, to force every preceding instruction to complete before allowing the program to continue, or by using the RDTSCP instruction, which is a serializing variant of the RDTSC instruction.

Intel 最近有使用手册 rdtsc/rdtscp - How to Benchmark Code Execution Times on Intel IA-32 and IA-64 Instruction Set Architectures (ia-32-ia-64-benchmark-code-execution-paper.pdf, 324264-001, 2010)。他们推荐 cpuid+rdtsc 开始和 rdtscp 结束定时器：

The solution to the problem presented in Section 0 is to add a CPUID instruction just after the RDTPSCP and the two mov instructions (to store in memory the value of edx and eax). The implementation is as follows:

asm volatile ("CPUID\n\t"
 "RDTSC\n\t"
 "mov %%edx, %0\n\t"
 "mov %%eax, %1\n\t": "=r" (cycles_high), "=r" (cycles_low)::
"%rax", "%rbx", "%rcx", "%rdx");
/***********************************/
/*call the function to measure here*/
/***********************************/
asm volatile("RDTSCP\n\t"
 "mov %%edx, %0\n\t"
 "mov %%eax, %1\n\t"
 "CPUID\n\t": "=r" (cycles_high1), "=r" (cycles_low1)::
"%rax", "%rbx", "%rcx", "%rdx");

start = ( ((uint64_t)cycles_high << 32) | cycles_low );
end = ( ((uint64_t)cycles_high1 << 32) | cycles_low1 );

In the code above, the first CPUID call implements a barrier to avoid out-of-order execution of the instructions above and below the RDTSC instruction. Nevertheless, this call does not affect the measurement since it comes before the RDTSC (i.e., before the timestamp register is read). The first RDTSC then reads the timestamp register and the value is stored in memory. Then the code that we want to measure is executed. If the code is a call to a function, it is recommended to declare such function as “inline” so that from an assembly perspective there is no overhead in calling the function itself. The RDTSCP instruction reads the timestamp register for the second time and guarantees that the execution of all the code we wanted to measure is completed.

你的例子不太正确；您尝试测量空函数 bar()，但它太短以至于您在方法 1 (for() { rdtsc; bar(); rdtsc)) 中测量 rdtsc 开销。根据 Agner Fog 的 table for haswell - http://www.agner.org/optimize/instruction_tables.pdf 第 191 页（长 table "Intel Haswell List of instruction timings and μop breakdown"，在它的最后） RDTSC 有 15 微指令（不可能融合）和 24 滴答的延迟； RDTSCP（对于较旧的微体系结构，Sandy Bridge 的延迟为 23 微指令和 36 滴答，而 rdtsc 为 21 微指令和 28 滴答）。因此，您不能使用普通的 rdtsc（或 rdtscp）来直接测量这种短代码。

Answer 2

你试过clock_gettime(CLOCK_MONOTONIC, &tp)了吗？应该非常接近于手动读取循环计数器，同时请记住，循环计数器可能在 cpu 个内核之间不同步。

测量函数的 rdtsc 时序

rdtsc timing for a measuring a function

performance

gcc

x86-64

rdtsc