The use of rdtsc() in my program to obtain the number of clock cycles for single- and double-word operations?

Question

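In theory we know that the double-word addition/subtraction takes 2 times of a single-word. Similarly, the cost ratio of single-word multiplication to addition is taken as 3 because of fast integer multiplier of CPU. I wrote the following C program using GCC on Ubuntu LTS 14.04 to check the number of clock cycles on my machine, an Intel Sandy Bridge Core i5-2410M. Most of the time the program returns 6 clock cycles for the 128-bit addition, but I have taken the best case. I compiled with the command (gcc -o ow -O3 cost.c), and the result is as follows: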

32-bit Add: Clock cycles = 1    64-bit Add: Clock cycles = 1    64-bit Mult: Clock cycles = 2   128-bit Add: Clock cycles = 5

The program is as follows:

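#include <stdio.h>   /* printf */
#include <stdint.h>  /* uint32_t, uint64_t, int64_t */
#include <stdlib.h>  /* rand, srand */
#include <time.h>    /* time */

#define n 500
#define counter 50000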

typedef uint64_t utype64;
typedef int64_t type64;
typedef __int128 type128;

__inline__ utype64 rdtsc() {
        uint32_t lo, hi;
        __asm__ __volatile__ ("xorl %%eax,%%eax \n        cpuid"::: "%rax", "%rbx", "%rcx", "%rdx");
        __asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));
        return (utype64)hi << 32 | lo;
}

int main(){
    utype64 start, end;
    type64 a[n], b[n], c[n];
    type128 d[n], e[n], f[n];
    int g[n], h[n];
    unsigned short i, j;
    srand(time(NULL));
    for(i=0;i<n;i++){ g[i]=rand(); h[i]=rand(); b[i]=(rand()+2294967295); e[i]=(type128)(rand()+2294967295)*(rand()+2294967295);}
    for(j=0;j<counter;j++){
       start=rdtsc();
       for(i=0;i<n;i++){ a[i]=(type64)g[i]+h[i]; }
       end=rdtsc();
       if((j+1)%5000 == 0)
          printf("%lu-bit Add: Clock cycles = %lu \t", sizeof(g[0])*8, (end-start)/n);

       start=rdtsc();
       for(i=0;i<n;i++){ c[i]=a[i]+b[i]; }
       end=rdtsc();
       if((j+1)%5000 == 0)
          printf("%lu-bit Add: Clock cycles = %lu \t", sizeof(a[0])*8, (end-start)/n);

       start=rdtsc();
       for(i=0;i<n;i++){ d[i]=(type128)c[i]*b[i]; }
       end=rdtsc();
       if((j+1)%5000 == 0)
          printf("%lu-bit Mult: Clock cycles = %lu \t", sizeof(c[0])*8, (end-start)/n);

       start=rdtsc();
       for(i=0;i<n;i++){ f[i]=d[i]+e[i]; }
       end=rdtsc();
       if((j+1)%5000 == 0){
          printf("%lu-bit Add: Clock cycles = %lu \n", sizeof(d[0])*8, (end-start)/n);
        printf("f[%hu]= %ld %ld \n\n", i-7, (type64)(f[i-7]>>64), (type64)(f[i-7]));}
   }

return 0;
}

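Two things in the result bother me.

1) Can the number of clock cycles for a (64-bit) multiplication really be 2?

2) Why is the number of clock cycles for a double-word addition more than 2 times that of a single-word addition?

I am mainly concerned about case (2). Now the question arises: is it because of my program logic, or is it due to GCC compiler optimization?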

Answer 1

In theory we know that the double-word addition/subtraction takes 2 times of a single-word.

No, we don't.

Similarly, the cost ratio of single-word multiplication to addition is taken as 3 because of fast integer multiplier of CPU.

No, it isn't.

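You are not measuring instructions. You are measuring statements in your program, which may or may not bear any relation to the instructions your compiler emits. For example, my compiler, after I fixed your code so that it would compile, vectorized some of the loops, so each instruction adds multiple values. The first loop by itself is still 23 instructions long and is still reported by your code as 1 cycle.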

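Modern (as in the last 25 years) CPUs do not execute one instruction at a time. They execute multiple instructions at the same time and can execute them out of order.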
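To make the effect of overlapped execution visible, here is a minimal sketch, assuming GCC or Clang on x86-64. The names `KEEP_IN_REG`, `chain_add`, `indep_add` and `time_chain` are made up for this illustration and are not from the question. The empty `asm` statement is only there to stop the compiler from folding or vectorizing the loops; `__rdtsc()` and `_mm_lfence()` come from `<x86intrin.h>`.

```c
#include <stdint.h>
#include <x86intrin.h>   /* __rdtsc, _mm_lfence (GCC/Clang, x86 only) */

/* Empty asm that makes the compiler treat v as modified, so it cannot
   collapse the loop into one multiplication or vectorize it. */
#define KEEP_IN_REG(v) __asm__ __volatile__("" : "+r"(v))

/* One long dependency chain: every add needs the previous result. */
uint64_t chain_add(uint64_t x, uint64_t step, uint32_t iters) {
    for (uint32_t i = 0; i < iters; i++) {
        x += step;
        KEEP_IN_REG(x);
    }
    return x;
}

/* Four independent chains: the core is free to overlap these adds. */
uint64_t indep_add(uint64_t step, uint32_t iters) {
    uint64_t a = 0, b = 0, c = 0, d = 0;
    for (uint32_t i = 0; i < iters; i++) {
        a += step; KEEP_IN_REG(a);
        b += step; KEEP_IN_REG(b);
        c += step; KEEP_IN_REG(c);
        d += step; KEEP_IN_REG(d);
    }
    return a + b + c + d;
}

/* TSC ticks (not necessarily core cycles) spent in chain_add.
   The fences keep the timed work between the two TSC reads. */
uint64_t time_chain(uint32_t iters, uint64_t *result) {
    _mm_lfence();
    uint64_t t0 = __rdtsc();
    _mm_lfence();
    *result = chain_add(0, 1, iters);
    _mm_lfence();
    uint64_t t1 = __rdtsc();
    _mm_lfence();
    return t1 - t0;
}
```

For the same `iters`, `indep_add` performs four times as many additions as `chain_add`, yet on a core with several integer units it would be expected to take well under four times as long. That gap is why a single "cycles per add" figure says little without stating whether latency or throughput is meant.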

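Then you have memory accesses. On your CPU there is no instruction that can take a value from memory, add it to another value from memory, and store the result in a third memory location. So multiple instructions must already have been executed. Also, memory accesses cost so much more than arithmetic instructions that anything touching memory (unless it hits the L1 cache all the time) will be dominated by memory access time.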
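As an illustration of why one source-level statement is several machine operations, here is a rough sketch of a 128-bit addition written out as two 64-bit halves. The type `u128_limbs` and both function names are hypothetical; `unsigned __int128` is the GCC/Clang extension the question already uses. For operands in memory, each such add also implies loading four 64-bit words and storing two.

```c
#include <stdint.h>

/* Hypothetical helper type: a 128-bit value as two 64-bit halves. */
typedef struct { uint64_t lo, hi; } u128_limbs;

/* Manual 128-bit add: low halves first, then high halves plus carry.
   On x86-64 compilers typically turn this into an add/adc pair. */
u128_limbs add128(u128_limbs a, u128_limbs b) {
    u128_limbs r;
    r.lo = a.lo + b.lo;
    uint64_t carry = (r.lo < a.lo);   /* 1 if the low add wrapped around */
    r.hi = a.hi + b.hi + carry;
    return r;
}

/* Same computation through the compiler's __int128, for cross-checking. */
u128_limbs add128_builtin(u128_limbs a, u128_limbs b) {
    unsigned __int128 x = ((unsigned __int128)a.hi << 64) | a.lo;
    unsigned __int128 y = ((unsigned __int128)b.hi << 64) | b.lo;
    unsigned __int128 s = x + y;
    u128_limbs r = { (uint64_t)s, (uint64_t)(s >> 64) };
    return r;
}
```

The second half cannot start until the carry from the first is known, so the two adds form a short dependency chain on top of the extra loads and stores.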

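Furthermore, RDTSC might not even return the actual cycle count. Some CPUs have a variable clock rate but still keep the TSC ticking at the same rate, no matter how fast or slow the CPU is actually running, because the operating system uses the TSC for timekeeping. Others don't.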
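Whether a given CPU keeps the TSC at a constant rate can be queried through CPUID. The following is a small sketch, assuming GCC or Clang on x86 and their `<cpuid.h>` helper `__get_cpuid`; the function name `has_invariant_tsc` is made up here. It reads the "invariant TSC" flag, bit 8 of EDX in CPUID leaf 0x80000007.

```c
#include <cpuid.h>   /* __get_cpuid (GCC/Clang, x86 only) */

/* Returns 1 if the CPU reports an invariant TSC, 0 otherwise
   (including when the extended leaf is not available). */
int has_invariant_tsc(void) {
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(0x80000007u, &eax, &ebx, &ecx, &edx))
        return 0;
    return (int)((edx >> 8) & 1u);
}
```

If this returns 1, differences between two RDTSC readings are better read as elapsed reference time than as core clock cycles, since the core frequency may have changed in between.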

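So you are not measuring what you think you are measuring, and whoever told you those things was either oversimplifying or hasn't looked at CPU documentation in two decades.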

c

gcc

rdtsc