Execution time of C code that runs the same inline assembly multiple times grows exponentially

Question

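The C code below should simply execute the same assembly code p times, and each run of that assembly should in turn only count the rcx register down from 16 to 0 in sixteen loop iterations.

When p is small the program finishes quickly, but when p is large (say p = 16) its execution time grows exponentially.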

#include <stdio.h>
#include <stdlib.h>

int main() {
    int p = 16;
    int i;
    for(i=0; i<p; i++) { 
        int c = 16;
        __asm__(
            "mov %[c], %%rcx \n"
            "loop: \n" 
                "sub $1, %%rcx \n"
                "jnz loop \n"
            : 
            : [c]"m" (c)
            : "rcx"
        );
    }
    return 0;
}

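Strangely, when I add a few lines to measure the execution time, the program finishes as quickly as expected, with no exponential growth at all: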

#include <stdio.h>
#include <stdlib.h>
#include <time.h> //added

int main() {
    int p = 16;
    int i;
    clock_t start, end; //added
    start = clock(); //added
    for(i=0; i<p; i++) { 
        int c = 16;
        __asm__(
            "mov %[c], %%rcx \n"
            "loop: \n" 
                "sub $1, %%rcx \n"
                "jnz loop \n"
            : 
            : [c]"m" (c)
            : "rcx"
        );
    }
    end = clock(); //added
    float time = (float)(end - start)/CLOCKS_PER_SEC; //added
    printf("Time spent: %f\n", time); //added
    return 0;
}

How can I avoid this kind of problem?

Answer 1

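You have mov %[c], %%rcx, but c is only an int. A mov into RCX is a 64-bit load, so it reads eight bytes starting at c. If the four bytes that follow c in memory happen to be nonzero, your asm loop will execute billions of iterations instead of just 16.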

Change c to long int (or int64_t, for portability to systems where long is not 64 bits), or use mov %[c], %%ecx to zero-extend into RCX, or movsxd %[c], %%rcx to sign-extend.
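As a minimal sketch of these three options, assuming GCC or Clang on x86-64: the function names and the iteration counter n are hypothetical additions so that the trip count can be checked, and the sign-extending variant is written as movslq, the usual AT&T spelling of movsxd.

```c
#include <stdint.h>

/* Hypothetical helpers: each returns how many times the asm loop ran. */

/* Option 1: make c 64 bits wide, so the 64-bit load reads only c itself. */
static uint64_t count_with_wide_c(void) {
    int64_t c = 16;
    uint64_t n = 0;                 /* iteration counter, for checking only */
    __asm__ volatile(
        "mov %[c], %%rcx \n"
        "0: \n"
        "add $1, %[n] \n"
        "sub $1, %%rcx \n"
        "jnz 0b \n"
        : [n] "+r" (n)
        : [c] "m" (c)
        : "rcx", "cc");
    return n;
}

/* Option 2: keep int c, load 32 bits into ECX (zero-extends into RCX). */
static uint64_t count_with_zero_extend(void) {
    int c = 16;
    uint64_t n = 0;
    __asm__ volatile(
        "mov %[c], %%ecx \n"
        "0: \n"
        "add $1, %[n] \n"
        "sub $1, %%rcx \n"
        "jnz 0b \n"
        : [n] "+r" (n)
        : [c] "m" (c)
        : "rcx", "cc");
    return n;
}

/* Option 3: keep int c, sign-extend the 32-bit value into RCX. */
static uint64_t count_with_sign_extend(void) {
    int c = 16;
    uint64_t n = 0;
    __asm__ volatile(
        "movslq %[c], %%rcx \n"     /* AT&T mnemonic for movsxd */
        "0: \n"
        "add $1, %[n] \n"
        "sub $1, %%rcx \n"
        "jnz 0b \n"
        : [n] "+r" (n)
        : [c] "m" (c)
        : "rcx", "cc");
    return n;
}
```

In all three variants the loop runs exactly 16 times regardless of what is stored next to c, because no load reads past the end of the variable.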

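In fact, there is no particular need to load rcx from memory yourself; let the compiler do it for you by creating an input/output operand with the c constraint. Starting an asm template with a mov is inefficient.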

        unsigned long c = 16;
        __asm__ volatile(
            "0: \n" 
                "sub $1, %%rcx \n"
                "jnz 0b \n"
            : "+c" (c));  // "c" forces the compiler to pick RCX

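Note that volatile is now needed, because the asm now has an output operand that is never used afterwards, so the compiler might otherwise optimize away the whole block. (This was also a concern with your original code, except that there is a special exception for asm statements with no output operands at all: they are implicitly volatile. I prefer not to rely on this exception, because it is hard to remember exactly when it applies, and it is easy to accidentally change the code so that it no longer does. Use volatile whenever it would be unacceptable for the asm block to be deleted.)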
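For context, here is a sketch of how that snippet might sit inside the original outer loop, assuming GCC or Clang on x86-64. The function name run_countdowns and the leftover accumulator are hypothetical; they exist only so that the final value of c can be observed.

```c
/* Hypothetical wrapper: runs the countdown p times and returns the sum of
   the values left in c afterwards (each countdown should leave 0). */
static unsigned long run_countdowns(int p) {
    unsigned long leftover = 0;
    for (int i = 0; i < p; i++) {
        unsigned long c = 16;
        __asm__ volatile(
            "0: \n"
            "sub $1, %%rcx \n"
            "jnz 0b \n"
            : "+c" (c)      /* compiler loads c into RCX and reads it back */
            :
            : "cc");        /* the loop modifies the flags */
        leftover += c;
    }
    return leftover;
}
```

Because c is read after the asm here, the block would not be removed as dead code even without volatile; the qualifier is kept to match the advice above for cases where the output goes unused.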

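I also used a local label, so that the code will still assemble correctly if the compiler decides to unroll the loop; a named label such as loop: would then be defined more than once.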

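Instead of hard-coding %rcx, you could use a "+r" constraint and dec %[c] in the loop to let the compiler choose your count register. With int c it will pick a 32-bit register such as EAX or ECX rather than RCX.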
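A sketch of that "+r" variant, assuming GCC or Clang on x86-64. The function name and the iteration counter n are hypothetical and only make the trip count checkable.

```c
/* Hypothetical helper: counts `count` down to zero in a register chosen by
   the compiler and returns the number of iterations executed.
   The caller must pass count >= 1; with 0 the counter would wrap around. */
static unsigned count_in_any_register(unsigned count) {
    unsigned n = 0;                 /* iteration counter, for checking only */
    __asm__ volatile(
        "0: \n"
        "add $1, %[n] \n"
        "dec %[c] \n"               /* operand size follows the register */
        "jnz 0b \n"
        : [c] "+r" (count), [n] "+r" (n)
        :
        : "cc");
    return n;
}
```

Since no specific register is named, there is no clobber list entry for RCX, and the compiler is free to keep both operands wherever suits the surrounding code.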
