System V ABI - AMD64 - Stack alignment in GCC-emitted assembly

Question

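For the C code below, GCC x86-64 10.2 on Compiler Explorer emits the assembly pasted further below.

One of the instructions is subq $40, %rsp. Why doesn't subtracting 40 bytes from %rsp misalign the stack? My understanding is: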

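Right before call foo, the stack is 16-byte aligned;
call foo pushes an 8-byte return address onto the stack, so the stack becomes misaligned;
but pushq %rbp at the start of foo pushes another 8 bytes, so the stack is 16-byte aligned again;
so the stack is 16-byte aligned right before subq $40, %rsp. Decreasing %rsp by 40 bytes must therefore break the alignment?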
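The arithmetic in these steps can be traced with a small sketch. It assumes, as the question does, that %rsp is 16-byte aligned just before call foo; the helper names misalignment16 and trace_foo_prologue are hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: how far an address sits above a 16-byte boundary. */
unsigned misalignment16(uint64_t rsp) {
    return (unsigned)(rsp % 16);
}

/* Follows %rsp through the steps listed in the question and returns the
   misalignment left after the frame allocation (subq $frame_bytes, %rsp). */
unsigned trace_foo_prologue(uint64_t rsp_before_call, uint64_t frame_bytes) {
    /* Assumption taken from the question: aligned right before call foo. */
    assert(misalignment16(rsp_before_call) == 0);

    uint64_t rsp = rsp_before_call;
    rsp -= 8;                          /* call foo pushes the return address */
    rsp -= 8;                          /* pushq %rbp */
    assert(misalignment16(rsp) == 0);  /* aligned again, as the question says */

    rsp -= frame_bytes;                /* subq $frame_bytes, %rsp */
    return misalignment16(rsp);
}
```

Under that assumption, a 40-byte frame leaves %rsp 8 bytes off a 16-byte boundary, while a 48-byte frame keeps it aligned.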

Obviously, GCC emits valid assembly with respect to stack alignment, so I must be missing something.

(I tried replacing GCC with Clang, and Clang emits subq $48, %rsp, just as I would intuitively expect.)

So, what am I missing in the GCC-generated assembly? How does it keep the stack 16-byte aligned?

int bar(int i) { return i; }
int foo(int p0, int p1, int p2, int p3, int p4, int p5, int p6) {
    int sum = p0 + p1 + p2 + p3 + p4 + p5 + p6;
    return bar(sum);
}
int main() {
    return foo(0, 1, 2, 3, 4, 5, 6);
}

bar:
        pushq   %rbp
        movq    %rsp, %rbp
        movl    %edi, -4(%rbp)
        movl    -4(%rbp), %eax
        popq    %rbp
        ret
foo:
        pushq   %rbp
        movq    %rsp, %rbp
        subq    $40, %rsp
        movl    %edi, -20(%rbp)
        movl    %esi, -24(%rbp)
        movl    %edx, -28(%rbp)
        movl    %ecx, -32(%rbp)
        movl    %r8d, -36(%rbp)
        movl    %r9d, -40(%rbp)
        movl    -20(%rbp), %edx
        movl    -24(%rbp), %eax
        addl    %eax, %edx
        movl    -28(%rbp), %eax
        addl    %eax, %edx
        movl    -32(%rbp), %eax
        addl    %eax, %edx
        movl    -36(%rbp), %eax
        addl    %eax, %edx
        movl    -40(%rbp), %eax
        addl    %eax, %edx
        movl    16(%rbp), %eax
        addl    %edx, %eax
        movl    %eax, -4(%rbp)
        movl    -4(%rbp), %eax
        movl    %eax, %edi
        call    bar
        leave
        ret
main:
        pushq   %rbp
        movq    %rsp, %rbp
        pushq   $6
        movl    $5, %r9d
        movl    $4, %r8d
        movl    $3, %ecx
        movl    $2, %edx
        movl    $1, %esi
        movl    $0, %edi
        call    foo
        addq    $8, %rsp
        leave
        ret

Answer 1

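The purpose of the 16-byte alignment is that functions called at any level below the current one need not worry about aligning their own stack locals, should those locals require alignment.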

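Without the ABI guarantee, every function that needed aligned locals would have to AND the stack pointer with a mask to ensure proper alignment, something like: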

andq    $0xfffffffffffffff0, %rsp
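The effect of that mask can be sketched in C. This is only an illustration of the bit arithmetic, not something the compiler emits; the function name align_down is hypothetical. For a 16-byte alignment the mask ~(16 - 1) is exactly 0xfffffffffffffff0 on a 64-bit value.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: round an address down to a multiple of `alignment`,
   which must be a nonzero power of two. Clearing the low bits is what
   ANDing the stack pointer with the mask does. */
uint64_t align_down(uint64_t addr, uint64_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return addr & ~(alignment - 1);
}
```

Because the stack grows downward, rounding %rsp down never overlaps data already pushed; it only wastes up to 15 bytes.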

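However, there is no reason for that to be necessary in this specific case. bar() is a leaf function, so the compiler has full knowledge of every alignment requirement at or below its level (it has no local variables and calls no functions, so there are no requirements).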

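foo() has no requirements from below either, since the only thing it calls is bar(). GCC also appears to decide that foo()'s own locals do not need that level of alignment.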

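Even if bar() or foo() is called from outside the current translation unit (which is possible, since they are not marked static), that does not change the fact that they need no alignment.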

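It would be different if, for example, bar were in a separate translation unit, or if it called other functions that the compiler could not prove needed no alignment.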

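In that case gcc would not have full knowledge of the alignment requirements. Indeed, if you comment out the definition of bar on godbolt (effectively hiding the definition), you see that line change: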

// int bar(int i) { return i; }
   --> subq $48, %rsp             ; no longer $40

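As an aside, although 16-byte alignment is technically not necessary in this case, I believe this behavior may invalidate claims that gcc conforms to the System V AMD64 ABI. Nothing in that ABI appears to allow this deviation; the text (PDF) states (slightly paraphrased):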

The end of the input argument area shall be aligned on a 16 (or 32 if __m256 is passed on stack) byte boundary. In other words, the value %rsp + 8 is always a multiple of 16 (or 32) when control is transferred to the function entry point. The stack pointer %rsp always points to the end of the latest allocated stack frame.

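There appears to be no leeway for interpretation here that would make the observed behavior compliant, even though it is known not to cause problems in this case.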
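The quoted condition can be written as a predicate and applied to the two frame sizes discussed above. This is a sketch on plain integers, assuming foo itself is entered with an ABI-conforming %rsp; the names entry_rsp_ok and rsp_at_bar_entry are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* The ABI condition quoted above: at a function's entry point,
   %rsp + 8 is a multiple of 16. */
bool entry_rsp_ok(uint64_t rsp_at_entry) {
    return (rsp_at_entry + 8) % 16 == 0;
}

/* Hypothetical helper: %rsp as seen at bar's entry, given %rsp at foo's
   entry and the size of the frame foo allocates with subq. */
uint64_t rsp_at_bar_entry(uint64_t rsp_at_foo_entry, uint64_t foo_frame_bytes) {
    uint64_t rsp = rsp_at_foo_entry;
    rsp -= 8;                /* pushq %rbp */
    rsp -= foo_frame_bytes;  /* subq $N, %rsp */
    rsp -= 8;                /* call bar pushes the return address */
    return rsp;
}
```

With a 40-byte frame the predicate is false at bar's entry, which is the deviation described above; with a 48-byte frame it holds.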

Whether anyone considers this important enough to worry about is outside the scope of this answer; I make no judgement on that :-)

Tags: assembly, stack, x86-64, calling-convention, memory-alignment