Why does GCC allocate more stack memory than needed?

Question

I am reading "Computer Systems: A Programmer's Perspective, 3/E" (CS:APP3e), and the following code is an example from the book:

long call_proc() {
    long  x1 = 1;
    int   x2 = 2;
    short x3 = 3;
    char  x4 = 4;
    proc(x1, &x1, x2, &x2, x3, &x3, x4, &x4);
    return (x1+x2)*(x3-x4);
}

The book gives the assembly code generated by GCC:

# long call_proc()
call_proc:
    # Set up arguments to proc
    subq    $32, %rsp           # Allocate 32-byte stack frame
    movq    $1, 24(%rsp)        # Store 1 in &x1
    movl    $2, 20(%rsp)        # Store 2 in &x2
    movw    $3, 18(%rsp)        # Store 3 in &x3
    movb    $4, 17(%rsp)        # Store 4 in &x4
    leaq    17(%rsp), %rax      # Create &x4
    movq    %rax, 8(%rsp)       # Store &x4 as argument 8
    movl    $4, (%rsp)          # Store 4 as argument 7
    leaq    18(%rsp), %r9       # Pass &x3 as argument 6
    movl    $3, %r8d            # Pass 3 as argument 5
    leaq    20(%rsp), %rcx      # Pass &x2 as argument 4
    movl    $2, %edx            # Pass 2 as argument 3
    leaq    24(%rsp), %rsi      # Pass &x1 as argument 2
    movl    $1, %edi            # Pass 1 as argument 1
    # Call proc
    call    proc
    # Retrieve changes to memory
    movslq  20(%rsp), %rdx      # Get x2 and convert to long
    addq    24(%rsp), %rdx      # Compute x1+x2
    movswl  18(%rsp), %eax      # Get x3 and convert to int
    movsbl  17(%rsp), %ecx      # Get x4 and convert to int
    subl    %ecx, %eax          # Compute x3-x4
    cltq                        # Convert to long
    imulq   %rdx, %rax          # Compute (x1+x2) * (x3-x4)
    addq    $32, %rsp           # Deallocate stack frame
    ret                         # Return

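I can follow this code: the compiler allocates 32 bytes of space on the stack, of which the first 16 bytes hold the arguments passed to proc and the last 16 bytes hold the 4 local variables.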

Then I tested this code with GCC 11.2, using the optimization flag -Og, and got this assembly code:

call_proc():
        subq    $24, %rsp
        movq    $1, 8(%rsp)
        movl    $2, 4(%rsp)
        movw    $3, 2(%rsp)
        movb    $4, 1(%rsp)
        leaq    1(%rsp), %rax
        pushq   %rax
        pushq   $4
        leaq    18(%rsp), %r9
        movl    $3, %r8d
        leaq    20(%rsp), %rcx
        movl    $2, %edx
        leaq    24(%rsp), %rsi
        movl    $1, %edi
        call    proc(long, long*, int, int*, short, short*, char, char*)
        movslq  20(%rsp), %rax
        addq    24(%rsp), %rax
        movswl  18(%rsp), %edx
        movsbl  17(%rsp), %ecx
        subl    %ecx, %edx
        movslq  %edx, %rdx
        imulq   %rdx, %rax
        addq    $40, %rsp
        ret

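I noticed that GCC first allocates 24 bytes for the 4 local variables. It then uses pushq to add the 2 stack arguments, so the code ends with addq $40, %rsp to free the stack space.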

Compared with the book's code, GCC allocates 8 more bytes of space here, and the extra space does not appear to be used. Why does it need the extra space?

Answer 1

(This answer is a summary of comments posted above by Antti Haapala, klutt, and Peter Cordes.)

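GCC allocates more space than is strictly "necessary" to ensure that the stack is properly aligned for the call to proc: the stack pointer must be adjusted by a multiple of 16 plus 8 (that is, by an odd multiple of 8). On entry to call_proc, the return address pushed by the caller's call instruction leaves %rsp 8 bytes off a 16-byte boundary, so a total adjustment of 24 + 16 = 40 bytes restores alignment, whereas 32 bytes does not.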

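Oddly, the book's code does not do this. The code as shown would violate the ABI, and if proc actually relies on proper stack alignment (for example, by using aligned SSE2 instructions), it may crash.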

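So it appears that either the book's code was copied incorrectly from compiler output, or the book's authors used some unusual compiler flags that change the ABI.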

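Modern GCC 11.2 emits nearly identical asm (Godbolt) when run with -Og -mpreferred-stack-boundary=3 -maccumulate-outgoing-args. The first of these options changes the ABI to maintain only 2^3-byte stack alignment, down from the default 2^4. (Code compiled this way cannot safely call anything compiled normally, not even standard library functions.) -maccumulate-outgoing-args used to be the default in older GCC versions, but modern CPUs have a "stack engine" that makes push/pop single-uop, so the option is no longer the default; using push for stack args saves a little code size.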

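One difference from the book's asm is a movl $0, %eax before the call: because there is no prototype, the caller must assume proc might be variadic and pass AL = the number of FP args in XMM registers. (A prototype matching the args passed would prevent this.) The other instructions are all the same, and in the same order, as in whatever older GCC version the book used, except for the choice of registers after call proc returns: it ends up using movslq %edx, %rdx instead of cltq (sign-extend within RAX).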

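The CS:APP 3e Global Edition has practice problems introduced by the publisher (not the authors), some of them flawed, but apparently this code also appears in the North American edition. So this may be the authors' mistake, or their choice to use actual compiler output produced with odd options. Unlike some of the bad Global Edition practice problems, this code could have come from some GCC version, but only with non-standard options.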

Related: Why does GCC allocate more space than necessary on the stack, beyond what's needed for alignment? GCC has a missed-optimization bug where it sometimes reserves an extra 16 bytes that it does not actually need. That is not what is happening here, though.

c

gcc

x86-64

compiler-optimization

stack-memory