Assembly segmentation fault during retq

Question

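I have some assembly code in which one routine calls another using callq. When the callee executes retq, the program crashes with a segmentation fault.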

    .globl  main
main:                   # def main():
    pushq   %rbp        #
    movq    %rsp, %rbp  #

    callq   input       # get input
    movq    %rax, %r8

    callq   r8_digits_to_stack
    # program is not getting here before the segmentation fault
    jmp     exit_0

# put the binary digits of r8 on the stack, last digit first (lowest)
# uses: rcx, rbx
r8_digits_to_stack:
    movq    %r8, %rax       # copy for popping digits off

    loop_digits_to_stack:
        cmpq    $0, %rax    # if our copy is zero, we're done!
        jle     return

        movq    %rax, %rcx  # make another copy to extract digit with
        andq    $1, %rcx    # get last digit
        pushq   %rcx        # push last digit to stack
        sarq    %rax        # knock off last digit for next loop
        jmp     loop_digits_to_stack

# return from wherever we were last called
return:
    retq

# exit with code 0
exit_0:
    movq    $0, %rax    # return 0
    popq    %rbp
    retq

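where input is a C function that returns keyboard input in %rax.

I think this might have something to do with the fact that I'm manipulating the stack. Is that right?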

Answer 1

I think one of your return paths doesn't pop rbp. Just leave out the

pushq   %rbp
movq    %rsp, %rbp

pop     %rbp

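altogether. gcc's default is -fomit-frame-pointer.

Or fix your non-return-zero path to also pop rbp.

Actually, you're screwed, because your function appears to be designed to put stuff on the stack and never take it off. retq pops whatever is on top of the stack and jumps to it, and here that is the last digit you pushed, not the return address. If you want to invent your own ABI where space below the stack pointer can be used to return arrays, that's interesting, but you'll have to keep track of how big they are so you can adjust rsp back to pointing at the return address before ret.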
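
A toy C model of the stack can make this concrete. It is only a sketch, not real machine behavior: the array stack_mem, the index sp, the constant RETURN_ADDRESS, and all function names are made up for illustration.

```c
#include <stdint.h>

/* Toy model of the x86-64 stack. All names here are hypothetical. */
#define STACK_SLOTS 128
#define RETURN_ADDRESS 0x401136ull  /* made-up address of the insn after callq */

static uint64_t stack_mem[STACK_SLOTS];
static int sp = STACK_SLOTS;        /* grows toward lower indices, like %rsp */

static void push(uint64_t v) { stack_mem[--sp] = v; }
static uint64_t pop(void) { return stack_mem[sp++]; }

/* Models r8_digits_to_stack: one slot per bit, lowest bit pushed first.
   Returns how many slots were pushed. */
static int digits_to_stack(uint64_t value) {
    int pushed = 0;
    while (value != 0) {
        push(value & 1);
        value >>= 1;
        pushed++;
    }
    return pushed;
}

/* What retq would jump to if it ran right after the pushes. */
uint64_t ret_target_without_cleanup(uint64_t value) {
    sp = STACK_SLOTS;
    push(RETURN_ADDRESS);           /* callq pushes the return address */
    digits_to_stack(value);
    return pop();                   /* retq pops whatever is on top */
}

/* Same, but the stack pointer is moved back above the digits first. */
uint64_t ret_target_with_cleanup(uint64_t value) {
    sp = STACK_SLOTS;
    push(RETURN_ADDRESS);
    int pushed = digits_to_stack(value);
    sp += pushed;                   /* like adding 8*pushed to %rsp */
    return pop();
}
```

In this model, for any nonzero input the slot on top is the highest set bit, so the simulated retq would jump to address 1. For an input of 0 nothing is pushed and the return works, which matches the zero case of the original code.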

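I recommend against loading the return address into a register and replacing the later ret with jmp *%rdx or something. That throws off the call/return address prediction logic in modern CPUs and causes a stall, the same as a branch mispredict. (See http://agner.org/optimize/.) CPUs hate mismatched call/ret. I can't find a specific page to link about that right now.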

See https://whosebug.com/tags/x86/info for other useful resources, including ABI documentation on how functions normally take args.

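You could copy the return address down to below the array you just pushed, and then run ret, to return with %rsp modified. But unless you need to call a long function from multiple call sites, it's probably better to just inline it at the one or two call sites.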

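If it's too big to inline at that many call sites, your best bet, instead of using call and copying the return address down to the new location, is to emulate call and ret. The caller does: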

    # put args in some registers
    lea   .ret_location(%rip), %rbx
    jmp   my_weird_helper_function
.ret_location:  # in NASM/YASM, labels starting with . are local labels, and don't show up in the object file.
         # GNU assembler might only treat symbols starting with .L that way.
    ...


my_weird_helper_function:
    # use args, potentially modifying the stack
    jmp *%rbx   # return

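You'd need a really good reason to use something like this. And you'd have to justify and explain it with a lot of comments, because it's not what readers will expect. First of all, what are you going to do with this array that you pushed onto the stack? Are you going to find its length by subtracting rsp from rbp or something?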

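It's interesting to note that even though push has to modify rsp as well as do a store, it has a throughput of one per clock on all recent CPUs. Intel CPUs have a stack engine so that stack ops don't have to wait for rsp to be computed in the out-of-order engine when it has only been changed by push/pop/call/ret. (Mixing push/pop with mov 4(%rsp), %rax or whatever results in extra uops being inserted to sync the OOO engine's rsp with the stack engine's offset.) Intel/AMD CPUs can only do one store per clock anyway, but Intel SnB and later can pop twice per clock.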

So push/pop is actually not a terrible way to implement a stack data structure, especially on Intel.

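Also, your code is structured weirdly. main() is split around r8_digits_to_stack. That's fine, but you never take advantage of falling through from one block into the next, so it just costs you an extra jmp in main for no benefit and a huge readability downside.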

Let's pretend your loop is part of main, since I already talked about how super-weird it is to have a function return with %rsp modified.

Your loop could be simpler, too. When possible, structure things with a jcc at the bottom that jumps back to the top.

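There's a small benefit to avoiding the upper registers (%r8 through %r15): 32-bit insns with the classic registers don't need a REX prefix byte. So let's pretend we just have our starting value in %rax.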

digits_to_stack:
# put each bit of %rax into its own 8 byte element on the stack for maximum space-inefficiency

    movq   %rax, %rdx  # save a copy

    xor    %ecx, %ecx  # setcc is only available for byte operands, so zero %rcx

    # need a test at the top after transforming while() into do{}while
    test   %rax, %rax  # fewer insn bytes to test for zero this way
    jz  .Lend

    # Another option can be to jmp to the test at the end of the loop, to begin the first iteration there.

.align 16
.Lpush_loop:
    shr   $1, %rax  # shift the low bit into CF, set ZF based on the result
    setc  %cl       # set %cl to 0 or 1, based on the carry flag
    # movzbl %cl, %ecx  # zero-extend
    pushq %rcx
      #.Lfirst_iter_entry
      # test %rax, %rax   # not needed, flags still set from shr
    jnz  .Lpush_loop
.Lend:
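
The loop shape used above (a zero test before the loop, then a do{}while with the branch at the bottom) can be sketched in C. This is a hedged model rather than a translation of the assembly: the function name bits_low_first and the output array are hypothetical and stand in for the pushes.

```c
#include <stddef.h>
#include <stdint.h>

/* Writes one element per bit of value into out, lowest bit first, stopping
   after the highest set bit. Returns the number of elements written.
   out needs room for 64 entries. Hypothetical helper for illustration. */
size_t bits_low_first(uint64_t value, uint64_t out[64]) {
    size_t n = 0;
    if (value == 0)                  /* test %rax, %rax / jz .Lend */
        return 0;
    do {
        uint64_t carry = value & 1;  /* the bit that shr moves into CF */
        value >>= 1;                 /* shr $1, %rax (also sets ZF) */
        out[n++] = carry;            /* setc %cl / pushq %rcx */
    } while (value != 0);            /* jnz .Lpush_loop */
    return n;
}
```

One difference from the assembly: pushes go to descending addresses, so on the real stack the lowest bit ends up at the highest address, while this model stores it at index 0.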

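This version still kind of sucks, because on the Intel P6 / SnB CPU families, using a wider register after writing a smaller part of it leads to a slowdown (a stall on pre-SnB, or an extra uop on SnB and later). Others, including AMD and Silvermont, don't track partial registers separately, so writing to %cl has a dependency on the previous value of %rcx. (Writing to a 32-bit reg zeros the upper 32 bits, which avoids the partial-reg dependency problem.) A movzx to zero-extend from byte to long would do what Sandybridge does implicitly, and give a speedup on older CPUs.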

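This won't quite run at one cycle per iteration on Intel, but it might on AMD. mov/and is not bad, but and affects the flags, which makes it harder to have the loop branch just on the flags set by shr.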

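Note that the sarq %rax in your old version shifts in copies of the sign bit, not necessarily zeros, so a negative value never shifts down to zero (it sticks at -1). Your jle check happens to exit immediately for negative input, so no digits get pushed at all. With a plain zero test like the one above, sar would make the loop infinite for negative input, and it would segfault when you run out of stack space (a push would try to write to an unmapped page). shr shifts in zeros and doesn't have this problem.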
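
The difference between the two shifts can be modeled in C. This is only a sketch: shr1, sar1, and count_shifts are hypothetical names, and register contents are held in a uint64_t so that the arithmetic shift can be emulated portably. The loop exits only when the value is zero, like the jnz loop above.

```c
#include <stdint.h>

#define SIGN_BIT 0x8000000000000000ull

/* Logical shift right by one: a zero is shifted in at the top (like shr). */
static uint64_t shr1(uint64_t bits) { return bits >> 1; }

/* Arithmetic shift right by one: the top bit is kept (like sar). */
static uint64_t sar1(uint64_t bits) { return (bits >> 1) | (bits & SIGN_BIT); }

/* Counts how many shifts it takes for bits to become zero.
   Returns -1 if that has not happened after limit shifts. */
int count_shifts(uint64_t bits, int use_sar, int limit) {
    int n = 0;
    while (bits != 0) {
        if (n == limit)
            return -1;
        bits = use_sar ? sar1(bits) : shr1(bits);
        n++;
    }
    return n;
}
```

With the logical shift, any input reaches zero within 64 shifts. With the arithmetic shift, a value whose top bit is set settles at all ones, so the model gives up at the limit instead of terminating.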
