Printing an integer as a string with AT&T syntax, with Linux system calls instead of printf

Question

I have written an assembly program, in AT&T syntax, to display the factorial of a number, but it's not working. Here is my code:

.text 

.globl _start

_start:
movq $5,%rcx
movq $5,%rax


Repeat:                     #function to calculate factorial
   decq %rcx
   cmp $0,%rcx
   je print
   imul %rcx,%rax
   cmp $1,%rcx
   jne Repeat
# Now result of factorial stored in rax
print:
     xorq %rsi, %rsi

  # function to print integer result digit by digit by pushing in 
       #stack
  loop:
    movq $0, %rdx
    movq $10, %rbx
    divq %rbx
    addq $48, %rdx
    pushq %rdx
    incq %rsi
    cmpq $0, %rax
    jz   next
    jmp loop

  next:
    cmpq $0, %rsi
    jz   bye
    popq %rcx
    decq %rsi
    movq $4, %rax
    movq $1, %rbx
    movq $1, %rdx
    int  $0x80
    addq $4, %rsp
    jmp  next
bye:
movq $1,%rax
movq $0, %rbx
int  $0x80


.data
   num : .byte 5

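This program prints nothing. I also stepped through it in gdb: it works fine up to the loop function, but once it reaches next, random values start showing up in various registers. Please help me debug it so that it prints the factorial.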

Answer 1

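A few things:

0) I guess this is a 64-bit Linux environment, but you should have said so (if it is not, some of my points will be invalid).

1) int 0x80 is the 32-bit call, but you are using 64-bit registers, so you should use syscall (and different arguments).

2) int 0x80 with eax=4 requires ecx to contain the address of the memory where the content is stored, while you are giving it an ASCII character in ecx = illegal memory access (the first call should return an error, i.e. eax holds a negative value). Running strace <your binary> should also show the wrong arguments and the error returned.

3) Why addq $4, %rsp? It makes no sense to me. You are corrupting rsp, so the next pop rcx will pop a wrong value, and in the end you will run way "up" into the stack.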

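...maybe more. I didn't debug it; this list comes just from reading the source (so I may even be wrong about something, although that would be rare).

BTW, your code is working. It just doesn't do what you expected. It works fine, precisely as the CPU is designed and precisely as you wrote it. Whether it achieves what you wanted, or makes sense, is a different topic, but don't blame the hardware or the assembler.

...I can make a quick guess at how the routine may be fixed (just a partial hack-fix; it still needs a rewrite to use syscall under 64-bit Linux):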

  next:
    cmpq $0, %rsi
    jz   bye
    movq %rsp,%rcx    # make ecx point to stack memory (with the stored char)
      # this will work if you are lucky enough that rsp fits into 32b
      # if it is beyond 4GiB logical address, then you have bad luck (syscall needed)
    decq %rsi
    movq $4, %rax
    movq $1, %rbx
    movq $1, %rdx
    int  $0x80
    addq $8, %rsp     # now rsp += 8 is needed, because there's no POP
    jmp  next

I didn't try this myself, I just wrote it from my head, so let me know how it changes the situation.

Answer 2

As @ped7g points out, you're doing several things wrong: using the int 0x80 32-bit ABI in 64-bit code, and passing character values instead of pointers to the write() system call.
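To illustrate the second point in C terms, here is a minimal sketch. The helper name put_char is made up for this example. write() wants the address of the bytes, so a lone character has to sit in memory first and be passed by pointer:

```c
#include <unistd.h>

/* Hypothetical helper: send one character to a file descriptor.
   The second argument of write() is a POINTER to the data, so we pass
   the address of c. Passing the character code itself (for example 53
   for '5') would be interpreted as a bogus memory address. */
ssize_t put_char(int fd, char c)
{
    return write(fd, &c, 1);   /* returns 1 on success, -1 on error */
}
```

The asm in the question does the equivalent of passing c itself where &c is expected, which is why the kernel rejects the call.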

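Here's how to print an integer in x86-64 Linux, the simple and somewhat-efficient¹ way, using the same repeated division / modulo by 10.

System calls are expensive (probably thousands of cycles for write(1, buf, 1)), and doing a syscall inside the loop steps on registers, so it's inconvenient and clunky as well as inefficient. We should write the characters into a small buffer, in printing order (most-significant digit at the lowest address), and make a single write() system call on that.

But then we need a buffer. The maximum length of a 64-bit integer is only 20 decimal digits, so we can just use some stack space. In x86-64 Linux, we can use stack space below RSP (up to 128B) without "reserving" it by modifying RSP. This is called the red-zone. If you wanted to pass the buffer to another function instead of a system call, you would have to reserve space with sub $24, %rsp or something.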
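The same plan, sketched in C as an illustration only (the names format_u64_line, print_u64_line and U64_BUF_SIZE are invented for this example): a 21-byte local buffer holds at most 20 digits plus the newline, the digits are stored backwards from the end, and the whole string goes out with one write().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <unistd.h>

enum { U64_BUF_SIZE = 21 };   /* 20 digits for UINT64_MAX + 1 newline */

/* Fill the tail of buf with the decimal digits of value followed by '\n'.
   Returns a pointer to the first digit and stores the length in *len. */
char *format_u64_line(uint64_t value, char buf[U64_BUF_SIZE], size_t *len)
{
    char *end = buf + U64_BUF_SIZE;
    char *p = end;
    *--p = '\n';
    do {                                  /* runs at least once, so 0 prints "0" */
        *--p = (char)('0' + value % 10);
        value /= 10;
    } while (value != 0);
    *len = (size_t)(end - p);
    return p;
}

/* One system call for the whole string instead of one per digit. */
ssize_t print_u64_line(int fd, uint64_t value)
{
    char buf[U64_BUF_SIZE];
    size_t len;
    char *start = format_u64_line(value, buf, &len);
    assert(len <= U64_BUF_SIZE);
    return write(fd, start, len);
}
```

A C compiler decides for itself whether the local buffer lands in the red-zone; the point here is only the buffer size and the single write().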

Instead of hard-coding system-call numbers, using GAS makes it easy to use the constants defined in .h files. Note the mov $__NR_write, %eax near the end of the function. The x86-64 SystemV ABI passes system-call arguments in similar registers to the function-calling convention. (So it's totally different from the 32-bit int 0x80 ABI, which you shouldn't use in 64-bit code.)
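For comparison, C code can use named call numbers too. This is a small sketch, assuming glibc's syscall() wrapper and the SYS_write constant from <sys/syscall.h>; the function name raw_write is made up for the example:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/syscall.h>   /* SYS_write and friends, built from the __NR_* numbers */
#include <unistd.h>        /* syscall() */

/* Hypothetical helper: invoke the write system call by its named number
   rather than a hard-coded integer. */
long raw_write(int fd, const void *buf, size_t len)
{
    assert(buf != NULL);
    return syscall(SYS_write, fd, buf, len);
}
```

Note that glibc's syscall() returns -1 and sets errno on failure, whereas the raw syscall instruction leaves -errno in RAX.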

// building with  gcc foo.S  will use CPP before GAS so we can use headers
#include <asm/unistd.h>    // This is a standard Linux / glibc header file
      // includes unistd_64.h or unistd_32.h depending on current mode
      // Contains only #define constants (no C prototypes) so we can include it from asm without syntax errors.

.p2align 4
.globl print_uint64             #void print_uint64(uint64_t value)
print_uint64:
    lea   -1(%rsp), %rsi        # We use the 128B red-zone as a buffer to hold the string
                                # a 64-bit integer is at most 20 digits long in base 10, so it fits.

    movb  $'\n', (%rsi)         # store the trailing newline byte.  (Right below the return address).
    # If you need a null-terminated string, leave an extra byte of room and store '\n\0'.  Or  push $'\n'

    mov    $10, %ecx            # same as  mov $10, %rcx  but 2 bytes shorter
    # note that newline (\n) has ASCII code 10, so we could actually have stored the newline with  movb %cl, (%rsi) to save code size.

    mov    %rdi, %rax           # function arg arrives in RDI; we need it in RAX for div
.Ltoascii_digit:                # do{
    xor    %edx, %edx
    div    %rcx                  #  rax = rdx:rax / 10.  rdx = remainder

                                 # store digits in MSD-first printing order, working backwards from the end of the string
    add    $'0', %edx            # integer to ASCII.  %dl would work, too, since we know this is 0-9
    dec    %rsi
    mov    %dl, (%rsi)           # *--p = (value%10) + '0';

    test   %rax, %rax
    jnz  .Ltoascii_digit        # } while(value != 0)
    # If we used a loop-counter to print a fixed number of digits, we would get leading zeros
    # The do{}while() loop structure means the loop runs at least once, so we get "0\n" for input=0

    # Then print the whole string with one system call
    mov   $__NR_write, %eax     # call number from asm/unistd_64.h
    mov   $1, %edi              # fd=1
    # %rsi = start of the buffer
    mov   %rsp, %rdx
    sub   %rsi, %rdx            # length = one_past_end - start
    syscall                     # write(fd=1 /*rdi*/, buf /*rsi*/, length /*rdx*/); 64-bit ABI
    # rax = return value (or -errno)
    # rcx and r11 = garbage (destroyed by syscall/sysret)
    # all other registers = unmodified (saved/restored by the kernel)

    # we don't need to restore any registers, and we didn't modify RSP.
    ret

To test this function, I put this in the same file to call it and exit:

.p2align 4
.globl _start
_start:
    mov    $10120123425329922, %rdi
#    mov    $0, %edi    # Yes, it does work with input = 0
    call   print_uint64

    xor    %edi, %edi
    mov    $__NR_exit, %eax
    syscall                             # sys_exit(0)

I built this into a static binary (with no libc):

$ gcc -Wall -static -nostdlib print-integer.S && ./a.out 
10120123425329922
$ strace ./a.out  > /dev/null
execve("./a.out", ["./a.out"], 0x7fffcb097340 /* 51 vars */) = 0
write(1, "10120123425329922\n", 18)     = 18
exit(0)                                 = ?
+++ exited with 0 +++
$ file ./a.out 
./a.out: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), statically linked, BuildID[sha1]=69b865d1e535d5b174004ce08736e78fade37d84, not stripped

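Footnote 1: See Why does GCC use multiplication by a strange number in implementing integer division? for avoiding div r64 for division by 10, because that's very slow (21 to 83 cycles on Intel Skylake). A multiplicative inverse would make this function actually efficient, not just "somewhat". (But of course there would still be room for optimizations...)

Related: a Linux x86-32 extended-precision loop that prints 9 decimal digits from each 32-bit "limb": see .toascii_digit: in my Extreme Fibonacci code-golf answer. It's optimized for code-size (even at the expense of speed), but well-commented.

It uses div like you do, because that's smaller than using a fast multiplicative inverse. It uses loop for the outer loop (over multiple integers for extended precision), again for code-size at the cost of speed.

It uses the 32-bit int 0x80 ABI, and prints into a buffer that was holding the "old" Fibonacci value, not the current one.

Another way to get efficient asm is from a C compiler. For just the loop over digits, look at what gcc or clang produce for this C source (which is basically what the asm is doing). The Godbolt Compiler Explorer makes it easy to try different options and different compiler versions.

See gcc7.2 -O3 asm output, which is nearly a drop-in replacement for the loop in print_uint64 (because I chose the args to go in the same registers):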

void itoa_end(unsigned long val, char *p_end) {
  const unsigned base = 10;
  do {
    *--p_end = (val % base) + '0';
    val /= base;
  } while(val);

  // write(1, p_end, orig-current);
}

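I tested performance on a Skylake i7-6700k by commenting out the syscall instruction and putting a repeat loop around the function call. The version with mul %rcx / shr $3, %rdx is about 5 times faster than the version with div %rcx for storing a long number-string (10120123425329922) into a buffer. The div version ran at 0.25 instructions per clock, while the mul version ran at 2.65 instructions per clock (although requiring many more instructions).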
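As a sketch of what the mul / shr $3 pair computes, here is the same arithmetic in C. It assumes a compiler with the unsigned __int128 extension (gcc and clang on x86-64 have it); the names div10 and mod10 are made up for this example. The constant 0xCCCCCCCCCCCCCCCD is ceil(2^67 / 10), so taking the high 64 bits of the product and shifting right by 3 gives floor(x / 10) for every 64-bit x.

```c
#include <stdint.h>

/* floor(x / 10) without a div instruction. */
uint64_t div10(uint64_t x)
{
    /* Full 128-bit product; `mul %rcx` leaves its high half in RDX. */
    unsigned __int128 product = (unsigned __int128)x * 0xCCCCCCCCCCCCCCCDULL;
    uint64_t high = (uint64_t)(product >> 64);
    return high >> 3;                     /* the  shr $3, %rdx  step */
}

/* The digit itself is recovered with a multiply and a subtract. */
unsigned mod10(uint64_t x)
{
    return (unsigned)(x - 10 * div10(x));
}
```

This is also what gcc and clang emit on their own for val / 10 at -O2 and above, so in C there is normally no need to write it by hand.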

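It might be worth unrolling by 2, doing a divide by 100 and splitting up the remainder of that into 2 digits. That would give much better instruction-level parallelism, in case the simpler version bottlenecks on mul + shr latency. The chain of multiply/shift operations that brings val to zero would be half as long, with more work in each short independent dependency chain to handle a 0-99 remainder.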
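A rough C sketch of that unroll-by-2 idea follows. It is untuned and unbenchmarked, and the name itoa_end_by100 is invented for this example. Each iteration divides by 100 once, then splits the 0-99 remainder into two digits independently of the next iteration's division.

```c
#include <stdint.h>

/* Stores the decimal digits of val so that the last digit lands at
   p_end[-1]; returns a pointer to the first digit. */
char *itoa_end_by100(uint64_t val, char *p_end)
{
    while (val >= 100) {
        unsigned pair = (unsigned)(val % 100);   /* 0..99 */
        val /= 100;                              /* the long dependency chain */
        *--p_end = (char)('0' + pair % 10);      /* low digit of the pair */
        *--p_end = (char)('0' + pair / 10);      /* high digit of the pair */
    }
    /* One or two digits remain; avoid emitting a leading zero. */
    if (val >= 10) {
        *--p_end = (char)('0' + val % 10);
        val /= 10;
    }
    *--p_end = (char)('0' + val);
    return p_end;
}
```

Whether this actually beats the one-digit loop depends on the CPU and on what the compiler does with the small divisions, so it would need measuring.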

Related:

NASM version of this answer, for x86-64 or i386 Linux: How do I print an integer in Assembly Level Programming without printf from the c library?
- Base 16 is a power of 2, so the conversion is simpler and doesn't need div.
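To show why base 16 is simpler, here is a minimal C sketch (u64_to_hex is a name chosen for this example, and it prints a fixed 16 digits with leading zeros). Each hex digit is just 4 bits of the value, so a mask and a shift replace the division:

```c
#include <stdint.h>

/* Writes val as exactly 16 lowercase hex digits, most significant first.
   The output is not null-terminated. */
void u64_to_hex(uint64_t val, char out[16])
{
    static const char digits[] = "0123456789abcdef";
    for (int i = 15; i >= 0; i--) {
        out[i] = digits[val & 0xF];   /* low 4 bits select one hex digit */
        val >>= 4;                    /* a shift replaces the division by 16 */
    }
}
```

Because every digit depends only on its own 4 bits, the digits can also be extracted independently of each other, unlike the base-10 chain of divisions.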
