调试迭代斐波那契（从 C 手动转换为 RISC-V）

Question

我正在尝试学习汇编，所以我尝试将此 c 代码转换为 RISC-v 汇编代码：

int Fib_Iter(int x){

int Fib_f, Fib_s, temp, Result;
Fib_f = 0;
Fib_s = 1;
if (x == 0)
    Result = Fib_f;
else if (x == 1)
    Result = Fib_s;
else
{
    while (x >= 2)
    {
        temp = Fib_f + Fib_s;
        Fib_f = Fib_s;
        Fib_s = temp;
        x--;
    }
    Result = Fib_s;
}
return Result;
}

这是我的 RISC-V 汇编代码：

   fib_iter:
    #temp in x5,
    #f_f in x6
    #f_s in x7
    #x in x12
    #res in x11
    #x28 =1
    #x29=2
    addi sp,sp,-16
    sw x5,0(sp)
    sw x6,4(sp)
    sw x7,8(sp)
    sw x11,12(sp)              #saving x5,x6,x7,x11 in the stack
    addi x28,x0,1             #adding 1 to x28 to use it in the 2ndif statment
    addi x29,x0,2              #adding 2 to x28 to use it in the 3rd if statment
    bne x12,x0,lb1             #first if statment to check if x==0
    add x11,x6,x0               #if x !=0 make result(x11) equal to (fb_f)x6
    lb1:
    bne x12,x28,lb2           #2nd if statement to check if x==1
    add x11,x7,x0            #if x !=1 make result(x11) equal to (fb_s)x6
    lb2:                      #if it equals to 1 start the loop
    loop:
    add x5,x6,x7             #just an add's 
    addi x6,x7,0
    addi x7,x5,0
    addi x12,x12,-1
    bge x12,x29,loop       #check if x >2 if yes go to the loop else 
    addi x11,x7,0          #result(x11)=x7
    lw x5,0(sp)            #load all the registers from the stack
    lw x6,4(sp)
    lw x7,8(sp)
    lw x11,12(sp)
    addi sp,sp,16         #return the stack pointer to its original place
    jalr x0,0(x1)

但我在使用金星模拟器时没有在寄存器 11 中获得正确的值。

当我使用值 4 调用它时，我得到了 4，但正确答案是 3。

Answer 1

你有 C 语言的伪代码，太棒了。

评论：

在汇编的翻译中缺少一些语句。
if-then-else 语句是不正确的：只应执行 then/else 中的一个，并且您必须通过从 then 部分中避免 else 部分来告诉处理器。
寄存器的使用不符合标准调用约定。我们期望 a0 中的第一个整数参数，因此我们期望您的 x 位于 a0 中。 return 值也应该在 a0 中。无需保存 t0、t1、t2、a1 — 它们的调用已被破坏，就像 t3、t4（您'保存并且不必保存）。

如果您想拥有自己的调用约定，那很好，但是您要 returning a1 中的值并从堆栈中恢复 a1，以及那些在一起没意义

查看内联评论：

    int Fib_Iter(int x) {   <------------- x is passed in a0
        int Fib_f, Fib_s, temp, Result;
        Fib_f = 0;          <------------- where is this line in the assembly??
        Fib_s = 1;          <------------- where is this line in the assembly??
        if (x == 0)
            Result = Fib_f;
        else if (x == 1)
            Result = Fib_s;
        else
        {
            while (x >= 2)
            {
                temp = Fib_f + Fib_s;
                Fib_f = Fib_s;
                Fib_s = temp;
                x--;
            }
            Result = Fib_s;
        }
        return Result;   <--------- return value goes in a0
    }

查看内联评论：

   fib_iter:
    #temp in x5,
    #f_f in x6
    #f_s in x7
    #x in x12       <---- x should be found in a0
    #res in x11     <---- return value should be put in a0 (at the end of the function)
    #x28 =1
    #x29=2
    addi sp,sp,-16  <---- no stack space needed
    sw x5,0(sp)     <---- no need to save t0
    sw x6,4(sp)     <---- no need to save t1
    sw x7,8(sp)     <---- no need to save t2
    sw x11,12(sp)   <---- no need to save a1
    addi x28,x0,1
    addi x29,x0,2
    bne x12,x0,lb1
    add x11,x6,x0   <---- then part, good
                    <---- missing code to skip else part
                       after executing a then part the code
                       (should skip the else part)
                       should resume the logically next thing after the if
    lb1:
    bne x12,x28,lb2
    add x11,x7,x0
                    <---- missing code to skip else part
    lb2:
    loop:
    add x5,x6,x7
    addi x6,x7,0
    addi x7,x5,0
    addi x12,x12,-1
    bge x12,x29,loop
    addi x11,x7,0
    lw x5,0(sp)      <---- no need to reload t0
    lw x6,4(sp)      <---- no need to reload t1
    lw x7,8(sp)      <---- no need to reload t2
    lw x11,12(sp)    <---- no need to reload a1 (this also clobbers current a1 return value)
    addi sp,sp,16    <---- no stack space is needed
    jalr x0,0(x1)

Answer 2

顺便提一下，你的C可以简化。您不需要单独检查 x == 0 和 x == 1；你可以做 if (x < 2) return x; 到 return 0 或 1。

(我假设你不打算处理负输入，所以 unsigned 可能是一个不错的选择。你的 C returns 1 对于负 x，达到循环但运行 0 次迭代，留下 Fib_s 未修改。但我认为这不是重要的行为。）

asm 中的最小实现比您的版本简单得多。这是一个叶函数（不调用其他函数），因此我们可以对所有内容使用调用破坏（“临时”）寄存器。我将“ABI names”用于寄存器，以帮助跟踪哪些传统上用于 arg 传递和调用破坏与调用保留。

实际上我从 clang 得到了很好的 asm，对于这个 C:

int Fib_Iter(int x){
    if (x < 2) 
        return x;

    int Fib_f = 0, Fib_s = 1;
    while (x >= 2) {
        int temp = Fib_f + Fib_s;
        Fib_f = Fib_s;
        Fib_s = temp;
        x--;
    }
    return Fib_s;
}

Godbolt 编译器资源管理器，RISC-V clang -O3

# clang -O3 output:
# arg:     int x      in  a0
# returns: int Fib(x) in a0

    Fib_Iter:
        addi    a1, zero, 2
        blt     a0, a1, .LBB0_3         # if(x<2) goto ret with x still in a0 as the retval
                # otherwise fall through to the rest and init vars
        mv      a3, zero                # Fib_f = 0
        addi    a2, a0, 1               #  tmpcount = x+1  compiler invented this
        addi    a0, zero, 1             # Fib_s = 1
          # x>=2 is known to be true on first iteration so a standard do{}while() loop structure works
.LBB0_2:                                # do{
        add     a4, zero, a0                # tmp = Fib_s
        addi    a2, a2, -1                  # tmpcount--
        add     a0, a0, a3                  # Fib_s += Fib_f
        add     a3, zero, a4                # Fib_f = tmp
        blt     a1, a2, .LBB0_2         # }while(2<tmpcount);
.LBB0_3:
        ret

相同的逻辑应该适用于无符号数，避免 return 负数 x 的怪异。 clang 用 unsigned 类型编译它有些不同，但我认为没有必要。

tmpcount = x+1 可能可以使用 ble（bge 的反向操作数）而不是 blt 来避免，因此我们可以使用 2 和 x 直接，保存另一条指令。

Fibonacci 展开得非常好：a += b; b += a; 每步需要一条指令，而不是 3 条。在每一步检查分支条件对于静态代码大小来说实际上是最好的，而且更好用于动态指令计数。（相关：存储斐波那契值数组的，包括一个展开版本，每个循环只检查一次分支条件，在进入循环之前使用聪明的启动来处理奇数与偶数）。

（当然，如果您针对非微小 n 进行优化，则可以在 log2(n) 移位和乘法/加法步骤中计算 Fibonacci(n)，而不是 n 次加法，.)

这是一个只重复循环条件的展开。不过，循环退出逻辑对于验证正确性来说并非易事，因此它更短但并不简单。

# arg:   unsigned n  in  a0
# returns:    Fib(n) in  a0

fib_unrolled:
        addi    t2, zero, 2
        bltu    a0, t2, .Lsmall_n         # if(n<2) return n
                # otherwise fall through
        mv      t0, zero               # a=0
        addi    t1, zero, 1            # b=1
             # known: x>=2  (unsigned) before first iteration
.Lloop:                                # do{
        beq     a0, t2, .first_out          # if(n==2) return a+b;
        addi    a0, a0, -2                  # n-=2
        add     t0, t0, t1                  # a += b
        add     t1, t1, t0                  # b += a
        bgeu    a0, t2, .Lloop         # }while(n >= 2);

        mv      a0, t1
.Lsmall_n:
        ret

.Lfirst_out:
        add      a0, t0, t1      # add instead of mv so the beq can be earlier
        ret

我能够让 clang 几乎完全从 C 源代码中重现这一点（使用不同的寄存器编号，但循环顺序完全相同。）包括将 add a+b 块放在常规 fall-through ret 之后。它比我更好地安排循环体中的指令，如果我们假设一个双问题有序流水线，它将两个 fib 序列添加分开。然而，clang 仍然坚持通过将 n >= 2 转换为 n > 1 来浪费一条指令加载 1 常量； RISC-V 可以将 bgeu 作为反向操作数 bltu (https://riscv.org/wp-content/uploads/2017/05/riscv-spec-v2.2.pdf) 并且它已经在寄存器中具有 2。

unsigned fib_unroll(unsigned n){
    if (n < 2) 
        return n;

    unsigned a = 0, b = 1;
    do {
        if (n == 2) return a+b;
        a += b;
        n -= 2;          // clang schedules this way, better for an in-order pipeline
        b += a;
    }while (n >= 2);
    return b;
}

更巧妙的展开：循环内的一个分支。根据计数是奇数还是偶数来初始化两个变量

如果我们想要始终进行偶数次加法，我们可以安排从 0, 1 或 1, 0 开始，具体取决于 n & 1。从 1,0 开始意味着我们首先做 1+=0，基本上跳过迭代。（我最初想出这个技巧）

unsigned fib_unroll2_simpler(unsigned n){
    if (n < 2) 
        return n;
    unsigned b = n&1;  // current
    unsigned a = b^1;  // prev
    // start with 0,1 or 1,0 to get either 1 or 2 after two additions
    do {
        n -= 2;
        a += b;
        b += a;
    }while (n >= 2);
    return b;
}

这对n到结果有很长的数据依赖，尤其是对于小的n，做一些基本无用的工作。在小 n 的广泛无序执行机器上不是很好。所以这很有趣，但对于实际用例，您可能仍需要 table 查找小 n 起点。 clang 做得非常合理，但在开始时浪费了一些指令：

fib_unroll2_simpler:
        addi    a2, zero, 2
        add     a1, zero, a0
        bgeu    a0, a2, .LBB0_2
        add     a0, zero, a1            # copy `n` back into a0 where it already was?!?
        ret
.LBB0_2:                                # the non-tiny n common case has a taken branch
        andi    a0, a1, 1
        xori    a2, a0, 1
        addi    a3, zero, 1          # constant 1 to compare against
.LBB0_3:                                # =>This Inner Loop Header: Depth=1
        addi    a1, a1, -2
        add     a2, a2, a0
        add     a0, a0, a2
        bltu    a3, a1, .LBB0_3      # }while(1<n); fails to reuse the 2 it already had in a2 earlier
        ret

根据分支的成本，最好分支到循环的中间开始。这也意味着我们总是可以从 1,1 当我们进入循环时，而不是花费迭代添加零。但这使得 n==2 成为一个特例：我们需要 return 1，并且不能对 1+1 进行任何加法运算。但是 1 是我们的特例 return 值之一，因此我们可以将该路径调整为 return n != 0 并让函数的其余部分假设 n >= 3 或更高。

通过进一步优化以最小化 RISC-V 的指令数（例如，避免需要在寄存器中构造常量 2 以缩短非 tiny-n 常见情况），我想出了这个。（一个 _v1 版本在 Godbolt link）

unsigned fib_unroll2_branchy_v2(unsigned n){
    if (n <= 2) 
        return n!=0;  // 0 or 1
    n -= 3;     // check for n<=2 and copy n on a machine without FLAGS
    unsigned i = n&1;

    unsigned b = 1;
    //if (n==2) return b;  // already eliminated.
    unsigned a = 1;

    if (i == 0) goto odd_entry;   // n-=3 flips the low bit, so this detects odd n
    do{
        a += b;
odd_entry:
        i += 2;
        b += a;
    }while (i <= n);  // safe even for n near uint_max because we subtracted 3 first
    return b;
}

clang 在这里没有做最佳工作，在我们有条件地跳入的循环中浪费了一些复制指令。（当您在 C 中执行此操作时，编译器通常会遇到困难，但有时这对于手写 asm 来说是一个有用的技巧）。因此，这里有一个手写版本，它并没有那么糟糕：

fib_unroll2_branchy_v2:
        addi    t2, a0, -3            # n -= 3  (leaving a copy of the orig)
        bleu    t2, a0, .Lsmall_n     # if( (n-3) > n) detect wrapping, i.e. n<=2

        andi    t0, t2, 1             # i = n&1
        addi    a0, zero, 1           # b = retval, replacing orig_n
        addi    a1, zero, 1           # a
        beqz    t0, .Lodd_entry       # even i means orig_n was odd

.Lloop:                              # do{
        add     a1, a1, a0            # a += b
.Lodd_entry:
        addi    t0, t0, 2             # i += 2
        add     a0, a0, a1            # b += a
        bleu    t0, t2, .Lloop       # }while(i <= n);
        ret

.Lsmall_n
        snez    a0, a0                # return orig_n != 0 handles n<3
        ret

我可能错过了一些优化。在 fib_unroll2_simpler（无分支的）中，最好找到一些 ILP（而不是基本上除了最终 n-=2 之外的一个长依赖链），或者在到达循环终止时快速启动进行更少的迭代，而不是将循环的前半部分变成空操作。这个版本只需要最终结果，不需要像我的 x86 答案那样将每个 Fib 值存储到数组中。

即使是 branchy_v2 版本也感觉像是一个比我们真正想要初始化的更长的 dep 链 i，但在不是超宽的管道上它会很好。

调试迭代斐波那契（从 C 手动转换为 RISC-V）

Debugging iterative Fibonacci (converted by hand from C to RISC-V)

assembly

fibonacci

riscv

更巧妙的展开：循环内的一个分支。根据计数是奇数还是偶数来初始化两个变量