Sum even elements of an array in MSVC inline asm

Question

I'm trying to write a function in assembly that returns the sum of the even elements of an array. This is my code:

       int sum(int *a, int n)
       {
          int S = 0;
          _asm {

          mov eax, [ebp + 8]
          mov edx, 0
          mov ebx, [ebp + 12]
          mov ecx, 0
          for1: cmp ecx, ebx
          jge endfor
          and [eax + ecx * 4], 1
          jz even
          inc ecx
          jmp for1

          even: add edx, [eax + ecx * 4]            
          inc ecx
          jmp for1
          endfor: mov S, edx
          }
      return S;
      }

But it doesn't work. Does anyone know what the problem is and how I can fix it? Thanks.

Answer 1

Just some guesses (especially since I don't know which compiler you are using, and you haven't made clear what you mean by "not working"):

      mov eax, [ebp + 8]
      mov edx, 0
      mov ebx, [ebp + 12]
      mov ecx, 0

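I assume this is meant to load the two arguments a and n into registers. How do you know the offsets are correct? Can't you just refer to the arguments by name?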

      and [eax + ecx * 4], 1

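This destroys the element in the input array (setting it to 1 if it was odd and to 0 if it was even), which is probably not what you want. You should probably use the test instruction (which is non-destructive) instead of and.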

      even: add edx, [eax + ecx * 4]

This will add 0, because the and instruction mentioned above has already set [eax + ecx * 4] to 0. Based on that, I would expect your function to always return 0.
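To make the difference concrete, here is a small C model of the two instructions. It is only a sketch and not the asker's code: the function names are made up for illustration. `and [mem], 1` behaves like `a[i] &= 1`, which writes the result back to the array, while `test` computes the same AND only to set the flags.

```c
/* Models the original loop: `and [eax + ecx*4], 1` stores its result,
   so a[i] becomes 0 (even) or 1 (odd) before it is ever added. */
int sum_even_with_and(int *a, int n)
{
    int sum = 0;
    for (int i = 0; i < n; i++) {
        a[i] &= 1;            /* destructive: clobbers the input element */
        if (a[i] == 0)        /* ZF set, so the "even" path is taken */
            sum += a[i];      /* always adds 0 */
    }
    return sum;
}

/* Models the suggested fix: `test` ANDs the operands only to set flags. */
int sum_even_with_test(const int *a, int n)
{
    int sum = 0;
    for (int i = 0; i < n; i++) {
        if ((a[i] & 1) == 0)  /* non-destructive check of the low bit */
            sum += a[i];
    }
    return sum;
}
```

For the input {1, 2, 3, 4, 6}, the `test` model returns 12, while the `and` model returns 0 and leaves the array as {1, 0, 1, 0, 0}.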

Answer 2

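If you're going to fetch the arguments from the stack yourself, why use inline asm at all? Either let the compiler give you the args (so it still works after inlining), or write pure asm in a separate file.

Here is a good version of the function as you wrote it. Note the indentation and the more efficient code: the branch structure, and loading into a register rather than comparing against memory and then using the same location as a memory operand for add. (If the condition were expected to be unlikely, a test or cmp with a memory operand would make sense, though.)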

int sum(const int *a, int n)
{
  int S;

  _asm {
      mov    ecx, n
      xor    eax,eax              // Sum = 0
      test   ecx,ecx
      jz     early_out            // n==0 case

      mov    esi, a
      lea    edi, [esi + 4*ecx]   // end pointer = &a[n]
      xor    ecx,ecx              // zeroed register for CMOV
  sum_loop:                   // do{
      // Assume even/odd is unpredictable, so do it branchlessly:
      mov    edx, [esi]
      test   edx, 1               // ZF cleared for odd numbers only
      cmovnz edx, ecx             // zero out EDX for odd numbers only
      add    eax, edx             // add zero or a[i]

      add    esi, 4
      cmp    esi, edi         // while(++pointer < end_pointer)
      jb   sum_loop

  early_out:
      mov S, eax     // MSVC does actually support leaving a value in EAX and falling off the end of a non-void function (without return),
                     // but clang -fasm-blocks doesn't.  And there's no way to explicitly tell the compiler where the output(s) are in this dumb syntax, unlike GNU C inline asm
  }

  return S;
}
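For readers who want to check the asm against something portable, the branchless selection can be modelled in C roughly as follows. This is an illustrative sketch, not what MSVC generates: `sum_even_branchless` is a made-up name, and the mask trick stands in for the `cmovnz` that zeroes EDX for odd values.

```c
/* Portable model of the branchless loop: add a[i] when it is even,
   add 0 when it is odd, with no if statement in the loop body. */
int sum_even_branchless(const int *a, int n)
{
    unsigned sum = 0;                    /* unsigned: wraparound is well defined */
    for (int i = 0; i < n; i++) {
        unsigned x = (unsigned)a[i];
        unsigned keep = (x & 1u) - 1u;   /* odd -> 0, even -> all ones */
        sum += x & keep;                 /* x if even, 0 if odd */
    }
    return (int)sum;                     /* assumes two's complement int, as on x86 */
}
```

Whether a compiler turns this into `cmov`, a mask, or a branch is up to the optimizer; the point is only that the value added is selected by data rather than by control flow.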

With a branch, the simple way is

.sum_loop:
    mov   edx, [esi]
    test  edx, 1
    jnz .odd                    // conditionally skip the add
    add   eax, edx
.odd:
    add    esi, 4
    cmp    esi, edi             // pointer < end pointer
    jb  .sum_loop

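Your original branch structure (duplicating the loop tail for the odd and even branches) is a useful technique in some cases, but not here, where the two paths are identical again once the element has been added to the total.

In asm, write loops with the conditional branch at the bottom, in the do{ }while() style marked in the comments of the code above. Check for conditions that imply zero iterations before entering the loop. You don't want a not-taken cmp/jcc plus a taken unconditional branch on every iteration.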
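The loop shape described here can be sketched in C as below. It is a hedged illustration with a made-up function name; it also treats negative n as zero iterations, which is an assumption of this sketch (the asm above only checks for n == 0).

```c
/* Loop shaped like the asm: rule out zero iterations once up front,
   then keep the only loop branch at the bottom. */
int sum_even_dowhile(const int *a, int n)
{
    int sum = 0;
    if (n <= 0)
        return sum;              /* early out before entering the loop */

    const int *end = a + n;      /* end pointer = &a[n] */
    do {
        int x = *a;
        if ((x & 1) == 0)        /* conditionally skip the add for odd values */
            sum += x;
    } while (++a < end);         /* single compare-and-branch per iteration */
    return sum;
}
```

A `for` or `while` loop written naively in asm needs a compare and branch at the top plus a jump back at the bottom; the up-front check removes the top test from the loop body.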

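Starting with Haswell, lodsd is about as efficient on Intel as mov eax, [esi] / add esi, 4. It is 3 uops on earlier Intel and on AMD, so there it saves code size at the expense of speed. In this case it would mean we couldn't keep the sum in EAX, though that doesn't matter, because the safe way to get data out of an asm block is to mov-store it, not to leave it in the return-value register. MSVC does seem to support the semantics of EAX as the return-value register even after inlining a function that contains an asm{} block, but I don't know whether that is documented. clang -fasm-blocks definitely doesn't.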

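test dl, 1 would be slightly more efficient in the current code, with fewer instruction bytes (3 instead of 6): there is no test r/m32, sign_extend_imm8 encoding, only one with a 32-bit immediate. So if you want to test bits in the low byte, use the low-8 partial register. For simplicity, I tested the same register we loaded into.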

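Loading into EAX would actually make this more efficient still, allowing test al,1 as a 2-byte instruction instead of the 3 bytes needed for any other byte register. Smaller code size is generally better, all else being equal, except when it happens to cause unlucky alignment for some later code, e.g. the JCC erratum on Skylake if a macro-fused cmp/jcc touches the end of a 32-byte block.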
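The byte counts above can be seen by writing out the machine code for the three forms. The arrays below are a sketch based on the standard TEST encodings in 32-bit mode (the array names are made up); it is worth confirming them against an assembler listing for your own toolchain.

```c
/* Machine-code bytes for three ways of testing bit 0 (32-bit mode). */

/* test edx, 1 : F7 /0 with a full 32-bit immediate */
const unsigned char test_edx_imm32[] = { 0xF7, 0xC2, 0x01, 0x00, 0x00, 0x00 };

/* test dl, 1  : F6 /0 with an 8-bit immediate */
const unsigned char test_dl_imm8[]   = { 0xF6, 0xC2, 0x01 };

/* test al, 1  : A8 ib, the short form that exists only for AL */
const unsigned char test_al_imm8[]   = { 0xA8, 0x01 };
```

`sizeof` on the three arrays gives 6, 3 and 2, matching the sizes quoted in the text.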

See the x86 tag wiki for guides and other resources, also https://whosebug.com/tags/inline-assembly/info
