Inline assembly size mismatch for 8-bit rotate

Question

I'm trying to write a rotate-left operation in C using inline assembly, like so:

byte rotate_left(byte a) {
    __asm__("rol %0, ": "=a" (a) : "a" (a));
    return a;
}

(where byte is a typedef for unsigned char).

This produces the error

/tmp/ccKYcEHR.s:363: Error: operand size mismatch for `rol'.

What is the problem here?

Answer 1

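AT&T syntax uses the opposite operand order from Intel syntax. The rotate count has to come first, not last, so the count belongs before the comma and the destination after it: rol , %0.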
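As a rough illustration of that operand order, here is a minimal sketch of a corrected statement. It assumes a rotate count of 1 (the count in the question is not shown), uses the hypothetical name rotate_left_asm, and adds a plain-C fallback so the snippet still builds on non-x86 targets. The constraint choices are one reasonable option, not the only one.

```c
#include <stdint.h>

/* Hypothetical helper: rotate an 8-bit value left by one bit.
   The count of 1 is an assumption made for this sketch. */
static inline uint8_t rotate_left_asm(uint8_t a) {
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    /* AT&T order: source (the count) first, destination last.
       "+q" makes a a read/write operand in a byte-addressable register,
       and the b suffix states the 8-bit operand size explicitly. */
    __asm__("rolb $1, %0" : "+q"(a) : : "cc");
#else
    /* Fallback for other targets: the same rotation in plain C. */
    a = (uint8_t)((a << 1) | (a >> 7));
#endif
    return a;
}
```

Using a single "+q" operand avoids naming the same variable as both input and output, which is what the "=a" / "a" pair in the question was doing by hand.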

Also, you don't need inline asm for this, and you shouldn't use it: https://gcc.gnu.org/wiki/DontUseInlineAsm

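As described in Best practices for circular shift (rotate) operations in C++, GNU C has intrinsics for narrow rotates, because the rotate-idiom recognition code fails to optimize away an and of the rotate count. x86 shifts and rotates mask the count with count & 31, even for 8-bit and 16-bit operands, but a rotate still wraps around, so the masking is harmless there. It does matter for shifts, though.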
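For reference, the kind of portable rotate idiom being discussed can be sketched as follows. This is an illustrative version for 8-bit values, not code quoted from the linked answer, and the name rotl8 is an assumption. The explicit masking of the count is the and that the compiler may or may not optimize away.

```c
#include <stdint.h>

/* Hypothetical portable 8-bit rotate-left.
   Masking keeps both shift counts in the range 0..7, so the
   expression is well defined for any n, including 0 and 8. */
static inline uint8_t rotl8(uint8_t x, unsigned n) {
    n &= 7;                                  /* rotate count modulo 8 */
    return (uint8_t)((x << n) | (x >> ((8 - n) & 7)));
}
```

With n equal to 0 both shift counts become 0, so the result is simply x, which is why the second mask is there.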

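Anyway, gcc has a builtin function for narrow rotates that avoids any overhead. x86intrin.h provides a __rolb wrapper for it, while MSVC has its own _rotl8 / _rotr8 and so on in intrin.h. clang doesn't support either the __builtin or the x86intrin.h rotate wrappers, but gcc and ICC do.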

#include <stdint.h>
uint8_t rotate_left_byte_by1(uint8_t a) {
    return __builtin_ia32_rolqi(a, 1);  // qi = quarter-integer
}

I'm using uint8_t from stdint.h like a normal person, instead of defining a byte type.

This doesn't compile at all with clang, but it compiles as you'd hope with gcc7.2:

rotate_left_byte_by1:
    movl    %edi, %eax
    rolb    %al
    ret

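This gives you a function that compiles as efficiently as your inline asm ever could, but that can also optimize away completely for compile-time constants. The compiler knows how it works and what it does, so it can optimize accordingly.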
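To tie together the per-compiler options mentioned above, here is a hedged sketch of a wrapper that picks an intrinsic where one is described as available and falls back to plain C elsewhere. The name rotate_left_by1_dispatch and the preprocessor checks are assumptions of this sketch, and it has not been checked against every compiler version.

```c
#include <stdint.h>
#if defined(_MSC_VER)
#include <intrin.h>   /* MSVC rotate intrinsics */
#endif

/* Hypothetical wrapper: rotate an 8-bit value left by one bit. */
static inline uint8_t rotate_left_by1_dispatch(uint8_t a) {
#if defined(_MSC_VER)
    return _rotl8(a, 1);
#elif defined(__GNUC__) && !defined(__clang__) && \
      (defined(__x86_64__) || defined(__i386__))
    /* gcc (and ICC, per the answer) on x86: the narrow-rotate builtin. */
    return __builtin_ia32_rolqi(a, 1);
#else
    /* clang and non-x86 targets: plain C rotate by one. */
    return (uint8_t)((a << 1) | (a >> 7));
#endif
}
```

Every branch computes the same result, so callers do not need to know which one was selected.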

c

x86

inline-assembly

att