Check if ax is divisible by 16

Question

How can we check whether ax is divisible by 16?
I know we can do it with this command:

AND 0x000f

Is there a faster command? (I think idiv is slow.)

Answer 1

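Yes, and is one of the fastest instructions, with the same throughput and latency as instructions like add, and much faster than idiv ¹. (See https://uops.info/, https://agner.org/optimize/, and Why does C++ code for testing the Collatz conjecture run faster than hand-written assembly?, where the compiler's code was faster because it used a shift instead of DIV.)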

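If you're going to branch on it, test al, 0x0f / jnz not_multiple_of_16 is even faster on some CPUs (e.g. AMD, including Zen, or Intel Nehalem and earlier) which can macro-fuse TEST with a JCC but not AND/JCC. TEST is like AND but only sets FLAGS, without writing the destination. (So there is no extra AND latency in the path of a later read of AX / EAX, and the original value is still available to read.)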

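Also, test al, imm8 is only 2 bytes, saving code size compared with the 3-byte and eax, 0xf or the 5-byte test eax, 0xf ². (I was assuming 32-bit or 64-bit mode; I didn't notice the AX in the title until after writing this, which hints that you might be optimizing for 16-bit mode. It makes no significant difference overall.)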

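If you want to modify the register value itself to actually reduce it modulo 16, then use and eax, 0xf. (Not and al, 0xf: you need to zero the high bytes.) Otherwise leave EAX unmodified and only write FLAGS.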

Sandybridge-family can macro-fuse AND/JCC, but and al, 0xf would write AL, introducing partial-register stalls on P6-family CPUs (Nehalem and earlier) if you wanted to read EAX later. (See Why doesn't GCC use partial registers?) On SnB, an RMW operation does not rename AL separately from EAX, so no later merging is needed there, and it is not a false dependency because you explicitly wanted to test that value.

If you want a 0 / 1 result in another register, test al, 0xf / setnz cl does the job.

  xor   ecx, ecx
  test  al, 0xf
  setnz cl              ; ECX =  bool(x % 16U)

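(If you're using setnz or cmovnz, test has no advantage over and except for not modifying EAX. Macro-fusion only happens between test and conditional branches, not setnz. So you can use and instead of test here if you also want to modify the register as well as create a boolean.)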
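For readers who think in C rather than asm, here is a rough sketch of what the two flag-setting idioms above compute. The function names are made up for illustration and do not come from the question, and a compiler is free to pick different instructions than the ones named in the comments.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: true when the low 4 bits of x are all zero,
   i.e. x is a multiple of 16. This is the condition that
   test al, 0xf / jnz not_multiple_of_16 branches on. */
bool is_multiple_of_16(uint16_t x) {
    return (x & 0xF) == 0;
}

/* Hypothetical helper: 1 when x % 16 != 0, else 0.
   This is the value that test al, 0xf / setnz cl leaves in CL. */
unsigned remainder_nonzero(uint16_t x) {
    return (x & 0xF) != 0;
}
```

Only the low 4 bits matter, which is why testing AL alone is enough even when the value lives in AX or EAX.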

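Footnote 1: idiv is much slower, needs extra instructions to set up EDX and to put the divisor in another register, and doesn't set FLAGS in a useful way according to the result. In fact, div and idiv are the slowest integer math instructions on most CPUs; only other heavily-microcoded instructions such as int 0x80 or syscall, or e.g. a large rep movsb, are slower.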

Recent CPUs like Broadwell and later have fairly high-performance hardware dividers (https://uops.info/, https://agner.org/optimize/), and apparently Ice Lake improves them some more (especially for 64-bit operand-size), but compilers still try hard to avoid division. For example, compilers will use multiple other instructions to implement x / 10, even for signed int x, where that takes not just a multiplicative inverse for the division but also some sign-bit handling to round towards 0 for negative numbers. (See Why does GCC use multiplication by a strange number in implementing integer division?)

Of course, for power-of-2 divisors, compilers know they can use AND.

unsigned mod16(unsigned x) {
    return x % 16;
}

int mod16_signed(int x){
    return x % 16;            // negative for negative x, can't just use AND
}

Asm from the Godbolt compiler explorer, compiled with clang and GCC using -O3 -m32 -mregparm=3 (so the first argument arrives in EAX and the result is returned in EAX).

# GCC and clang of course do this
mod16:
        and     eax, 15
        ret

But the signed version is harder:

# clang12.0 -O3 -m32 -mregparm=3
mod16_signed:
        lea     ecx, [eax + 15]
        test    eax, eax                 # set FLAGS from x
        cmovns  ecx, eax                 # ecx = !(x<0) ? x : x+15
        and     ecx, -16                 # round ecx down to a multiple of 16
        sub     eax, ecx                 # return  x - round_down(ecx)
        ret

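Clang's version has some instruction-level parallelism (LEA and TEST can run in parallel) and is probably best on modern CPUs with single-uop cmov (AMD, and Intel Broadwell and later). All 5 of GCC's instructions depend on the previous one, so it has 5 cycles of latency versus 4 for clang, or also 5 for clang on CPUs with 2-uop cmov.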
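To see why clang's sequence gives the C remainder, here is a rough C model of it, one statement per instruction. The function name is hypothetical. The sketch assumes a 32-bit two's complement int32_t and uses unsigned arithmetic so the wraparound that the asm relies on is well defined; the final conversion back to int32_t is implementation-defined in older C standards but behaves as expected on mainstream compilers.

```c
#include <stdint.h>

/* Hypothetical C model of the clang sequence above; not compiler output. */
int32_t mod16_signed_clang_model(int32_t x) {
    uint32_t ux = (uint32_t)x;
    uint32_t biased = ux + 15u;                /* lea  ecx, [eax + 15]        */
    uint32_t t = (x >= 0) ? ux : biased;       /* test eax, eax / cmovns      */
    t &= ~15u;                                 /* and  ecx, -16               */
    return (int32_t)(ux - t);                  /* sub  eax, ecx               */
}
```

The bias of 15 for negative inputs makes the AND round towards zero instead of towards negative infinity, which is what the truncating % operator in C needs.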

GCC uses a slightly different strategy: cdq broadcasts the sign bit into EDX, then a logical right shift by 28 leaves 0 or 0xf in EDX.

# gcc10.3 -O3 -m32 -mregparm=3
mod16_signed:
        cdq
        shr     edx, 28             # edx = (x>=0) ? 0 : 0xf
        add     eax, edx            # eax = (x>=0) ? x : x+15
        and     eax, 15
        sub     eax, edx            # eax -= (x>=0) ? 0 : 15
        ret
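The same kind of rough C model works for GCC's sequence. Again the function name is hypothetical, int32_t is assumed to be 32-bit two's complement, and unsigned arithmetic stands in for the wrapping register arithmetic.

```c
#include <stdint.h>

/* Hypothetical C model of the GCC sequence above; not compiler output. */
int32_t mod16_signed_gcc_model(int32_t x) {
    uint32_t ux = (uint32_t)x;
    uint32_t sign = (x < 0) ? 0xFFFFFFFFu : 0u;  /* cdq: EDX = all copies of the sign bit */
    uint32_t bias = sign >> 28;                  /* shr edx, 28: 0 or 0xf                 */
    uint32_t r = ux + bias;                      /* add eax, edx                          */
    r &= 15u;                                    /* and eax, 15                           */
    r -= bias;                                   /* sub eax, edx                          */
    return (int32_t)r;
}
```

For x = -1 this computes ((-1 + 15) & 15) - 15 = 14 - 15 = -1, matching -1 % 16 in C.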

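Footnote 2: TEST has no test r/m32, sign_extended_imm8 form, unlike all the other immediate instructions from the original 8086. Such a form would rarely have been useful: only when the MSB of the immediate is set (so the upper bits are all-ones), i.e. when you want to test whether any bits are set except some in the low 8. For example, test ax, -8 is like checking whether ax >> 3 is non-zero.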
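The test ax, -8 remark can be checked with a small C sketch. The function names are invented for illustration; the mask 0xFFF8 is what the immediate -8 looks like at 16-bit operand-size.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of test ax, -8: the result is non-zero
   when any of bits 3..15 of ax is set. */
bool any_bit_above_low3(uint16_t ax) {
    uint16_t mask = (uint16_t)-8;   /* 0xFFF8 */
    return (ax & mask) != 0;
}

/* The equivalent formulation from the text: ax >> 3 is non-zero. */
bool shifted_nonzero(uint16_t ax) {
    return (ax >> 3) != 0;
}
```

Both functions return the same answer for every 16-bit input, since masking off the low 3 bits and shifting them out discard exactly the same information.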

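x & 0 = 0, so test al, 1 always gives the same result as test ax, 1; it doesn't write the destination, so only the FLAGS result matters. You can use test ah, imm8 if you want, and test ax, imm16 is only 1 byte longer in 16-bit mode, so not much of a saving was missed when the 8086 was designed. With 32-bit operand-size it's a 3-byte difference, but modern CPUs usually don't bottleneck on code fetch.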

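(Smaller is usually better: it reduces overall L1i cache misses and usually packs better into the uop cache, and smaller binaries load faster from disk and have fewer iTLB misses. So compilers should favour smaller code when all else is equal. It's usually not worth using a slower instruction to save code size, though, and it's still worth unrolling hot loops somewhat.)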

optimization

x86

assembly

micro-optimization