循环"xorl %edx,%eax; shrl ,%edx"的目的是什么？

Question

我有以下 x86 汇编代码：

  movl   8(%ebp), %edx  //get an argument from the caller
  movl   [=11=], %eax
  testl  %edx, %edx
  je     .L1            
.L2:                   // what's the purpose of this loop body?
  xorl   %edx, %eax
  shrl   , %edx
  jne    .L2
.L1:
  andl   , %eax

课本给出的对应C代码如下

int f1(unsigned x)
{
    int y = 0;
    while(x != 0) {
        __________;
    }
    return __________;
 }

本书要求读者填空并回答"What does it do?"

的问题

我无法将循环体合并到一个 C 表达式中。我可以说出循环体的作用，但我不知道它的用途。教科书上还说这里的%eax存放的是return值。那么...

的目的是什么

andl  , %eax

我也不知道

Answer 1

看起来整个循环的目的是对 32 位 arg 中的所有位进行异或运算。即计算 parity.

从最后一条指令 (and ,%eax) 逆向计算，我们知道只有结果的低位重要。

考虑到这一点，xor %edx,%eax 变得更加清晰：将 %edx 的当前低位异或为 %eax。高垃圾无所谓

shr循环直到x的所有位都被移出。我们总是可以循环 32 次以获取所有位，但这比停止 x 为 0 时效率更低。（由于 XOR 的工作原理，我们不需要在 0 位中进行实际的 XOR；那没有效果。）

一旦我们知道了函数的作用，填写 C 就变成了巧妙/紧凑的 C 语法练习。起初我以为 y ^= (x>>=1); 会适合循环，但是在第一次使用它之前 x 会发生变化。

我认为在一个 C 语句中执行此操作的唯一方法是使用 , 运算符（它确实引入了 sequence point，因此读取左侧的 x 是安全的边，在右边修改一个,)。所以，y ^= x, x>>=1; 适合。

或者，为了使代码更具可读性，只需作弊并将两个语句放在同一行上并带有 ;.

int f1(unsigned x) { int y = 0; while(x != 0) { y ^= x; x>>=1; } return y & 1; }

这编译成与问题 中显示的基本相同的 asm，使用 gcc5.3 -O3 on the Godbolt compiler explorer. The question's code 到 mov [=27=], %eax，并优化 gcc 愚蠢的重复 ret 指令。（或者可能使用了没有这样做的早期版本的 gcc。）

循环非常低效：这是一种有效的方式：

我们不需要复杂度为 O(n) 的循环（其中 n 是 x 的位宽度）。相反，我们可以获得 O(log2(n)) 复杂度，并且实际上利用 x86 技巧只执行前两步。

对于由寄存器确定的指令，我省略了操作数大小后缀。（除了 xorw 使 16 位异或显式。）

#untested parity: # no frame-pointer boilerplate xor %eax,%eax # zero eax (so the upper 24 bits of the int return value are zeroed). And yes, this is more efficient than mov [=11=], %eax # so when we set %al later, the whole of %eax will be good. movzwl 4(%esp), %edx # load low 16 bits of `x`. (zero-extend into the full %edx is for efficiency. movw 4(%esp), %dx would work too. xorw 6(%esp), %dx # xor the high 16 bits of `x` # Two loads instead of a load + copy + shift is probably a win, because cache is fast. xor %dh, %dl # xor the two 8 bit halves, setting PF according to the result setnp %al # get the inverse of the CPU's parity flag. Remember that the rest of %eax is already zero, so the result is already zero-extended to 32-bits (int return value) ret

是的，没错，x86 has a parity flag (PF) that's updated from the low 8 bits of the result of every instruction that "sets flags according to the result", like xor。

我们使用 np 条件，因为 PF = 1 表示偶校验：所有位的异或 = 0。我们需要 return 0 的倒数来实现偶校验。

为了利用它，我们通过将高半部分降低到低半部分并组合来进行 SIMD 样式的水平缩减，重复两次以将 32 位减少到 8 位。

在设置标志的指令之前将 eax 置零（使用异或）比设置标志/setp %al/movzbl %al, %eax 稍微更有效，正如我在中解释的那样。

或者，正如@EOF 指出的那样，如果 CPUID POPCNT feature bit is set，您可以使用 popcnt 并测试低位以查看设置的位数是偶数还是奇数。（另一种看待这个问题的方式：xor 是不带进位的加法，所以无论是将所有位异或还是水平相加，低位都是相同的）。

GNU C 也有 __builtin_parity 和 __builtin_popcnt，如果你告诉编译器编译目标支持它（使用 -march=... 或 -mpopcnt），它们会使用硬件指令，否则编译为目标机器的有效序列。 Intel 内部函数总是编译为机器指令，而不是回退序列，在没有适当的 -mpopcnt 目标选项的情况下使用它们是编译时错误。

不幸的是，gcc 无法将纯 C 循环识别为奇偶校验计算并将其优化到此。一些编译器（比如 clang 和可能的 gcc）可以识别某些类型的 popcount 习语，并将它们优化为 popcnt 指令，但这种模式识别不会在这种情况下发生。 :(

See these on godbolt.

int parity_gnuc(unsigned x) { return __builtin_parity(x); } # with -mpopcnt, compiles the same as below # without popcnt, compiles to the same upper/lower half XOR algorithm I used, and a setnp # using one load and mov/shift for the 32->16 step, and still %dh, %dl for the 16->8 step. #ifdef __POPCNT__ #include <immintrin.h> int parity_popcnt(unsigned x) { return _mm_popcnt_u32(x) & 1; } #endif # gcc does compile this to the optimal code: popcnt 4(%esp), %eax and , %eax ret

另请参阅 x86 标签 wiki 中的其他链接。

循环"xorl %edx,%eax; shrl ,%edx"的目的是什么？

What's the purpose of looping "xorl %edx,%eax; shrl $1,%edx"?

assembly

x86

reverse-engineering

decompiling

att

循环非常低效：这是一种有效的方式：