AND operator + addition faster than a subtraction

Question

I measured the execution time of the following code:

volatile int r = 768;
r -= 511;

volatile int r = 768;
r = (r & ~512) + 1;

Assembly:

mov     eax, DWORD PTR [rbp-4]
sub     eax, 511
mov     DWORD PTR [rbp-4], eax

mov     eax, DWORD PTR [rbp-4]
and     ah, 253
add     eax, 1
mov     DWORD PTR [rbp-4], eax

Results:

Subtraction time: 141ns   
AND + addition: 53ns

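I have run the snippets multiple times, and the results are consistent.
Can someone explain why this is the case, even though the AND + addition version has one more line of assembly?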

Answer 1

Your assertion that one snippet is faster than the other is mistaken.
If you look at the code:

mov     eax, DWORD PTR [rbp-4]
....
mov     DWORD PTR [rbp-4], eax

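you will see that the running time is dominated by the load/store to memory.
Even on Skylake this takes at least 2+2 = 4 cycles.
The 1 cycle for the sub, or the 3*) cycles for the and bytereg/add full reg, simply disappears in the memory access time.
On older processors such as Core2, a load/store pair to the same address takes a minimum of 5 cycles.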

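It is hard to time such short code sequences, and care should be taken to ensure that your methodology is correct.
You also need to remember that rdtsc is not accurate on Intel processors and, to boot, executes out of order.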

If you use proper timing code like:

.... x 100,000    //stress the cpu using integercode in a 100,000 x loop to ensure it's running at 100%
cpuid             //serialize instruction to make sure rdtscp does not run early.
rdtscp            //use the serializing version to ensure it does not run late
push eax
push edx
mov reg1,1000*1000   //time a minimum of 1,000,000 runs to ensure accuracy
loop:
...                  //insert code to time here
sub reg1,1           //don't use dec, it causes a partial register stall on the flags.
jnz loop             //loop
//kernel mode only!
//mov eax,cr0          //reading and writing to cr0 serializes as well.
//mov cr0,eax
cpuid                //serialization in user mode.
rdtscp               //make sure to use the 'p' version of rdtsc.
push eax
push edx
pop 4x               //retrieve the start and end times from the stack.

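Run the timing code 100 times and take the lowest cycle count.
Now you will have an accurate count to within 1 or 2 cycles.
You will also want to time an empty loop and subtract its time, so that you can see the net time spent executing the instructions of interest.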

If you do this, you will find that add and sub run at exactly the same speed, just as they do/did on every x86/x64 CPU since the 8086.
This, of course, is also what Agner Fog, the Intel CPU manuals, the AMD CPU manuals, and just about any other available source say.

*) and ah,value takes 1 cycle, then the CPU stalls for 1 cycle because of the partial register write, and the add eax,value takes another cycle.

Optimized code

sub     DWORD PTR [rbp-4],511

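might be faster if you do not need to reuse the value elsewhere. The latency is slow, at 5 cycles, but the reciprocal throughput is 1 cycle, which is much better than either of your versions.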

Answer 2

The full machine code is

8b 45 fc                mov    eax,DWORD PTR [rbp-0x4]
2d ff 01 00 00          sub    eax,0x1ff
89 45 fc                mov    DWORD PTR [rbp-0x4],eax

vs.

8b 45 fc                mov    eax,DWORD PTR [rbp-0x4]
80 e4 fd                and    ah,0xfd
83 c0 01                add    eax,0x1
89 45 fc                mov    DWORD PTR [rbp-0x4],eax

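This means the code of the second operation is in fact just one byte longer (11 vs. 12). Most likely the CPU fetches code in units larger than a byte, so fetching is not much slower. It can also decode multiple instructions at the same time, so the first sample has no advantage there either. Executing a single add, and, or sub takes one ALU pass each, so each occupies a single execution unit for just one clock. On a 1 GHz CPU, that would be a 1 ns advantage for you.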

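So basically the two operations are more or less the same. The difference may be attributable to some other factor. Perhaps the memory cell rbp-0x4 was still in the L1 cache before you ran the second snippet. Or the instructions of the first snippet were placed less favorably in memory. Or the CPU was able to run the second snippet speculatively before your measurement started, and so on. One would need to know how you measured the speed to decide.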


c++

assembly

execution-time