根据 gcc -O2,汇编为什么 "lea eax, [eax + eax*const]; shl eax, eax, const;" 比 "imul eax, eax, const" 组合得更快?
Assembly why is "lea eax, [eax + eax*const]; shl eax, eax, const;" combined faster than "imul eax, eax, const" according to gcc -O2?
我正在使用 Godbolt 来组装以下程序:
#include <stdio.h>
volatile int a = 5;
volatile int res = 0;
int main() {
res = a * 36;
return 1;
}
如果我用-Os优化,生成的代码自然是:
mov eax, DWORD PTR a[rip]
imul eax, eax, 36
mov DWORD PTR res[rip], eax
但是如果我使用-O2,生成的代码是这样的:
mov eax, DWORD PTR a[rip]
lea eax, [rax+rax*8]
sal eax, 2
mov DWORD PTR res[rip], eax
所以它不是乘以 5*36,而是 5 -> 5+5*8=45 -> 45*4 = 180。我认为这是因为 1 imul 比 1 lea + 1 左移慢。
但是在lea指令中,需要计算rax+rax*8
,其中包含1个加法+1个乘法。那么为什么它仍然比 1 imul 快呢?是因为 lea 内部的内存寻址是免费的吗?
编辑 1: 另外,[rax + rax*8]
是如何翻译成机器码的?它会被编译成额外的 2 条指令 (shl, rbx, rax, 3; add rax, rax, rbx;
) 还是其他指令?
编辑2: 下面的结果令人惊讶。我做了一个循环,然后使用-O2生成代码,然后复制文件并替换上面的代码段来自 -Os。所以 2 个汇编文件在任何地方都是相同的,除了我们进行基准测试的指令。 运行 在 Windows 上,命令是
gcc mul.c -O2 -S -masm=intel -o mulo2.s
gcc mulo2.s -o mulo2
// replace line of code in mulo2.s, save as muls.s
gcc muls.s -o muls
cmd /v:on /c "echo !time! & START "TestAgente" /W mulo2 & echo !time!"
cmd /v:on /c "echo !time! & START "TestAgente" /W muls & echo !time!"
#include <stdio.h>
volatile int a = 5;
volatile int res = 0;
int main() {
size_t LOOP = 1000 * 1000 * 1000;
LOOP = LOOP * 10;
size_t i = 0;
while (i < LOOP) {
i++;
res = a * 36;
}
return 0;
}
; mulo2.s
.file "mul.c"
.intel_syntax noprefix
.text
.def __main; .scl 2; .type 32; .endef
.section .text.startup,"x"
.p2align 4
.globl main
.def main; .scl 2; .type 32; .endef
.seh_proc main
main:
sub rsp, 40
.seh_stackalloc 40
.seh_endprologue
call __main
movabs rdx, 10000000000
.p2align 4,,10
.p2align 3
.L2:
mov eax, DWORD PTR a[rip]
lea eax, [rax+rax*8] ; replaces these 2 lines with
sal eax, 2 ; imul eax, eax, 36
mov DWORD PTR res[rip], eax
sub rdx, 1
jne .L2
xor eax, eax
add rsp, 40
ret
.seh_endproc
.globl res
.bss
.align 4
res:
.space 4
.globl a
.data
.align 4
a:
.long 5
.ident "GCC: (GNU) 9.3.0"
令人惊讶的是,结果是 -Os
版本 始终 比 -O2
快(平均 4.1s vs 5s,Intel 8750H CPU,每个.exe文件都是运行几次)。所以在这种情况下,编译器优化错误。有人可以根据这个基准提供新的解释吗?
编辑 3: 为了测量指令缓存行的效果,这里有一个 python 脚本通过添加 nop
为主循环生成不同的地址在主循环之前对程序的指令。用于Window,用于Linux只需要稍微修改一下即可。
#cd "D:\Learning\temp"
import os
import time
import datetime as dt
f = open("mulo2.s","r")
lines = [line for line in f]
f.close()
def addNop(cnt, outputname):
f = open(outputname, "w")
for i in range(17):
f.write(lines[i])
for i in range(cnt):
f.write("\tnop\n")
for i in range(17, len(lines)):
f.write(lines[i])
f.close()
if os.path.isdir("nop_files")==False:
os.mkdir("nop_files")
MAXN = 100
for t in range(MAXN+1):
sourceFile = "nop_files\mulo2_" + str(t) + ".s" # change \ to / on Linux
exeFile = "nop_files\mulo2_" + str(t)
if os.path.isfile(sourceFile)==False:
addNop(t, sourceFile)
os.system("gcc " + sourceFile + " -o " + exeFile)
runtime = os.popen("timecmd " + exeFile).read() # use time
print(str(t) + " nop: " + str(runtime))
Result:
0 nop: command took 0:0:4.96 (4.96s total)
1 nop: command took 0:0:4.94 (4.94s total)
2 nop: command took 0:0:4.90 (4.90s total)
3 nop: command took 0:0:4.90 (4.90s total)
4 nop: command took 0:0:5.26 (5.26s total)
5 nop: command took 0:0:4.94 (4.94s total)
6 nop: command took 0:0:4.92 (4.92s total)
7 nop: command took 0:0:4.98 (4.98s total)
8 nop: command took 0:0:5.02 (5.02s total)
9 nop: command took 0:0:4.97 (4.97s total)
10 nop: command took 0:0:5.12 (5.12s total)
11 nop: command took 0:0:5.01 (5.01s total)
12 nop: command took 0:0:5.01 (5.01s total)
13 nop: command took 0:0:5.07 (5.07s total)
14 nop: command took 0:0:5.08 (5.08s total)
15 nop: command took 0:0:5.07 (5.07s total)
16 nop: command took 0:0:5.09 (5.09s total)
17 nop: command took 0:0:7.96 (7.96s total) # slow 17
18 nop: command took 0:0:7.93 (7.93s total)
19 nop: command took 0:0:7.88 (7.88s total)
20 nop: command took 0:0:7.88 (7.88s total)
21 nop: command took 0:0:7.94 (7.94s total)
22 nop: command took 0:0:7.90 (7.90s total)
23 nop: command took 0:0:7.92 (7.92s total)
24 nop: command took 0:0:7.99 (7.99s total)
25 nop: command took 0:0:7.89 (7.89s total)
26 nop: command took 0:0:7.88 (7.88s total)
27 nop: command took 0:0:7.88 (7.88s total)
28 nop: command took 0:0:7.84 (7.84s total)
29 nop: command took 0:0:7.84 (7.84s total)
30 nop: command took 0:0:7.88 (7.88s total)
31 nop: command took 0:0:7.91 (7.91s total)
32 nop: command took 0:0:7.89 (7.89s total)
33 nop: command took 0:0:7.88 (7.88s total)
34 nop: command took 0:0:7.94 (7.94s total)
35 nop: command took 0:0:7.81 (7.81s total)
36 nop: command took 0:0:7.89 (7.89s total)
37 nop: command took 0:0:7.90 (7.90s total)
38 nop: command took 0:0:7.92 (7.92s total)
39 nop: command took 0:0:7.83 (7.83s total)
40 nop: command took 0:0:4.95 (4.95s total) # fast 40
41 nop: command took 0:0:4.91 (4.91s total)
42 nop: command took 0:0:4.97 (4.97s total)
43 nop: command took 0:0:4.97 (4.97s total)
44 nop: command took 0:0:4.97 (4.97s total)
45 nop: command took 0:0:5.11 (5.11s total)
46 nop: command took 0:0:5.13 (5.13s total)
47 nop: command took 0:0:5.01 (5.01s total)
48 nop: command took 0:0:5.01 (5.01s total)
49 nop: command took 0:0:4.97 (4.97s total)
50 nop: command took 0:0:5.03 (5.03s total)
51 nop: command took 0:0:5.32 (5.32s total)
52 nop: command took 0:0:4.95 (4.95s total)
53 nop: command took 0:0:4.97 (4.97s total)
54 nop: command took 0:0:4.94 (4.94s total)
55 nop: command took 0:0:4.99 (4.99s total)
56 nop: command took 0:0:4.99 (4.99s total)
57 nop: command took 0:0:5.04 (5.04s total)
58 nop: command took 0:0:4.97 (4.97s total)
59 nop: command took 0:0:4.97 (4.97s total)
60 nop: command took 0:0:4.95 (4.95s total)
61 nop: command took 0:0:4.99 (4.99s total)
62 nop: command took 0:0:4.94 (4.94s total)
63 nop: command took 0:0:4.94 (4.94s total)
64 nop: command took 0:0:4.92 (4.92s total)
65 nop: command took 0:0:4.91 (4.91s total)
66 nop: command took 0:0:4.98 (4.98s total)
67 nop: command took 0:0:4.93 (4.93s total)
68 nop: command took 0:0:4.95 (4.95s total)
69 nop: command took 0:0:4.92 (4.92s total)
70 nop: command took 0:0:4.93 (4.93s total)
71 nop: command took 0:0:4.97 (4.97s total)
72 nop: command took 0:0:4.93 (4.93s total)
73 nop: command took 0:0:4.94 (4.94s total)
74 nop: command took 0:0:4.96 (4.96s total)
75 nop: command took 0:0:4.91 (4.91s total)
76 nop: command took 0:0:4.92 (4.92s total)
77 nop: command took 0:0:4.91 (4.91s total)
78 nop: command took 0:0:5.03 (5.03s total)
79 nop: command took 0:0:4.96 (4.96s total)
80 nop: command took 0:0:5.20 (5.20s total)
81 nop: command took 0:0:7.93 (7.93s total) # slow 81
82 nop: command took 0:0:7.88 (7.88s total)
83 nop: command took 0:0:7.85 (7.85s total)
84 nop: command took 0:0:7.91 (7.91s total)
85 nop: command took 0:0:7.93 (7.93s total)
86 nop: command took 0:0:8.06 (8.06s total)
87 nop: command took 0:0:8.03 (8.03s total)
88 nop: command took 0:0:7.85 (7.85s total)
89 nop: command took 0:0:7.88 (7.88s total)
90 nop: command took 0:0:7.91 (7.91s total)
91 nop: command took 0:0:7.86 (7.86s total)
92 nop: command took 0:0:7.99 (7.99s total)
93 nop: command took 0:0:7.86 (7.86s total)
94 nop: command took 0:0:7.91 (7.91s total)
95 nop: command took 0:0:8.12 (8.12s total)
96 nop: command took 0:0:7.88 (7.88s total)
97 nop: command took 0:0:7.81 (7.81s total)
98 nop: command took 0:0:7.88 (7.88s total)
99 nop: command took 0:0:7.85 (7.85s total)
100 nop: command took 0:0:7.90 (7.90s total)
101 nop: command took 0:0:7.93 (7.93s total)
102 nop: command took 0:0:7.85 (7.85s total)
103 nop: command took 0:0:7.88 (7.88s total)
104 nop: command took 0:0:5.00 (5.00s total) # fast 104
105 nop: command took 0:0:5.03 (5.03s total)
106 nop: command took 0:0:4.97 (4.97s total)
107 nop: command took 0:0:5.06 (5.06s total)
108 nop: command took 0:0:5.01 (5.01s total)
109 nop: command took 0:0:5.00 (5.00s total)
110 nop: command took 0:0:4.95 (4.95s total)
111 nop: command took 0:0:4.91 (4.91s total)
112 nop: command took 0:0:4.94 (4.94s total)
113 nop: command took 0:0:4.93 (4.93s total)
114 nop: command took 0:0:4.92 (4.92s total)
115 nop: command took 0:0:4.92 (4.92s total)
116 nop: command took 0:0:4.92 (4.92s total)
117 nop: command took 0:0:5.13 (5.13s total)
118 nop: command took 0:0:4.94 (4.94s total)
119 nop: command took 0:0:4.97 (4.97s total)
120 nop: command took 0:0:5.14 (5.14s total)
121 nop: command took 0:0:4.94 (4.94s total)
122 nop: command took 0:0:5.17 (5.17s total)
123 nop: command took 0:0:4.95 (4.95s total)
124 nop: command took 0:0:4.97 (4.97s total)
125 nop: command took 0:0:4.99 (4.99s total)
126 nop: command took 0:0:5.20 (5.20s total)
127 nop: command took 0:0:5.23 (5.23s total)
128 nop: command took 0:0:5.19 (5.19s total)
129 nop: command took 0:0:5.21 (5.21s total)
130 nop: command took 0:0:5.33 (5.33s total)
131 nop: command took 0:0:4.92 (4.92s total)
132 nop: command took 0:0:5.02 (5.02s total)
133 nop: command took 0:0:4.90 (4.90s total)
134 nop: command took 0:0:4.93 (4.93s total)
135 nop: command took 0:0:4.99 (4.99s total)
136 nop: command took 0:0:5.08 (5.08s total)
137 nop: command took 0:0:5.02 (5.02s total)
138 nop: command took 0:0:5.15 (5.15s total)
139 nop: command took 0:0:5.07 (5.07s total)
140 nop: command took 0:0:5.03 (5.03s total)
141 nop: command took 0:0:4.94 (4.94s total)
142 nop: command took 0:0:4.92 (4.92s total)
143 nop: command took 0:0:4.96 (4.96s total)
144 nop: command took 0:0:4.92 (4.92s total)
145 nop: command took 0:0:7.86 (7.86s total) # slow 145
146 nop: command took 0:0:7.87 (7.87s total)
147 nop: command took 0:0:7.83 (7.83s total)
148 nop: command took 0:0:7.83 (7.83s total)
149 nop: command took 0:0:7.84 (7.84s total)
150 nop: command took 0:0:7.87 (7.87s total)
151 nop: command took 0:0:7.84 (7.84s total)
152 nop: command took 0:0:7.88 (7.88s total)
153 nop: command took 0:0:7.87 (7.87s total)
154 nop: command took 0:0:7.83 (7.83s total)
155 nop: command took 0:0:7.85 (7.85s total)
156 nop: command took 0:0:7.91 (7.91s total)
157 nop: command took 0:0:8.18 (8.18s total)
158 nop: command took 0:0:7.94 (7.94s total)
159 nop: command took 0:0:7.92 (7.92s total)
160 nop: command took 0:0:7.92 (7.92s total)
161 nop: command took 0:0:7.97 (7.97s total)
162 nop: command took 0:0:8.12 (8.12s total)
163 nop: command took 0:0:7.89 (7.89s total)
164 nop: command took 0:0:7.92 (7.92s total)
165 nop: command took 0:0:7.88 (7.88s total)
166 nop: command took 0:0:7.80 (7.80s total)
167 nop: command took 0:0:7.82 (7.82s total)
168 nop: command took 0:0:4.97 (4.97s total) # fast
169 nop: command took 0:0:4.97 (4.97s total)
170 nop: command took 0:0:4.95 (4.95s total)
171 nop: command took 0:0:5.00 (5.00s total)
172 nop: command took 0:0:4.95 (4.95s total)
173 nop: command took 0:0:4.93 (4.93s total)
174 nop: command took 0:0:4.91 (4.91s total)
175 nop: command took 0:0:4.92 (4.92s total)
程序由快转慢(再由慢转快)的点为:17S-40F-81S-104F-145S-168F。我们可以看到slow->fast代码的距离是23nop
,fast->slow代码的距离是41nop
。查看objdump,我们可以看到主循环占用了24个字节;这意味着如果我们将它放在缓存行的开头 (address mod 64 == 0
),插入 41 个字节将导致主循环跨越缓存行边界,从而导致速度变慢。所以在默认代码中(没有添加nop
),主循环已经在同一个缓存行中。
所以我们知道-O2
版本慢不是因为指令地址对齐。 唯一的罪魁祸首是指令解码速度 我们发现了一个新的罪魁祸首,比如@Jérôme Richard 的回答。
编辑 4: Skylake 每个周期解码 16 个字节。但是,-Os
和-O2
版本的大小分别是21和24,所以都需要2个周期来读取主循环。那么速度差异从何而来?
结论: 虽然编译器在理论上是正确的(lea + sal
是 2 条超级便宜的指令,并且 lea 内部的寻址是免费的,因为它使用单独的硬件电路),实际上,由于有关 CPU 架构的一些极其复杂的细节,包括指令解码速度、微操作 (uops) 数量和 CPU 端口,因此 1 条昂贵的指令 imul
可能会更快。
大部分主流架构的指令开销都能看到here and there。基于此并假设您使用例如 Intel Skylake 处理器,您可以看到每个周期可以计算一个 32 位 imul
指令,但延迟为 3 个周期。在优化代码中,每个周期可以执行 2 lea
条指令(非常便宜),延迟为 1 个周期。同样的事情适用于 sal
指令(每个周期 2 个和 1 个延迟周期)。
这意味着优化版本仅需 2 个周期的延迟即可执行,而第一个版本需要 3 个周期的延迟(不考虑相同的 load/store 指令)。此外,由于超标量乱序执行,两条指令可以针对两个不同的输入数据并行执行,因此第二个版本可以更好地流水线化。请注意,尽管每个周期只能并行执行一个存储,但也可以并行执行两个加载。这意味着执行受存储指令吞吐量的限制。总的来说,每个周期只能计算 1 个值。 AFAIK,最近的 Intel Icelake 处理器可以像新的 AMD Ryzen 处理器一样并行处理两个存储。第二个预计在所选用例(英特尔 Skylake 处理器)上同样快或可能更快。它在最新的 x86-64 处理器上应该快得多。
请注意,lea
指令非常快,因为乘加是在专用的 CPU 单元(硬连线移位器)上完成的,并且它仅支持某些 特定的constant 用于乘法(支持因子为 1、2、4 和 8,这意味着 lea 可用于将整数乘以常数 2、3、4、5、8 和 9)。这就是 lea
比 imul
/mul
.
快的原因
更新(v2):
我可以使用 GCC 11.2(在 Linux 上使用 i5-9600KF 处理器)重现 -O2
执行速度较慢的情况。
减速的主要来源来自 在 -O2
版本 [=87] 中执行的 micro-operations (uops) 数量更多=]当然结合某些执行端口的饱和肯定是由于不良的微操作调度。
这里是 -Os
:
循环的汇编
1049: 8b 15 d9 2f 00 00 mov edx,DWORD PTR [rip+0x2fd9] # 4028 <a>
104f: 6b d2 24 imul edx,edx,0x24
1052: 89 15 d8 2f 00 00 mov DWORD PTR [rip+0x2fd8],edx # 4030 <res>
1058: 48 ff c8 dec rax
105b: 75 ec jne 1049 <main+0x9>
这里是 -O2
:
循环的汇编
1050: 8b 05 d2 2f 00 00 mov eax,DWORD PTR [rip+0x2fd2] # 4028 <a>
1056: 8d 04 c0 lea eax,[rax+rax*8]
1059: c1 e0 02 shl eax,0x2
105c: 89 05 ce 2f 00 00 mov DWORD PTR [rip+0x2fce],eax # 4030 <res>
1062: 48 83 ea 01 sub rdx,0x1
1066: 75 e8 jne 1050 <main+0x10>
现代 x86-64 处理器,解码(可变大小)指令,然后将它们转换为(更简单的固定大小)微操作 最终在几个 执行端口 上执行(通常并行执行)。有关特定 Skylake 体系结构的更多信息,请参阅 here. Skylake can macro-fuse 多条指令变成一个微操作。在这种情况下,dec
+jne
和 sub
+jne
指令在每种情况下都融合为一个微指令。这意味着 -Os
版本执行 4 uops/iteration 而 -O2
执行 5 uops/iteration.
微指令存储在称为解码流缓冲区 (DSB) 的 uop 缓存 中,这样处理器就不需要 decode/translate 再次执行指令(小)循环。要执行的缓存微指令在称为指令解码队列 (IDQ) 的队列中发送。最多可以从 DSB 向 IDQ 发送 6 uops/cycle。对于 -Os
版本,每个周期只有 4 微指令的 DSB 被发送到 IDQ(可能是因为循环以饱和的存储端口为界)。对于 -O2
版本,DSB 的 5 微指令每个周期仅发送到 IDQ,但 5 次中有 4 次(平均)!这意味着 每 4 个周期增加 1 个延迟周期,导致执行速度减慢 25%。这种影响的原因尚不清楚,似乎与 uops 调度有关。
Uops 然后被发送到资源分配 Table (RAT) 并且 发出 到保留站 (RS)。 RS 调度微指令到执行它们的端口。然后,uops retired(即提交)。从 DSB 间接传输到 RS 的微指令数对于两个版本都是恒定的。相同数量的 uops 被淘汰。但是,在两个版本中,RS 每个周期(并由端口执行)都会分派 1 个 ghost uop。这可能是用于计算商店地址的微指令(因为商店端口没有自己专用的 AGU)。
这是从硬件计数器收集的每次迭代的统计数据(使用 perf
):
version | instruction | issued-uops | executed-uops | retired-uops | cycles
"-Os" | 5 | 4 | 5 | 4 | 1.00
"-O2" | 6 | 5 | 6 | 5 | 1.25
这里是整体端口利用率的统计数据:
port | type | "-Os" | "-O2"
-----------------------------------------
0 | ALU/BR | 0% | 60%
1 | ALU/MUL/LEA | 100% | 38%
2 | LOAD/AGU | 65% | 60%
3 | LOAD/AGU | 73% | 60%
4 | STORE | 100% | 80%
5 | ALU/LEA | 0% | 42%
6 | ALU/BR | 100% | 100%
7 | AGU | 62% | 40%
-----------------------------------------
total | | 500% | 480%
端口 6 仅在 -O2
版本上完全饱和,这是出乎意料的,这当然解释了为什么每 5 个周期需要一个额外的周期。请注意,只有与指令 shl
和 sub+jne
关联的微指令正在(同时)使用端口 0 和 6(没有其他端口)。
请注意,由于停滞周期,总共 480% 是调度工件。实际上,6*4=24
微指令应该每 5 个周期执行一次 (24/5*100=480
)。另请注意,5 个周期中有 1 个不需要存储端口(平均每 5 个周期执行 4 次迭代,因此 4 个存储微指令),因此它的使用率为 80%。
相关:
- What's the purpose of the LEA instruction?
- https://en.wikipedia.org/wiki/Superscalar_processor
tl;dr:因为 LEA 不做完整的乘法。
虽然@JeromeRichard 的回答是正确的,但其最后一句话隐藏了真相的基本内核:使用 LEA,您只能乘以特定常数,即 2 的幂。因此,不需要大型专用电路来进行乘法运算,它只需要一个小型子电路来将其操作数之一移动固定量。
我正在使用 Godbolt 来组装以下程序:
#include <stdio.h>
volatile int a = 5;
volatile int res = 0;
int main() {
res = a * 36;
return 1;
}
如果我用-Os优化,生成的代码自然是:
mov eax, DWORD PTR a[rip]
imul eax, eax, 36
mov DWORD PTR res[rip], eax
但是如果我使用-O2,生成的代码是这样的:
mov eax, DWORD PTR a[rip]
lea eax, [rax+rax*8]
sal eax, 2
mov DWORD PTR res[rip], eax
所以它不是乘以 5*36,而是 5 -> 5+5*8=45 -> 45*4 = 180。我认为这是因为 1 imul 比 1 lea + 1 左移慢。
但是在lea指令中,需要计算rax+rax*8
,其中包含1个加法+1个乘法。那么为什么它仍然比 1 imul 快呢?是因为 lea 内部的内存寻址是免费的吗?
编辑 1: 另外,[rax + rax*8]
是如何翻译成机器码的?它会被编译成额外的 2 条指令 (shl, rbx, rax, 3; add rax, rax, rbx;
) 还是其他指令?
编辑2: 下面的结果令人惊讶。我做了一个循环,然后使用-O2生成代码,然后复制文件并替换上面的代码段来自 -Os。所以 2 个汇编文件在任何地方都是相同的,除了我们进行基准测试的指令。 运行 在 Windows 上,命令是
gcc mul.c -O2 -S -masm=intel -o mulo2.s
gcc mulo2.s -o mulo2
// replace line of code in mulo2.s, save as muls.s
gcc muls.s -o muls
cmd /v:on /c "echo !time! & START "TestAgente" /W mulo2 & echo !time!"
cmd /v:on /c "echo !time! & START "TestAgente" /W muls & echo !time!"
#include <stdio.h>
volatile int a = 5;
volatile int res = 0;
int main() {
size_t LOOP = 1000 * 1000 * 1000;
LOOP = LOOP * 10;
size_t i = 0;
while (i < LOOP) {
i++;
res = a * 36;
}
return 0;
}
; mulo2.s
.file "mul.c"
.intel_syntax noprefix
.text
.def __main; .scl 2; .type 32; .endef
.section .text.startup,"x"
.p2align 4
.globl main
.def main; .scl 2; .type 32; .endef
.seh_proc main
main:
sub rsp, 40
.seh_stackalloc 40
.seh_endprologue
call __main
movabs rdx, 10000000000
.p2align 4,,10
.p2align 3
.L2:
mov eax, DWORD PTR a[rip]
lea eax, [rax+rax*8] ; replaces these 2 lines with
sal eax, 2 ; imul eax, eax, 36
mov DWORD PTR res[rip], eax
sub rdx, 1
jne .L2
xor eax, eax
add rsp, 40
ret
.seh_endproc
.globl res
.bss
.align 4
res:
.space 4
.globl a
.data
.align 4
a:
.long 5
.ident "GCC: (GNU) 9.3.0"
令人惊讶的是,结果是 -Os
版本 始终 比 -O2
快(平均 4.1s vs 5s,Intel 8750H CPU,每个.exe文件都是运行几次)。所以在这种情况下,编译器优化错误。有人可以根据这个基准提供新的解释吗?
编辑 3: 为了测量指令缓存行的效果,这里有一个 python 脚本通过添加 nop
为主循环生成不同的地址在主循环之前对程序的指令。用于Window,用于Linux只需要稍微修改一下即可。
#cd "D:\Learning\temp"
import os
import time
import datetime as dt
f = open("mulo2.s","r")
lines = [line for line in f]
f.close()
def addNop(cnt, outputname):
f = open(outputname, "w")
for i in range(17):
f.write(lines[i])
for i in range(cnt):
f.write("\tnop\n")
for i in range(17, len(lines)):
f.write(lines[i])
f.close()
if os.path.isdir("nop_files")==False:
os.mkdir("nop_files")
MAXN = 100
for t in range(MAXN+1):
sourceFile = "nop_files\mulo2_" + str(t) + ".s" # change \ to / on Linux
exeFile = "nop_files\mulo2_" + str(t)
if os.path.isfile(sourceFile)==False:
addNop(t, sourceFile)
os.system("gcc " + sourceFile + " -o " + exeFile)
runtime = os.popen("timecmd " + exeFile).read() # use time
print(str(t) + " nop: " + str(runtime))
Result:
0 nop: command took 0:0:4.96 (4.96s total)
1 nop: command took 0:0:4.94 (4.94s total)
2 nop: command took 0:0:4.90 (4.90s total)
3 nop: command took 0:0:4.90 (4.90s total)
4 nop: command took 0:0:5.26 (5.26s total)
5 nop: command took 0:0:4.94 (4.94s total)
6 nop: command took 0:0:4.92 (4.92s total)
7 nop: command took 0:0:4.98 (4.98s total)
8 nop: command took 0:0:5.02 (5.02s total)
9 nop: command took 0:0:4.97 (4.97s total)
10 nop: command took 0:0:5.12 (5.12s total)
11 nop: command took 0:0:5.01 (5.01s total)
12 nop: command took 0:0:5.01 (5.01s total)
13 nop: command took 0:0:5.07 (5.07s total)
14 nop: command took 0:0:5.08 (5.08s total)
15 nop: command took 0:0:5.07 (5.07s total)
16 nop: command took 0:0:5.09 (5.09s total)
17 nop: command took 0:0:7.96 (7.96s total) # slow 17
18 nop: command took 0:0:7.93 (7.93s total)
19 nop: command took 0:0:7.88 (7.88s total)
20 nop: command took 0:0:7.88 (7.88s total)
21 nop: command took 0:0:7.94 (7.94s total)
22 nop: command took 0:0:7.90 (7.90s total)
23 nop: command took 0:0:7.92 (7.92s total)
24 nop: command took 0:0:7.99 (7.99s total)
25 nop: command took 0:0:7.89 (7.89s total)
26 nop: command took 0:0:7.88 (7.88s total)
27 nop: command took 0:0:7.88 (7.88s total)
28 nop: command took 0:0:7.84 (7.84s total)
29 nop: command took 0:0:7.84 (7.84s total)
30 nop: command took 0:0:7.88 (7.88s total)
31 nop: command took 0:0:7.91 (7.91s total)
32 nop: command took 0:0:7.89 (7.89s total)
33 nop: command took 0:0:7.88 (7.88s total)
34 nop: command took 0:0:7.94 (7.94s total)
35 nop: command took 0:0:7.81 (7.81s total)
36 nop: command took 0:0:7.89 (7.89s total)
37 nop: command took 0:0:7.90 (7.90s total)
38 nop: command took 0:0:7.92 (7.92s total)
39 nop: command took 0:0:7.83 (7.83s total)
40 nop: command took 0:0:4.95 (4.95s total) # fast 40
41 nop: command took 0:0:4.91 (4.91s total)
42 nop: command took 0:0:4.97 (4.97s total)
43 nop: command took 0:0:4.97 (4.97s total)
44 nop: command took 0:0:4.97 (4.97s total)
45 nop: command took 0:0:5.11 (5.11s total)
46 nop: command took 0:0:5.13 (5.13s total)
47 nop: command took 0:0:5.01 (5.01s total)
48 nop: command took 0:0:5.01 (5.01s total)
49 nop: command took 0:0:4.97 (4.97s total)
50 nop: command took 0:0:5.03 (5.03s total)
51 nop: command took 0:0:5.32 (5.32s total)
52 nop: command took 0:0:4.95 (4.95s total)
53 nop: command took 0:0:4.97 (4.97s total)
54 nop: command took 0:0:4.94 (4.94s total)
55 nop: command took 0:0:4.99 (4.99s total)
56 nop: command took 0:0:4.99 (4.99s total)
57 nop: command took 0:0:5.04 (5.04s total)
58 nop: command took 0:0:4.97 (4.97s total)
59 nop: command took 0:0:4.97 (4.97s total)
60 nop: command took 0:0:4.95 (4.95s total)
61 nop: command took 0:0:4.99 (4.99s total)
62 nop: command took 0:0:4.94 (4.94s total)
63 nop: command took 0:0:4.94 (4.94s total)
64 nop: command took 0:0:4.92 (4.92s total)
65 nop: command took 0:0:4.91 (4.91s total)
66 nop: command took 0:0:4.98 (4.98s total)
67 nop: command took 0:0:4.93 (4.93s total)
68 nop: command took 0:0:4.95 (4.95s total)
69 nop: command took 0:0:4.92 (4.92s total)
70 nop: command took 0:0:4.93 (4.93s total)
71 nop: command took 0:0:4.97 (4.97s total)
72 nop: command took 0:0:4.93 (4.93s total)
73 nop: command took 0:0:4.94 (4.94s total)
74 nop: command took 0:0:4.96 (4.96s total)
75 nop: command took 0:0:4.91 (4.91s total)
76 nop: command took 0:0:4.92 (4.92s total)
77 nop: command took 0:0:4.91 (4.91s total)
78 nop: command took 0:0:5.03 (5.03s total)
79 nop: command took 0:0:4.96 (4.96s total)
80 nop: command took 0:0:5.20 (5.20s total)
81 nop: command took 0:0:7.93 (7.93s total) # slow 81
82 nop: command took 0:0:7.88 (7.88s total)
83 nop: command took 0:0:7.85 (7.85s total)
84 nop: command took 0:0:7.91 (7.91s total)
85 nop: command took 0:0:7.93 (7.93s total)
86 nop: command took 0:0:8.06 (8.06s total)
87 nop: command took 0:0:8.03 (8.03s total)
88 nop: command took 0:0:7.85 (7.85s total)
89 nop: command took 0:0:7.88 (7.88s total)
90 nop: command took 0:0:7.91 (7.91s total)
91 nop: command took 0:0:7.86 (7.86s total)
92 nop: command took 0:0:7.99 (7.99s total)
93 nop: command took 0:0:7.86 (7.86s total)
94 nop: command took 0:0:7.91 (7.91s total)
95 nop: command took 0:0:8.12 (8.12s total)
96 nop: command took 0:0:7.88 (7.88s total)
97 nop: command took 0:0:7.81 (7.81s total)
98 nop: command took 0:0:7.88 (7.88s total)
99 nop: command took 0:0:7.85 (7.85s total)
100 nop: command took 0:0:7.90 (7.90s total)
101 nop: command took 0:0:7.93 (7.93s total)
102 nop: command took 0:0:7.85 (7.85s total)
103 nop: command took 0:0:7.88 (7.88s total)
104 nop: command took 0:0:5.00 (5.00s total) # fast 104
105 nop: command took 0:0:5.03 (5.03s total)
106 nop: command took 0:0:4.97 (4.97s total)
107 nop: command took 0:0:5.06 (5.06s total)
108 nop: command took 0:0:5.01 (5.01s total)
109 nop: command took 0:0:5.00 (5.00s total)
110 nop: command took 0:0:4.95 (4.95s total)
111 nop: command took 0:0:4.91 (4.91s total)
112 nop: command took 0:0:4.94 (4.94s total)
113 nop: command took 0:0:4.93 (4.93s total)
114 nop: command took 0:0:4.92 (4.92s total)
115 nop: command took 0:0:4.92 (4.92s total)
116 nop: command took 0:0:4.92 (4.92s total)
117 nop: command took 0:0:5.13 (5.13s total)
118 nop: command took 0:0:4.94 (4.94s total)
119 nop: command took 0:0:4.97 (4.97s total)
120 nop: command took 0:0:5.14 (5.14s total)
121 nop: command took 0:0:4.94 (4.94s total)
122 nop: command took 0:0:5.17 (5.17s total)
123 nop: command took 0:0:4.95 (4.95s total)
124 nop: command took 0:0:4.97 (4.97s total)
125 nop: command took 0:0:4.99 (4.99s total)
126 nop: command took 0:0:5.20 (5.20s total)
127 nop: command took 0:0:5.23 (5.23s total)
128 nop: command took 0:0:5.19 (5.19s total)
129 nop: command took 0:0:5.21 (5.21s total)
130 nop: command took 0:0:5.33 (5.33s total)
131 nop: command took 0:0:4.92 (4.92s total)
132 nop: command took 0:0:5.02 (5.02s total)
133 nop: command took 0:0:4.90 (4.90s total)
134 nop: command took 0:0:4.93 (4.93s total)
135 nop: command took 0:0:4.99 (4.99s total)
136 nop: command took 0:0:5.08 (5.08s total)
137 nop: command took 0:0:5.02 (5.02s total)
138 nop: command took 0:0:5.15 (5.15s total)
139 nop: command took 0:0:5.07 (5.07s total)
140 nop: command took 0:0:5.03 (5.03s total)
141 nop: command took 0:0:4.94 (4.94s total)
142 nop: command took 0:0:4.92 (4.92s total)
143 nop: command took 0:0:4.96 (4.96s total)
144 nop: command took 0:0:4.92 (4.92s total)
145 nop: command took 0:0:7.86 (7.86s total) # slow 145
146 nop: command took 0:0:7.87 (7.87s total)
147 nop: command took 0:0:7.83 (7.83s total)
148 nop: command took 0:0:7.83 (7.83s total)
149 nop: command took 0:0:7.84 (7.84s total)
150 nop: command took 0:0:7.87 (7.87s total)
151 nop: command took 0:0:7.84 (7.84s total)
152 nop: command took 0:0:7.88 (7.88s total)
153 nop: command took 0:0:7.87 (7.87s total)
154 nop: command took 0:0:7.83 (7.83s total)
155 nop: command took 0:0:7.85 (7.85s total)
156 nop: command took 0:0:7.91 (7.91s total)
157 nop: command took 0:0:8.18 (8.18s total)
158 nop: command took 0:0:7.94 (7.94s total)
159 nop: command took 0:0:7.92 (7.92s total)
160 nop: command took 0:0:7.92 (7.92s total)
161 nop: command took 0:0:7.97 (7.97s total)
162 nop: command took 0:0:8.12 (8.12s total)
163 nop: command took 0:0:7.89 (7.89s total)
164 nop: command took 0:0:7.92 (7.92s total)
165 nop: command took 0:0:7.88 (7.88s total)
166 nop: command took 0:0:7.80 (7.80s total)
167 nop: command took 0:0:7.82 (7.82s total)
168 nop: command took 0:0:4.97 (4.97s total) # fast
169 nop: command took 0:0:4.97 (4.97s total)
170 nop: command took 0:0:4.95 (4.95s total)
171 nop: command took 0:0:5.00 (5.00s total)
172 nop: command took 0:0:4.95 (4.95s total)
173 nop: command took 0:0:4.93 (4.93s total)
174 nop: command took 0:0:4.91 (4.91s total)
175 nop: command took 0:0:4.92 (4.92s total)
程序由快转慢(再由慢转快)的点为:17S-40F-81S-104F-145S-168F。我们可以看到slow->fast代码的距离是23nop
,fast->slow代码的距离是41nop
。查看objdump,我们可以看到主循环占用了24个字节;这意味着如果我们将它放在缓存行的开头 (address mod 64 == 0
),插入 41 个字节将导致主循环跨越缓存行边界,从而导致速度变慢。所以在默认代码中(没有添加nop
),主循环已经在同一个缓存行中。
所以我们知道-O2
版本慢不是因为指令地址对齐。 唯一的罪魁祸首是指令解码速度 我们发现了一个新的罪魁祸首,比如@Jérôme Richard 的回答。
编辑 4: Skylake 每个周期解码 16 个字节。但是,-Os
和-O2
版本的大小分别是21和24,所以都需要2个周期来读取主循环。那么速度差异从何而来?
结论: 虽然编译器在理论上是正确的(lea + sal
是 2 条超级便宜的指令,并且 lea 内部的寻址是免费的,因为它使用单独的硬件电路),实际上,由于有关 CPU 架构的一些极其复杂的细节,包括指令解码速度、微操作 (uops) 数量和 CPU 端口,因此 1 条昂贵的指令 imul
可能会更快。
大部分主流架构的指令开销都能看到here and there。基于此并假设您使用例如 Intel Skylake 处理器,您可以看到每个周期可以计算一个 32 位 imul
指令,但延迟为 3 个周期。在优化代码中,每个周期可以执行 2 lea
条指令(非常便宜),延迟为 1 个周期。同样的事情适用于 sal
指令(每个周期 2 个和 1 个延迟周期)。
这意味着优化版本仅需 2 个周期的延迟即可执行,而第一个版本需要 3 个周期的延迟(不考虑相同的 load/store 指令)。此外,由于超标量乱序执行,两条指令可以针对两个不同的输入数据并行执行,因此第二个版本可以更好地流水线化。请注意,尽管每个周期只能并行执行一个存储,但也可以并行执行两个加载。这意味着执行受存储指令吞吐量的限制。总的来说,每个周期只能计算 1 个值。 AFAIK,最近的 Intel Icelake 处理器可以像新的 AMD Ryzen 处理器一样并行处理两个存储。第二个预计在所选用例(英特尔 Skylake 处理器)上同样快或可能更快。它在最新的 x86-64 处理器上应该快得多。
请注意,lea
指令非常快,因为乘加是在专用的 CPU 单元(硬连线移位器)上完成的,并且它仅支持某些 特定的constant 用于乘法(支持因子为 1、2、4 和 8,这意味着 lea 可用于将整数乘以常数 2、3、4、5、8 和 9)。这就是 lea
比 imul
/mul
.
更新(v2):
我可以使用 GCC 11.2(在 Linux 上使用 i5-9600KF 处理器)重现 -O2
执行速度较慢的情况。
减速的主要来源来自 在 -O2
版本 [=87] 中执行的 micro-operations (uops) 数量更多=]当然结合某些执行端口的饱和肯定是由于不良的微操作调度。
这里是 -Os
:
1049: 8b 15 d9 2f 00 00 mov edx,DWORD PTR [rip+0x2fd9] # 4028 <a>
104f: 6b d2 24 imul edx,edx,0x24
1052: 89 15 d8 2f 00 00 mov DWORD PTR [rip+0x2fd8],edx # 4030 <res>
1058: 48 ff c8 dec rax
105b: 75 ec jne 1049 <main+0x9>
这里是 -O2
:
1050: 8b 05 d2 2f 00 00 mov eax,DWORD PTR [rip+0x2fd2] # 4028 <a>
1056: 8d 04 c0 lea eax,[rax+rax*8]
1059: c1 e0 02 shl eax,0x2
105c: 89 05 ce 2f 00 00 mov DWORD PTR [rip+0x2fce],eax # 4030 <res>
1062: 48 83 ea 01 sub rdx,0x1
1066: 75 e8 jne 1050 <main+0x10>
现代 x86-64 处理器,解码(可变大小)指令,然后将它们转换为(更简单的固定大小)微操作 最终在几个 执行端口 上执行(通常并行执行)。有关特定 Skylake 体系结构的更多信息,请参阅 here. Skylake can macro-fuse 多条指令变成一个微操作。在这种情况下,dec
+jne
和 sub
+jne
指令在每种情况下都融合为一个微指令。这意味着 -Os
版本执行 4 uops/iteration 而 -O2
执行 5 uops/iteration.
微指令存储在称为解码流缓冲区 (DSB) 的 uop 缓存 中,这样处理器就不需要 decode/translate 再次执行指令(小)循环。要执行的缓存微指令在称为指令解码队列 (IDQ) 的队列中发送。最多可以从 DSB 向 IDQ 发送 6 uops/cycle。对于 -Os
版本,每个周期只有 4 微指令的 DSB 被发送到 IDQ(可能是因为循环以饱和的存储端口为界)。对于 -O2
版本,DSB 的 5 微指令每个周期仅发送到 IDQ,但 5 次中有 4 次(平均)!这意味着 每 4 个周期增加 1 个延迟周期,导致执行速度减慢 25%。这种影响的原因尚不清楚,似乎与 uops 调度有关。
Uops 然后被发送到资源分配 Table (RAT) 并且 发出 到保留站 (RS)。 RS 调度微指令到执行它们的端口。然后,uops retired(即提交)。从 DSB 间接传输到 RS 的微指令数对于两个版本都是恒定的。相同数量的 uops 被淘汰。但是,在两个版本中,RS 每个周期(并由端口执行)都会分派 1 个 ghost uop。这可能是用于计算商店地址的微指令(因为商店端口没有自己专用的 AGU)。
这是从硬件计数器收集的每次迭代的统计数据(使用 perf
):
version | instruction | issued-uops | executed-uops | retired-uops | cycles
"-Os" | 5 | 4 | 5 | 4 | 1.00
"-O2" | 6 | 5 | 6 | 5 | 1.25
这里是整体端口利用率的统计数据:
port | type | "-Os" | "-O2"
-----------------------------------------
0 | ALU/BR | 0% | 60%
1 | ALU/MUL/LEA | 100% | 38%
2 | LOAD/AGU | 65% | 60%
3 | LOAD/AGU | 73% | 60%
4 | STORE | 100% | 80%
5 | ALU/LEA | 0% | 42%
6 | ALU/BR | 100% | 100%
7 | AGU | 62% | 40%
-----------------------------------------
total | | 500% | 480%
端口 6 仅在 -O2
版本上完全饱和,这是出乎意料的,这当然解释了为什么每 5 个周期需要一个额外的周期。请注意,只有与指令 shl
和 sub+jne
关联的微指令正在(同时)使用端口 0 和 6(没有其他端口)。
请注意,由于停滞周期,总共 480% 是调度工件。实际上,6*4=24
微指令应该每 5 个周期执行一次 (24/5*100=480
)。另请注意,5 个周期中有 1 个不需要存储端口(平均每 5 个周期执行 4 次迭代,因此 4 个存储微指令),因此它的使用率为 80%。
相关:
- What's the purpose of the LEA instruction?
- https://en.wikipedia.org/wiki/Superscalar_processor
tl;dr:因为 LEA 不做完整的乘法。
虽然@JeromeRichard 的回答是正确的,但其最后一句话隐藏了真相的基本内核:使用 LEA,您只能乘以特定常数,即 2 的幂。因此,不需要大型专用电路来进行乘法运算,它只需要一个小型子电路来将其操作数之一移动固定量。