Using IACA with non-assembly routine
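I've been playing around with IACA (Intel's static code analyzer).
It works fine when I test it on assembly snippets where I can insert the magic marker bytes by hand, like so: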
procedure TSlice.BitSwap(a, b: integer);
asm
  //RCX = self
  //edx = a
  //r8d = b
  mov ebx, 111        // Start IACA marker bytes
  db $64, $67, $90    // Start IACA marker bytes
  xor eax, eax
  xor r10d, r10d
  mov r9d, [rcx]      // read the value
  mov ecx, edx        // need a in cl for the shift
  btr r9d, edx        // read and clear the a bit
  setc al             // convert cf to bit
  shl eax, cl         // shift bit to ecx position
  btr r9d, r8d        // read and clear the b bit
  mov ecx, r8d        // need b in ecx for shift
  setc r10b           // convert cf to bit
  shl r10d, cl        // shift bit to edx position
  or r9d, eax         // copy in old edx bit
  or r9d, r10d        // copy in old ecx bit
  mov [r8], r9d       // store result
  ret
  mov ebx, 222        // End IACA marker bytes
  db $64, $67, $90    // End IACA marker bytes
end;
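Is there a way to prefix/suffix non-assembly code with the required magic markers, so that I can analyze compiler-generated code?
I know I can copy-paste the generated assembly from the CPU view and build a routine from it, but I was hoping for a simpler workflow.
EDIT
I'm looking for a solution that works with the 64-bit compiler. I know I can mix assembly and normal code in the 32-bit compiler.
UPDATE
@Dsm's suggestion works.
@Rudy's trick does not.
Dummy code wrapped this way works; IACA reports: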
Throughput Analysis Report
--------------------------
Block Throughput: 13.33 Cycles Throughput Bottleneck: Dependency chains (possibly between iterations)
Port Binding In Cycles Per Iteration:
---------------------------------------------------------------------------------------
| Port | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 |
---------------------------------------------------------------------------------------
| Cycles | 1.3 0.0 | 1.4 | 1.0 1.0 | 1.0 1.0 | 0.0 | 1.4 | 2.0 | 0.0 |
---------------------------------------------------------------------------------------
N - port number or number of cycles resource conflict caused delay, DV - Divider pipe (on port 0)
D - Data fetch pipe (on ports 2 and 3), CP - on a critical path
F - Macro Fusion with the previous instruction occurred
* - instruction micro-ops not bound to a port
^ - Micro Fusion happened
# - ESP Tracking sync uop was issued
@ - SSE instruction followed an AVX256/AVX512 instruction, dozens of cycles penalty is expected
X - instruction not supported, was not accounted in Analysis
| Num Of | Ports pressure in cycles | |
| Uops | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 | |
---------------------------------------------------------------------------------
| 3^ | 0.3 | 0.3 | 1.0 1.0 | | | 0.3 | 1.0 | | CP | ret
| X | | | | | | | | | | int3
[... more int3's]
| X | | | | | | | | | | int3
| 1 | 1.0 | | | | | | | | | shl eax, 0x10
| 1 | | 0.6 | | | | 0.3 | | | | cmp eax, 0x64
| 3^ | | 0.3 | | 1.0 1.0 | | 0.6 | 1.0 | | CP | ret
| X | | | | | | | | | | int3
| X | | | | | | | | | | int3
[...]
Total Num Of Uops: 8
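UPDATE 2
If there is a call statement in the marked code, IACA seems to bomb and refuses to analyze it, complaining about an illegal instruction. The basic idea works, though. Obviously you need to subtract the initial ret and its associated cost.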
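The subtraction of the initial ret can be done mechanically. Here is a minimal Python sketch, assuming the per-instruction rows look like the ones in the report above (uop count in the first column, instruction text after the last `|`). The helper names `parse_rows` and `uops_without_leading_ret` are made up for illustration and are not part of IACA.

```python
import re

# Rows copied from the report above; "X" rows are instructions IACA did not account for.
SAMPLE = """\
| Uops | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 | |
| 3^ | 0.3 | 0.3 | 1.0 1.0 | | | 0.3 | 1.0 | | CP | ret
| X | | | | | | | | | | int3
| 1 | 1.0 | | | | | | | | | shl eax, 0x10
| 1 | | 0.6 | | | | 0.3 | | | | cmp eax, 0x64
| 3^ | | 0.3 | | 1.0 1.0 | | 0.6 | 1.0 | | CP | ret
| X | | | | | | | | | | int3
"""


def parse_rows(report):
    """Return (uops, instruction) pairs for rows whose first column starts with a number."""
    rows = []
    for line in report.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue
        cells = line.split("|")
        if len(cells) < 3:
            continue
        count = re.match(r"\d+", cells[1].strip())
        if count is None:
            continue  # header row, or an "X" (unsupported) row
        rows.append((int(count.group()), cells[-1].strip()))
    return rows


def uops_without_leading_ret(report):
    """Sum the uops, dropping a leading ret that belongs to the start-marker routine."""
    rows = parse_rows(report)
    if rows and rows[0][1] == "ret":
        rows = rows[1:]
    return sum(uops for uops, _ in rows)


print(uops_without_leading_ret(SAMPLE))
```

On the rows above this drops the first ret (3 uops) from the reported total of 8, leaving 5. The port-pressure and throughput figures would need the same kind of adjustment by hand.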
I haven't used IACA, so I can't test this idea, and I'll delete the answer if it doesn't work, but could you do something like this:
procedure TForm10.Button1Click(Sender: TObject);
begin
  asm
    //RCX = self
    //edx = a
    //r8d = b
    mov ebx, 111        // Start IACA marker bytes
    db $64, $67, $90    // Start IACA marker bytes
  end;
  fRotate( fLine - Point(0,1), 23 );
  asm
    mov ebx, 222        // End IACA marker bytes
    db $64, $67, $90    // End IACA marker bytes
  end;
end;
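This is just a sample routine taken from something else, used to check that it compiles, which it does.
Sadly, this only works in 32-bit; as Johan pointed out, it does not work in 64-bit.
For 64-bit the following might work, but again I can't test it.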
procedure TForm10.Button1Click(Sender: TObject);
  procedure Test1;
  asm
    //RCX = self
    //edx = a
    //r8d = b
    mov ebx, 111        // Start IACA marker bytes
    db $64, $67, $90    // Start IACA marker bytes
  end;
  procedure Test2;
  begin
    fRotate( fLine - Point(0,1), 23 );
  end;
  procedure Test3;
  asm
    mov ebx, 222        // End IACA marker bytes
    db $64, $67, $90    // End IACA marker bytes
  end;
begin
  Test1;
  Test2;
  Test3;
end;
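With the markers living in separate routines, it can help to check where they actually ended up in the compiled binary, since this approach relies on the three nested routines being laid out one after another in that order. Here is a minimal Python sketch, assuming the usual marker encoding: `mov ebx, imm32` assembles to `BB` plus a little-endian 32-bit immediate, followed by the bytes `64 67 90`. `find_marked_region` is a hypothetical helper, not part of IACA.

```python
import struct

MARKER_TAIL = bytes([0x64, 0x67, 0x90])


def marker(imm):
    """Encode 'mov ebx, imm' (opcode 0xBB + imm32) followed by the marker tail bytes."""
    return b"\xBB" + struct.pack("<I", imm) + MARKER_TAIL


START = marker(111)  # BB 6F 00 00 00 64 67 90
END = marker(222)    # BB DE 00 00 00 64 67 90


def find_marked_region(image):
    """Return (offset, code) for the bytes between the first start marker
    and the next end marker, or None if either marker is missing."""
    start = image.find(START)
    if start < 0:
        return None
    body = start + len(START)
    end = image.find(END, body)
    if end < 0:
        return None
    return body, image[body:end]


# Synthetic example: ret, int3 padding, shl eax, 0x10, ret between the markers.
demo = b"\x90" + START + b"\xC3\xCC\xCC" + b"\xC1\xE0\x10" + b"\xC3" + END
offset, code = find_marked_region(demo)
print(offset, code.hex())
```

To inspect a real build, pass the contents of the executable instead of `demo`, for example `find_marked_region(open(path, "rb").read())` with `path` pointing at your own binary. A region that begins with `C3` (ret) followed by `CC` (int3) padding matches the layout seen in the report in the question.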