Intel Broadwell uop fusion for AVX load/store instructions
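I am trying to establish a performance baseline for memory-bound vectorized loops. I am doing this on an Intel Broadwell chip using AVX2 instructions with 32-byte-aligned data.
The baseline loop uses 8 YMM registers at a time to load from one location and non-temporally store to another: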
%define ptr
%define ymmword yword
%define SIZE 16777216*8 ;; array size >> LLC
align 32 ;; avx2 vector alignment
global _ls_01_opt
section .text
_ls_01_opt: ;rdi is input, rsi output
push rbp
mov rbp,rsp
xor rax,rax
mov ebx, 111 ; IACA PREFIX
db 0x64, 0x67, 0x90 ;
LOOP0:
vmovapd ymm0, ymmword ptr [ (32) + rdi +8*rax]
vmovapd ymm2, ymmword ptr [ (64) + rdi +8*rax]
vmovapd ymm4, ymmword ptr [ (96) + rdi +8*rax]
vmovapd ymm6, ymmword ptr [ (128) + rdi +8*rax]
vmovapd ymm8, ymmword ptr [ (160) + rdi +8*rax]
vmovapd ymm10, ymmword ptr [ (192) + rdi +8*rax]
vmovapd ymm12, ymmword ptr [ (224) + rdi +8*rax]
vmovapd ymm14, ymmword ptr [ (256) + rdi +8*rax]
vmovntpd ymmword ptr [ (32) + rsi +8*rax], ymm0
vmovntpd ymmword ptr [ (64) + rsi +8*rax], ymm2
vmovntpd ymmword ptr [ (96) + rsi +8*rax], ymm4
vmovntpd ymmword ptr [ (128) + rsi +8*rax], ymm6
vmovntpd ymmword ptr [ (160) + rsi +8*rax], ymm8
vmovntpd ymmword ptr [ (192) + rsi +8*rax], ymm10
vmovntpd ymmword ptr [ (224) + rsi +8*rax], ymm12
vmovntpd ymmword ptr [ (256) + rsi +8*rax], ymm14
add rax, (4*8)
cmp rax, SIZE
jne LOOP0
mov ebx, 222 ; IACA SUFFIX
db 0x64, 0x67, 0x90 ;
pop rbp ; restore the frame pointer pushed on entry
ret
I assemble it with YASM and then analyze it with the Intel Architecture Code Analyzer (IACA), which tells me:
Throughput Analysis Report
--------------------------
Block Throughput: 8.00 Cycles Throughput Bottleneck: PORT2_AGU, PORT3_AGU, Port4
Port Binding In Cycles Per Iteration:
---------------------------------------------------------------------------------------
| Port | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 |
---------------------------------------------------------------------------------------
| Cycles | 0.5 0.0 | 0.5 | 8.0 4.0 | 8.0 4.0 | 8.0 | 0.5 | 0.5 | 0.0 |
---------------------------------------------------------------------------------------
N - port number or number of cycles resource conflict caused delay, DV - Divider pipe (on port 0)
D - Data fetch pipe (on ports 2 and 3), CP - on a critical path
F - Macro Fusion with the previous instruction occurred
* - instruction micro-ops not bound to a port
^ - Micro Fusion happened
# - ESP Tracking sync uop was issued
@ - SSE instruction followed an AVX256 instruction, dozens of cycles penalty is expected
! - instruction not supported, was not accounted in Analysis
| Num Of | Ports pressure in cycles | |
| Uops | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 | |
---------------------------------------------------------------------------------
| 1 | | | 1.0 1.0 | | | | | | CP | vmovapd ymm0, ymmword ptr [rdi+rax*8+0x20]
| 1 | | | | 1.0 1.0 | | | | | CP | vmovapd ymm2, ymmword ptr [rdi+rax*8+0x40]
| 1 | | | 1.0 1.0 | | | | | | CP | vmovapd ymm4, ymmword ptr [rdi+rax*8+0x60]
| 1 | | | | 1.0 1.0 | | | | | CP | vmovapd ymm6, ymmword ptr [rdi+rax*8+0x80]
| 1 | | | 1.0 1.0 | | | | | | CP | vmovapd ymm8, ymmword ptr [rdi+rax*8+0xa0]
| 1 | | | | 1.0 1.0 | | | | | CP | vmovapd ymm10, ymmword ptr [rdi+rax*8+0xc0]
| 1 | | | 1.0 1.0 | | | | | | CP | vmovapd ymm12, ymmword ptr [rdi+rax*8+0xe0]
| 1 | | | | 1.0 1.0 | | | | | CP | vmovapd ymm14, ymmword ptr [rdi+rax*8+0x100]
| 2 | | | 1.0 | | 1.0 | | | | CP | vmovntpd ymmword ptr [rsi+rax*8+0x20], ymm0
| 2 | | | | 1.0 | 1.0 | | | | CP | vmovntpd ymmword ptr [rsi+rax*8+0x40], ymm2
| 2 | | | 1.0 | | 1.0 | | | | CP | vmovntpd ymmword ptr [rsi+rax*8+0x60], ymm4
| 2 | | | | 1.0 | 1.0 | | | | CP | vmovntpd ymmword ptr [rsi+rax*8+0x80], ymm6
| 2 | | | 1.0 | | 1.0 | | | | CP | vmovntpd ymmword ptr [rsi+rax*8+0xa0], ymm8
| 2 | | | | 1.0 | 1.0 | | | | CP | vmovntpd ymmword ptr [rsi+rax*8+0xc0], ymm10
| 2 | | | 1.0 | | 1.0 | | | | CP | vmovntpd ymmword ptr [rsi+rax*8+0xe0], ymm12
| 2 | | | | 1.0 | 1.0 | | | | CP | vmovntpd ymmword ptr [rsi+rax*8+0x100], ymm14
| 1 | | 0.5 | | | | 0.5 | | | | add rax, 0x20
| 1 | 0.5 | | | | | | 0.5 | | | cmp rax, 0x8000000
| 0F | | | | | | | | | | jnz 0xffffffffffffff78
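I was under the impression that Broadwell could sustain 2 loads per cycle, issuing on ports 2 and 3 simultaneously. Why is that not happening here?
Thanks
Update
Following the suggestions below, pd was replaced with ps and each address was folded into a single register. The new code is: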
%define ptr
%define ymmword yword
%define SIZE 16777216*8 ;; array size >> LLC
align 32 ;; avx2 vector alignment
global _ls_01_opt
section .text
_ls_01_opt: ;rdi is input, rsi output
push rbp
mov rbp,rsp
xor rax,rax
xor rbx,rbx
xor rcx,rcx
or rbx, rdi
or rcx, rsi
mov ebx, 111 ; IACA PREFIX
db 0x64, 0x67, 0x90 ;
LOOP0:
vmovaps ymm0, ymmword ptr [ (32) + rbx ]
vmovaps ymm2, ymmword ptr [ (64) + rbx ]
vmovaps ymm4, ymmword ptr [ (96) + rbx ]
vmovaps ymm6, ymmword ptr [ (128) + rbx ]
vmovaps ymm8, ymmword ptr [ (160) + rbx ]
vmovaps ymm10, ymmword ptr [ (192) + rbx ]
vmovaps ymm12, ymmword ptr [ (224) + rbx ]
vmovaps ymm14, ymmword ptr [ (256) + rbx ]
vmovntps ymmword ptr [ (32) + rcx], ymm0
vmovntps ymmword ptr [ (64) + rcx], ymm2
vmovntps ymmword ptr [ (96) + rcx], ymm4
vmovntps ymmword ptr [ (128) + rcx], ymm6
vmovntps ymmword ptr [ (160) + rcx], ymm8
vmovntps ymmword ptr [ (192) + rcx], ymm10
vmovntps ymmword ptr [ (224) + rcx], ymm12
vmovntps ymmword ptr [ (256) + rcx], ymm14
add rax, (4*8)
add rbx, (4*8*8)
add rcx, (4*8*8)
cmp rax, SIZE
jne LOOP0
mov ebx, 222 ; IACA SUFFIX
db 0x64, 0x67, 0x90 ;
pop rbp ; restore the frame pointer pushed on entry
ret
IACA then tells me:
Throughput Analysis Report
--------------------------
Block Throughput: 8.00 Cycles Throughput Bottleneck: Port4
Port Binding In Cycles Per Iteration:
---------------------------------------------------------------------------------------
| Port | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 |
---------------------------------------------------------------------------------------
| Cycles | 1.0 0.0 | 1.0 | 5.3 4.0 | 5.3 4.0 | 8.0 | 1.0 | 1.0 | 5.3 |
---------------------------------------------------------------------------------------
N - port number or number of cycles resource conflict caused delay, DV - Divider pipe (on port 0)
D - Data fetch pipe (on ports 2 and 3), CP - on a critical path
F - Macro Fusion with the previous instruction occurred
* - instruction micro-ops not bound to a port
^ - Micro Fusion happened
# - ESP Tracking sync uop was issued
@ - SSE instruction followed an AVX256 instruction, dozens of cycles penalty is expected
! - instruction not supported, was not accounted in Analysis
| Num Of | Ports pressure in cycles | |
| Uops | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 | |
---------------------------------------------------------------------------------
| 1 | | | 1.0 1.0 | | | | | | | vmovaps ymm0, ymmword ptr [rbx+0x20]
| 1 | | | | 1.0 1.0 | | | | | | vmovaps ymm2, ymmword ptr [rbx+0x40]
| 1 | | | 1.0 1.0 | | | | | | | vmovaps ymm4, ymmword ptr [rbx+0x60]
| 1 | | | | 1.0 1.0 | | | | | | vmovaps ymm6, ymmword ptr [rbx+0x80]
| 1 | | | 1.0 1.0 | | | | | | | vmovaps ymm8, ymmword ptr [rbx+0xa0]
| 1 | | | | 1.0 1.0 | | | | | | vmovaps ymm10, ymmword ptr [rbx+0xc0]
| 1 | | | 1.0 1.0 | | | | | | | vmovaps ymm12, ymmword ptr [rbx+0xe0]
| 1 | | | | 1.0 1.0 | | | | | | vmovaps ymm14, ymmword ptr [rbx+0x100]
| 2^ | | | | | 1.0 | | | 1.0 | CP | vmovntps ymmword ptr [rcx+0x20], ymm0
| 2^ | | | | | 1.0 | | | 1.0 | CP | vmovntps ymmword ptr [rcx+0x40], ymm2
| 2^ | | | | | 1.0 | | | 1.0 | CP | vmovntps ymmword ptr [rcx+0x60], ymm4
| 2^ | | | | | 1.0 | | | 1.0 | CP | vmovntps ymmword ptr [rcx+0x80], ymm6
| 2^ | | | 0.3 | 0.3 | 1.0 | | | 0.3 | CP | vmovntps ymmword ptr [rcx+0xa0], ymm8
| 2^ | | | 0.3 | 0.3 | 1.0 | | | 0.3 | CP | vmovntps ymmword ptr [rcx+0xc0], ymm10
| 2^ | | | 0.3 | 0.3 | 1.0 | | | 0.3 | CP | vmovntps ymmword ptr [rcx+0xe0], ymm12
| 2^ | | | 0.3 | 0.3 | 1.0 | | | 0.3 | CP | vmovntps ymmword ptr [rcx+0x100], ymm14
| 1 | 1.0 | | | | | | | | | add rax, 0x20
| 1 | | 1.0 | | | | | | | | add rbx, 0x100
| 1 | | | | | | 1.0 | | | | add rcx, 0x100
| 1 | | | | | | | 1.0 | | | cmp rax, 0x8000000
| 0F | | | | | | | | | | jnz 0xffffffffffffff7a
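This tells me that the stores can now use port 7 for address generation and that their uops are micro-fused. IACA reports that the "Block Throughput" is still 8 cycles, which I assume is because of the extra operations needed to keep each address in a single register. Perhaps I am doing this wrong?
I still don't understand why the load operations can't be fused.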
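The store AGU on port 7 can only handle "simple" effective addresses (no index register), so your indexed stores also need the AGUs on the load ports. The IACA output does show that your loads are not actually competing with each other; it is the stores that are competing.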
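To make the port arithmetic concrete, here is a minimal sketch of the pressure calculation, assuming the port assignment described above: loads and store-address uops on ports 2 and 3, store-address uops also on port 7 when the address is simple, and store-data uops only on port 4. The function names are made up for illustration, and the model ignores everything except execution-port pressure.

```python
from fractions import Fraction


def agu_pressure(loads, stores, simple_store_addr):
    """Average cycles of address-generation work per AGU port, per iteration."""
    if simple_store_addr:
        # Store-address uops may also go to port 7, so the work spreads over
        # three ports, but the loads themselves still need port 2 or 3.
        return max(Fraction(loads + stores, 3), Fraction(loads, 2))
    # Indexed store addresses cannot use port 7: loads and store-address
    # uops all share ports 2 and 3.
    return Fraction(loads + stores, 2)


def cycles_per_iteration(loads, stores, simple_store_addr):
    """Port-pressure lower bound on cycles per loop iteration."""
    store_data = Fraction(stores)  # one store-data uop per store, port 4 only
    return max(store_data, agu_pressure(loads, stores, simple_store_addr))


if __name__ == "__main__":
    for simple in (False, True):
        print(
            "simple" if simple else "indexed",
            float(agu_pressure(8, 8, simple)),
            float(cycles_per_iteration(8, 8, simple)),
        )
```

With 8 loads and 8 stores this gives 8 cycles on ports 2 and 3 for the indexed version and 16/3 (about 5.3) for the single-register version, which matches the two IACA reports. Port 4 holds both versions at 8 cycles per iteration.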
Note that there are only about 10 fill buffers per core available for MOVNT stores, so those will fill up quickly and become the bottleneck.
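A rough way to see why the fill buffers matter is a concurrency-limited bandwidth estimate: lines in flight times line size, divided by how long each buffer stays occupied. The sketch below assumes 64-byte cache lines; the 100 ns occupancy figure is a made-up placeholder, not a measured Broadwell number, and the function name is hypothetical.

```python
def nt_store_bandwidth_gb_per_s(buffers, line_bytes, occupancy_ns):
    """Upper bound on store bandwidth when `buffers` lines can be in flight.

    bytes per nanosecond is numerically equal to GB/s (10**9 bytes/s).
    """
    return buffers * line_bytes / occupancy_ns


if __name__ == "__main__":
    # ASSUMPTION: 100 ns per buffer is illustrative only; substitute a
    # measured value for the machine under test.
    print(nt_store_bandwidth_gb_per_s(buffers=10, line_bytes=64, occupancy_ns=100))
```

Whatever the real occupancy time is, a bound of this form is independent of how many store uops the core can issue per cycle, which is why the IACA port numbers alone do not predict the throughput of a memory-bound loop.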
See also Micro fusion and addressing modes. If you use single-register addressing modes for your stores, they can micro-fuse and take fewer fused-domain uops.
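The fused-domain cost can be tallied directly from the "Num Of Uops" column of the two IACA reports in the question. The sketch below assumes IACA's accounting (2 uops per indexed store, 1 fused uop per single-register store, `cmp`/`jnz` macro-fused into one) and a 4-wide issue stage. It says nothing about what real hardware does with indexed stores, and the helper name is made up.

```python
ISSUE_WIDTH = 4  # fused-domain uops issued per cycle


def fused_uops(loads, stores, uops_per_store, alu_uops):
    """Fused-domain uops per iteration: 1 per load plus stores plus ALU work."""
    return loads + stores * uops_per_store + alu_uops


# Original loop: indexed stores counted as 2 uops each; add + fused cmp/jnz.
indexed = fused_uops(loads=8, stores=8, uops_per_store=2, alu_uops=2)
# Updated loop: micro-fused stores; three adds + fused cmp/jnz.
simple = fused_uops(loads=8, stores=8, uops_per_store=1, alu_uops=4)

if __name__ == "__main__":
    for name, uops in (("indexed", indexed), ("simple", simple)):
        print(name, uops, "uops,", uops / ISSUE_WIDTH, "issue cycles")
```

This gives 26 uops (6.5 issue cycles) for the indexed loop and 20 uops (5 issue cycles) for the single-register loop. Both are below the 8 cycles imposed by port 4, so in IACA's model the extra `add` instructions are not what keeps the block throughput at 8.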
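Also, I guess it doesn't matter for VEX-encoded instructions, but the SSE pd versions take an extra byte of x86 machine code. clang tends to use movaps for loads/stores because it is shorter, even on integer vectors. Every existing CPU runs movaps / movapd identically. So I'd suggest just using vmovaps / vmovntps. It won't make any difference at all here, though: there is just one fewer bit set in the VEX prefix.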
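The size claim can be checked against the encodings themselves. The sketch below hard-codes the machine-code bytes for one simple operand pair (register 0, memory operand `[rdi]`), chosen only for illustration, and compares lengths and set bits. The table and helper names are made up.

```python
# Machine-code bytes for a load from [rdi] into vector register 0.
ENCODINGS = {
    "movaps xmm0, [rdi]": bytes([0x0F, 0x28, 0x07]),
    "movapd xmm0, [rdi]": bytes([0x66, 0x0F, 0x28, 0x07]),  # 0x66 prefix
    "vmovaps ymm0, [rdi]": bytes([0xC5, 0xFC, 0x28, 0x07]),
    "vmovapd ymm0, [rdi]": bytes([0xC5, 0xFD, 0x28, 0x07]),  # pp field = 01
}


def set_bits(code):
    """Number of 1 bits across all bytes of an encoding."""
    return sum(bin(byte).count("1") for byte in code)


if __name__ == "__main__":
    for asm, code in ENCODINGS.items():
        print(f"{asm:22s} {code.hex(' ')}  len={len(code)}  bits={set_bits(code)}")
```

The legacy SSE pd form is one byte longer because of its 0x66 prefix, while the two VEX forms have the same length and differ only in a single bit of the second prefix byte (0xFC versus 0xFD).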