aarch64 上未对齐 SIMD load/store 的性能

Question

一个表示aarch64支持未对齐reads/writes并且提到了性能成本，但不清楚答案是否也仅涵盖ALU或SIMD（128位寄存器）操作.

相对于对齐的 128 位 NEON 加载和存储，aarch64 上未对齐的 128 位 NEON 加载和存储要慢多少（如果有的话）？

对于未对齐的 SIMD 加载和存储是否有单独的指令（如 SSE2 的情况）或者已知对齐的 loads/stores 是否与可能未对齐的指令相同 loads/stores？

Answer 1

根据 4.6 Load/Store 对齐 部分中的 Cortex-A57 Software Optimization Guide，它说：

The ARMv8-A architecture allows many types of load and store accesses to be arbitrarily aligned. The Cortex-A57 processor handles most unaligned accesses without performance penalties. However, there are cases which reduce bandwidth or incur additional latency, as described below:

Load operations that cross a cache-line (64-byte) boundary

Store operations that cross a 16-byte boundary

所以这可能取决于您使用的处理器，无序（A57、A72、A-72、A-75）或有序（A-35、A-53、A-55） .我没有找到任何针对有序处理器的优化指南，但是它们确实有一个硬件性能计数器，您可以使用它来检查未对齐指令的数量是否会影响性能：

    0xOF_UNALIGNED_LDST_RETIRED Unaligned load-store

这可以与 perf 工具一起使用。

AArch64 中没有针对未对齐访问的特殊说明。

Answer 2

如果 load/store 必须拆分或跨越缓存行，则至少需要一个额外的周期。

有详尽的表格指定各种对齐所需的周期数和 Cortex-A8 (in-order) and Cortex-A9 的寄存器数量（部分 OoO）。例如，与 64 位对齐访问相比，具有一个 reg 的 vld1 对未对齐访问有 1 个周期的惩罚。

Cortex-A55（按顺序）最多执行 64 位加载和 128 位存储，因此，its optimization manual 的第 3.3 节指出：

• Load operations that cross a 64-bit boundary
• 128-bit store operations that cross a 128-bit boundary

Cortex-A75 (OoO) 根据 its optimization guide 的第 5.4 节受到惩罚：

• Load operations that cross a 64-bit boundary.
• In AArch64, all stores that cross a 128-bit boundary.
• In AArch32, all stores that cross a 64-bit boundary.

正如 Guillermo 的回答一样，A57 (OoO) 对以下方面有处罚：

• Load operations that cross a cache-line (64-byte) boundary
• Store operations that cross a [128-bit] boundary

鉴于 A55 和 A75 会，我有点怀疑 A57 不会因跨越 64 位边界而受到惩罚。所有这些都有 64 字节的缓存行；他们也应该对跨越缓存行进行处罚。最后，请注意有 unpredictable behavior for split access crossing pages.

从使用 Cavium ThunderX 的一些粗略测试（没有性能计数器）来看，似乎有接近 2 个周期的惩罚，但这可能是背靠背未对齐加载和存储的附加效果一个循环。

AArch64 NEON 指令不区分对齐和未对齐（参见LD1 for example). For AArch32 NEON, alignment is specified statically in the addressing (VLDn）：

vld1.32 {d16-d17}, [r0]    ; no alignment
vld1.32 {d16-d17}, [r0@64] ; 64-bit aligned
vld1.32 {d16-d17}, [r0:64] ; 64 bit-aligned, used by GAS to avoid comment ambiguity

我不知道在 AArch32 模式下，在最近的芯片运行上，不带对齐限定符的对齐访问执行速度是否比带对齐限定符的访问慢。 ARM 的一些旧文档鼓励尽可能使用限定符。（英特尔改进了他们的芯片，通过比较，未对齐和对齐的移动在地址对齐时执行相同。）

如果您使用的是内在函数，MSVC 有接受对齐的 _ex 后缀变体。让 GCC 发出对齐限定符的可靠方法是使用 __builtin_assume_aligned.

// MSVC
vld1q_u16_ex(addr, 64);
// GCC:
addr = (uint16_t*)__builtin_assume_aligned(addr, 8);
vld1q_u16(addr);

Answer 3

对齐提示未在 aarch64 上使用。它们是透明的。如果指针与数据类型大小对齐，性能优势是自动的。

如有疑问，对于 GCC/Clang，请在变量声明中使用 __attribute__((__aligned__(16)))。

aarch64 上未对齐 SIMD load/store 的性能

Performance of unaligned SIMD load/store on aarch64

simd

alignment

neon

arm64