AVX load instruction with increment

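Is there an AVX instruction that loads four double-precision values from a regular aligned array with a stride? That is, I would like a call like _mm256_load_pd(a), but with a stride of 4: instead of loading a[0], a[1], a[2] and a[3], it would load a[0], a[4], a[8] and a[12].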

If you have AVX2 (Haswell and later), you can use gather loads, e.g. _mm256_i32gather_pd. From the Intel Intrinsics Guide:

Synopsis

__m256d _mm256_i32gather_pd (double const* base_addr, __m128i vindex, const int scale)

#include "immintrin.h"

Instruction: vgatherdpd ymm, vm64x, ymm

CPUID Flags: AVX2

Description

Gather double-precision (64-bit) floating-point elements from memory using 32-bit indices. 64-bit elements are loaded from addresses starting at base_addr and offset by each 32-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst. scale should be 1, 2, 4 or 8.
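A minimal sketch of the stride case from the question, assuming GCC or Clang on x86-64. The hypothetical helper gather_stride_pd builds the index vector {0, s, 2s, 3s} with _mm_setr_epi32 and passes scale 8 (sizeof(double)), so lane i reads from base_addr + 8 * vindex[i] bytes, i.e. a[i * s]. The __attribute__((target("avx2"))) is only there so the snippet compiles without -mavx2; in real code you would normally build with -mavx2 and return the __m256d directly instead of storing it to out.

```c
#include <immintrin.h>

/* Hypothetical helper: gathers a[0], a[s], a[2*s], a[3*s] into one __m256d
   with the AVX2 gather intrinsic, then stores the four lanes to out[0..3].
   Indices are signed 32-bit values, so 3*s must fit in an int. */
__attribute__((target("avx2")))
void gather_stride_pd(const double *a, int s, double out[4])
{
    /* Element indices, lowest lane first: {0, s, 2s, 3s}. */
    __m128i vindex = _mm_setr_epi32(0, s, 2 * s, 3 * s);
    /* Scale 8 turns element indices into byte offsets (sizeof(double)). */
    __m256d v = _mm256_i32gather_pd(a, vindex, 8);
    /* Unaligned store so out needs no special alignment. */
    _mm256_storeu_pd(out, v);
}
```

With s = 4 and a holding 0.0, 1.0, ..., 15.0, out receives 0, 4, 8 and 12. Unlike _mm256_load_pd, the gather does not require a to be 32-byte aligned.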

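As noted in the comments, gather loads are slow on Haswell, but they may still be worthwhile if you need this access pattern to feed subsequent 256-bit SIMD operations. Since you are working with doubles, though, each gather fetches only four elements, so any benefit is likely to be small, and you may also want to benchmark against a conventional scalar implementation.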
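For that comparison, a scalar-load baseline might look like the following sketch (hypothetical helper name, same GCC/Clang target-attribute assumption as above). It builds the same vector from four ordinary loads with _mm256_setr_pd, which takes its arguments lowest lane first, and needs only AVX rather than AVX2. Compilers usually turn this into scalar loads plus insert/unpack shuffles.

```c
#include <immintrin.h>

/* Hypothetical baseline: builds {a[0], a[s], a[2*s], a[3*s]} from four
   scalar loads. _mm256_setr_pd lists elements lowest lane first.
   Requires only AVX. Stores the lanes to out[0..3] for inspection. */
__attribute__((target("avx")))
void setr_stride_pd(const double *a, int s, double out[4])
{
    __m256d v = _mm256_setr_pd(a[0], a[s], a[2 * s], a[3 * s]);
    _mm256_storeu_pd(out, v);
}
```

Timing this against the gather version on your target CPU, inside the actual loop that consumes the vectors, is the most reliable way to decide between them.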

x86

simd

vectorization

avx