Enabling AVX512 support on compilation significantly decreases performance

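I have a C/C++ project that uses a static library. The library is built for the 'skylake' architecture. The project is a data processing module, i.e. it performs many arithmetic operations, memory copies, searches, comparisons, etc.

The CPU is a Xeon Gold 6130T, which supports AVX512. I tried compiling my project with both -march=skylake and -march=skylake-avx512 and then linking it with the library.

With -march=skylake-avx512, project performance is significantly decreased (by 30% on average) compared with the project built with -march=skylake.

How can this be explained? What could be the reason?

Info: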

project performance is significantly decreased (by 30% on average)

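In code that cannot be easily vectorized, sporadic AVX instructions downclock your CPU without providing any benefit. In such scenarios you may want to turn off AVX instructions completely.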
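Before comparing the two builds, it can help to confirm which vector extensions the compiler actually enabled for a given translation unit. The following is a minimal sketch, assuming GCC or Clang and their predefined macros (`__AVX512F__`, `__AVX2__`, `__AVX__`, `__SSE2__`); the function names `widest_vector_isa` and `avx512_enabled` are hypothetical helpers, not part of any library.

```c
#include <string.h>

/* Hypothetical helper: reports the widest x86 vector extension this
   translation unit was compiled for, using GCC/Clang predefined macros. */
const char *widest_vector_isa(void)
{
#if defined(__AVX512F__)
    return "AVX-512F";
#elif defined(__AVX2__)
    return "AVX2";
#elif defined(__AVX__)
    return "AVX";
#elif defined(__SSE2__)
    return "SSE2";
#else
    return "none detected";
#endif
}

/* Returns 1 if the compiler was allowed to emit AVX-512F instructions
   for this translation unit, 0 otherwise. */
int avx512_enabled(void)
{
    return strcmp(widest_vector_isa(), "AVX-512F") == 0;
}
```

Built with -march=skylake this should report AVX2, and with -march=skylake-avx512 it should report AVX-512F. It only reflects compile-time settings for the file it is compiled in, so it says nothing about the prebuilt static library.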

See Advanced Vector Extensions, Downclocking:

Since AVX instructions are wider and generate more heat, Intel processors have provisions to reduce the Turbo Boost frequency limit when such instructions are being executed. The throttling is divided into three levels:

  • L0 (100%): The normal turbo boost limit.
  • L1 (~85%): The "AVX boost" limit. Soft-triggered by 256-bit "heavy" (floating-point unit: FP math and integer multiplication) instructions. Hard-triggered by "light" (all other) 512-bit instructions.
  • L2 (~60%): The "AVX-512 boost" limit. Soft-triggered by 512-bit heavy instructions.

The frequency transition can be soft or hard. Hard transition means the frequency is reduced as soon as such an instruction is spotted; soft transition means that the frequency is reduced only after reaching a threshold number of matching instructions. The limit is per-thread.

Downclocking means that using AVX in a mixed workload with an Intel processor can incur a frequency penalty despite it being faster in a "pure" context. Avoiding the use of wide and heavy instructions helps minimize the impact in these cases. AVX-512VL is an example of only using 256-bit operands in AVX-512, making it a sensible default for mixed loads.
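The trade-off described above can be put into rough numbers. This is only a back-of-the-envelope sketch: it assumes the reduced frequency applies to the whole run, reuses the approximate 100% / 85% / 60% limits from the quote, and the function `relative_runtime` and its parameters are hypothetical, not measurements of any real CPU.

```c
#include <assert.h>

/* Approximate turbo-limit factors quoted above for levels L0, L1, L2. */
#define FREQ_L0 1.00
#define FREQ_L1 0.85
#define FREQ_L2 0.60

/* Hypothetical model: runtime relative to a baseline build running at L0.
   freq_factor     : clock relative to L0, assumed to hold for the whole run.
   vector_fraction : share of the baseline runtime that gets vectorized (0..1).
   vector_speedup  : speedup of that share at equal clock (> 0). */
double relative_runtime(double freq_factor, double vector_fraction,
                        double vector_speedup)
{
    assert(freq_factor > 0.0 && vector_speedup > 0.0);
    assert(vector_fraction >= 0.0 && vector_fraction <= 1.0);
    /* Amdahl-style work estimate, then scaled by the lower clock. */
    double work = (1.0 - vector_fraction) + vector_fraction / vector_speedup;
    return work / freq_factor;
}
```

Under these assumptions, `relative_runtime(FREQ_L2, 0.1, 2.0)` is about 1.58 (slower than the baseline), while `relative_runtime(FREQ_L2, 1.0, 2.0)` is about 0.83 (faster). A value above 1 means the frequency penalty outweighs the vectorization gain, which is the mixed-workload case the answer describes.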

See also