Does GLM use SIMD automatically? (and a question about glm performance)

Question

I want to check whether glm uses SIMD on my machine. CPU: 4th-gen i5, OS: ArchLinux (up to date), IDE: QtCreator.

I wrote a small application to test it:

#include <iostream>
#include <chrono>
//#define GLM_FORCE_SSE2
//#define GLM_FORCE_ALIGNED
#include <glm/glm.hpp>
#include <xmmintrin.h>
float glm_dot(const glm::vec4& v1, const glm::vec4& v2)
{
   auto start = std::chrono::steady_clock::now();
   auto res = glm::dot(v1, v2);
   auto end = std::chrono::steady_clock::now();
   std::cout << "glm_dot:\t\t" << res << " elasped time: " <<    std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count() << std::endl;
   return res;
}

float dot_pure(const glm::vec4& v1, const glm::vec4& v2)
{
   auto start = std::chrono::steady_clock::now();
   auto res = v1[0] * v2[0] + v1[1] * v2[1] + v1[2] * v2[2];
   auto end = std::chrono::steady_clock::now();
   std::cout << "dot_pure:\t\t" << res << " elasped time: " << std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count() << std::endl;
   return res;
}

float dot_simd(const float& v1, const float& v2)
{
   auto start = std::chrono::steady_clock::now();
   const __m128& v1m = reinterpret_cast<const __m128&>(v1);
   const __m128& v2m = reinterpret_cast<const __m128&>(v2);
   __m128 mul =  _mm_mul_ps(v1m, v2m);
   auto res = mul[0] + mul[1] + mul[2];
   auto end = std::chrono::steady_clock::now();
   std::cout << "dot_simd:\t\t" << res << " elasped time: " << std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count() << std::endl;
   return res;
}

float dot_simd_glm_type(const glm::vec4& v1, const glm::vec4& v2)
{
   auto start = std::chrono::steady_clock::now();
   const __m128& v1m = reinterpret_cast<const __m128&>(v1);
   const __m128& v2m = reinterpret_cast<const __m128&>(v2);
   __m128 mul =  _mm_mul_ps(v1m, v2m);
   auto res = mul[0] + mul[1] + mul[2];
   auto end = std::chrono::steady_clock::now();
   std::cout << "dot_simd_glm_type:\t" << res << " elasped time: " << std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count() << std::endl;
   return res;
}

int main()
{
   glm::vec4 v1 = {1.1f, 2.2f, 3.3f, 0.0f};
   glm::vec4 v2 = {3.0f, 4.0f, 5.0f, 0.0f};
   float v1_raw[] = {1.1f, 2.2f, 3.3f, 0.0f};
   float v2_raw[] = {3.0f, 4.0f, 5.0f, 0.0f};
   glm_dot(v1, v2);
   dot_pure(v1, v2);
   dot_simd(*v1_raw, *v2_raw);
   dot_simd_glm_type(v1, v2);
   return 0;
}

glm_dot() calls glm::dot; the other functions are my own implementations. When I run it in Debug mode, a typical result is:

glm_dot:        28.6 elasped time: 487
dot_pure:       28.6 elasped time: 278
dot_simd:       28.6 elasped time: 57
dot_simd_glm_type:  28.6 elasped time: 52

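glm::dot calls compute_dot::call from func_geometric.inl, which is a "pure" implementation of the dot function. I don't understand why glm::dot (usually) takes more time than my dot_pure() implementation, but this is Debug mode, so let's move on to Release: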

glm_dot:        28.6 elasped time: 116
dot_pure:       28.6 elasped time: 53
dot_simd:       28.6 elasped time: 54
dot_simd_glm_type:28.6 elasped time: 54

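Not always, but usually my pure implementation takes less time than the SIMD version. Maybe that is because the compiler can also use SIMD in my pure implementation; I don't know.

However, the glm::dot call is usually much slower than the other three implementations. Why? Maybe glm uses the pure implementation this time too? When I use ReleaseWithDebugInfos, that seems to be the case.

If I uncomment the two defines in the source code (to force the use of SIMD), I get better results, but the glm::dot call is usually still slower. (Debugging in ReleaseWithDebugInfos doesn't show anything this time.)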

glm_dot:        28.6 elasped time: 88
dot_pure:       28.6 elasped time: 63
dot_simd:       28.6 elasped time: 53
dot_simd_glm_type:28.6 elasped time: 53

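Shouldn't glm use SIMD by default whenever possible? According to the documentation, though, it may not be automatic at all:

"GLM provides some SIMD optimizations based on compiler intrinsics. These optimizations are applied automatically thanks to compiler arguments. For example, if a program is compiled with Visual Studio using /arch:AVX, GLM will detect this argument and generate code using AVX instructions automatically when available." (Source: https://chromium.googlesource.com/external/github.com/g-truc/glm/+/0.9.9-a2/manual.md)

There is a glm test called test-core_setup_message. If I run it, glm does not seem to detect my arch (which would mean SSE, SSE2, etc.):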

$ ./test-core_setup_message
__cplusplus: 201703
GCC 8
GLM_MODEL_64
GLM_ARCH:

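To summarize my question: does glm use SIMD instructions automatically or not? Some parts of the documentation say it is automatic, others say it depends on compiler flags. And when I force the use of SSE2, why is it still slower than my SIMD call?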

Answer 1

> If I uncomment the two defines in the source code (to force using simd) then I got better results, but usually the glm::dot call is still slower. (To debug in ReleaseWithDebugInfos doesn't show anything this time)

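Your test is not very rigorous, and it can easily run into memory-cache artifacts.

For example, just by changing the order of the tests I get the following (compiled with -O3 -march=x86-64 -mavx2 and with your defines unset):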

dot_simd:       28.6 elasped time: 170
dot_pure:       28.6 elasped time: 54
dot_simd_glm_type:  28.6 elasped time: 46
glm_dot:        28.6 elasped time: 47

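You need to run this kind of test with a benchmarking library, such as Google Benchmark.

But even then, "runs faster" is only a rough proxy for "uses SIMD". It is better to look at the generated assembly.

I removed the timing code from your example and got the following (see on godbolt):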
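To illustrate why a single timed call is so noisy, here is a minimal sketch of a repeated-measurement harness using only the standard library. It is not Google Benchmark and does not replace it; `Vec4`, `dot4`, and `ns_per_call` are hypothetical names introduced for this example, and `Vec4` is only a stand-in for `glm::vec4` so the snippet has no dependencies.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>

// Hypothetical stand-in for glm::vec4 so the sketch has no dependencies.
struct Vec4 { float x, y, z, w; };

// Plain four-component dot product.
inline float dot4(const Vec4& a, const Vec4& b)
{
    return a.x * b.x + a.y * b.y + a.z * b.z + a.w * b.w;
}

// Average nanoseconds per call of f(a, b) over `iters` timed calls,
// preceded by `warmup` untimed calls to warm caches and branch predictors.
template <typename F>
double ns_per_call(F f, const Vec4& a, const Vec4& b,
                   std::size_t iters = 1000000, std::size_t warmup = 10000)
{
    assert(iters > 0);
    volatile float seed = a.x;   // re-read every iteration so the call is not hoisted out of the loop
    volatile float sink = 0.0f;  // volatile store keeps every result observable

    for (std::size_t i = 0; i < warmup; ++i) {
        Vec4 in = a;
        in.x = seed;
        sink = f(in, b);
    }

    auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < iters; ++i) {
        Vec4 in = a;
        in.x = seed;
        sink = f(in, b);
    }
    auto end = std::chrono::steady_clock::now();

    return std::chrono::duration<double, std::nano>(end - start).count()
           / static_cast<double>(iters);
}
```

Calling `ns_per_call(dot4, v1, v2)` for each candidate, several times and in different orders, already removes most of the first-call and `std::cout` noise seen in the numbers above. A real benchmarking library additionally handles things this sketch only approximates, such as preventing the optimizer from removing the work and reporting statistical spread.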

glm_dot(glm::vec<4, float, (glm::qualifier)0> const&, glm::vec<4, float, (glm::qualifier)0> const&):
        vmovss  xmm0, DWORD PTR [rdi+4]
        vmovss  xmm1, DWORD PTR [rdi]
        vmulss  xmm0, xmm0, DWORD PTR [rsi+4]
        vmovss  xmm2, DWORD PTR [rdi+8]
        vmulss  xmm1, xmm1, DWORD PTR [rsi]
        vmulss  xmm2, xmm2, DWORD PTR [rsi+8]
        vaddss  xmm0, xmm0, xmm1
        vmovss  xmm1, DWORD PTR [rdi+12]
        vmulss  xmm1, xmm1, DWORD PTR [rsi+12]
        vaddss  xmm1, xmm1, xmm2
        vaddss  xmm0, xmm0, xmm1
        ret
dot_simd(float const&, float const&):
        vmovaps xmm1, XMMWORD PTR [rsi]
        vmulps  xmm1, xmm1, XMMWORD PTR [rdi]
        vshufps xmm2, xmm1, xmm1, 85
        vaddss  xmm0, xmm1, xmm2
        vunpckhps       xmm1, xmm1, xmm1
        vaddss  xmm0, xmm0, xmm1
        ret

So you are correct: SIMD is apparently not used by default.
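Since the manual says GLM reacts to compiler arguments, it can also help to confirm what the compiler itself reports for a given set of flags. The following is a minimal sketch, assuming GCC or Clang, which predefine `__SSE2__`, `__AVX__`, and `__AVX2__` when the corresponding instruction sets are enabled. `widest_simd` is a hypothetical helper name and is not part of GLM.

```cpp
// Returns the widest x86 SIMD extension the compiler was allowed to use
// for this translation unit, based on GCC/Clang predefined macros.
// This reflects compiler flags only; it says nothing about what a
// library such as GLM chooses to do with them.
constexpr const char* widest_simd()
{
#if defined(__AVX2__)
    return "AVX2";
#elif defined(__AVX__)
    return "AVX";
#elif defined(__SSE2__)
    return "SSE2";
#else
    return "none";
#endif
}
```

Printing `widest_simd()` from a build that uses the same flags as the test program shows whether those flags actually reached the compiler. If it reports AVX2 while the assembly for glm::dot is still scalar, as in the listing above, the remaining question is whether the library's own SIMD path is enabled, not whether the compiler supports the instructions.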
