How does mtune actually work?

Question

There is this related question: GCC: how is march different from mtune?

However, the existing answers don't go any further than the GCC manual itself. At most, we get:

If you use -mtune, then the compiler will generate code that works on any of them, but will favour instruction sequences that run fastest on the specific CPU you indicated.

and

The -mtune=Y option tunes the generated code to run faster on Y than on other CPUs it might run on.

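But how does GCC favour one specific architecture at build time while still producing code that can run on other (usually older) architectures, albeit more slowly?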

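I only know of one thing (but I'm no computer scientist) that is capable of this, and that's a CPU dispatcher. However, it doesn't seem (to me) that mtune generates a dispatcher behind the scenes; some other mechanism is probably at work.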

I feel that way for two reasons:

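Searching for "gcc mtune cpu dispatcher" doesn't find anything relevant; and
If it were based on a dispatcher, I think it would be smarter (even if through some option other than mtune) and test cpuid to detect the supported instructions at run time, instead of relying on a named architecture provided at build time.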

So how does it really work?

Answer 1

-mtune doesn't create a dispatcher, and it doesn't need one: we have already told the compiler which architecture we are targeting.
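For contrast, here is a rough sketch of what a hand-written dispatcher looks like. It is not something -mtune produces. The names sum_baseline, sum_wide, pick_sum and sum_dispatch are hypothetical, and sum_wide is only a placeholder for a kernel that would really be built for a newer instruction set. The sketch assumes GCC (or a compatible compiler) on x86, where the __builtin_cpu_init and __builtin_cpu_supports builtins are available; elsewhere it falls back to the baseline.

```c
#include <stddef.h>

/* Hypothetical baseline kernel: safe on every CPU of the architecture. */
static float sum_baseline(const float *v, size_t n)
{
    float r = 0.0f;
    for (size_t i = 0; i < n; i++)
        r += v[i];
    return r;
}

/* Placeholder: stands in for a version compiled for a newer ISA
   (for example AVX2). Here it simply reuses the baseline. */
static float sum_wide(const float *v, size_t n)
{
    return sum_baseline(v, n);
}

typedef float (*sum_fn)(const float *, size_t);

/* Pick an implementation at run time by asking the CPU what it supports.
   This run-time check is what a dispatcher does and -mtune does not. */
static sum_fn pick_sum(void)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx2"))
        return sum_wide;
#endif
    return sum_baseline;
}

float sum_dispatch(const float *v, size_t n)
{
    return pick_sum()(v, n);
}
```

With -mtune there is no such run-time branch: a single code path is emitted, and it only uses instructions that every permitted CPU can execute.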

From the GCC docs:

-mtune=cpu-type

Tune to cpu-type everything applicable about the generated code, except for the ABI and the
set of available instructions.

This means that GCC won't use instructions available only on cpu-type ¹, but it will generate code that runs optimally on cpu-type.

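To understand this last statement it is necessary to understand the difference between architecture and micro-architecture.
The architecture implies an ISA (Instruction Set Architecture), and that is not influenced by -mtune.
The micro-architecture is how the architecture is implemented in hardware. For the same instruction set (read: architecture), a code sequence may run optimally on one CPU (read: micro-architecture) but not on another, due to the internal details of the implementation. This can go as far as a code sequence being optimal on only one micro-architecture.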
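One way to see that -mtune leaves the instruction set alone is to look at the ISA feature macros the compiler predefines. The helper below is a hypothetical illustration (the function name is mine); it relies only on the __AVX2__ macro, which GCC defines when the selected instruction set includes AVX2.

```c
/* Reports whether the compiler was allowed to emit AVX2 instructions
   for this translation unit. __AVX2__ is predefined only when the
   selected instruction set includes AVX2 (for example through
   -march=skylake or -mavx2); tuning options do not affect it. */
int avx2_enabled_at_build_time(void)
{
#ifdef __AVX2__
    return 1;
#else
    return 0;
#endif
}
```

Assuming a compiler whose default baseline does not already include AVX2, building this with gcc -O2 -mtune=skylake should make the function return 0, while gcc -O2 -march=skylake should make it return 1, because only the latter changes the set of available instructions.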

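When generating machine code, GCC often has a degree of freedom in choosing how to order the instructions and which variant to use.
It uses heuristics to generate a sequence of instructions that runs fast on the most common CPUs; sometimes it will sacrifice a 100% optimal solution for CPU x if that would penalize CPUs y, z and w.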

When we use -mtune=x we are fine-tuning the output of GCC for CPU x, thereby producing code that is 100% optimal (from GCC's perspective) on that CPU.

As a concrete example, consider how this code is compiled:

float bar(float a[4], float b[4])
{
    for (int i = 0; i < 4; i++)
    {
        a[i] += b[i];
    }

    float r=0;

    for (int i = 0; i < 4; i++)
    {
        r += a[i];
    }

    return r;
}

The vectorization of a[i] += b[i]; (assuming the vectors don't overlap) differs depending on whether we tune for Skylake or Core2:

Skylake

    movups  xmm0, XMMWORD PTR [rsi]
    movups  xmm2, XMMWORD PTR [rdi]
    addps   xmm0, xmm2
    movups  XMMWORD PTR [rdi], xmm0
    movss   xmm0, DWORD PTR [rdi]

Core2

    pxor    xmm0, xmm0
    pxor    xmm1, xmm1
    movlps  xmm0, QWORD PTR [rdi]
    movlps  xmm1, QWORD PTR [rsi]
    movhps  xmm1, QWORD PTR [rsi+8]
    movhps  xmm0, QWORD PTR [rdi+8]
    addps   xmm0, xmm1
    movlps  QWORD PTR [rdi], xmm0
    movhps  QWORD PTR [rdi+8], xmm0
    movss   xmm0, DWORD PTR [rdi]

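The main difference is how an xmm register is loaded: on Core2 it is loaded with two loads, using movlps and movhps, instead of a single movups.
The two-load approach is better on the Core2 micro-architecture: if you take a look at Agner Fog's instruction tables you'll see that movups is decoded into 4 uops and has a latency of 2 cycles, while each movXps is 1 uop with 1 cycle of latency.
This is probably because 128-bit accesses were split into two 64-bit accesses at the time.
On Skylake the opposite is true: movups performs better than two movXps.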
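The two load strategies can also be written out by hand with SSE intrinsics, which makes it easier to see that they are the same computation expressed with different instructions from the same instruction set. This is only a sketch: the function names are hypothetical, the intrinsics come from xmmintrin.h, and an optimizing compiler is free to pick different instructions than the ones suggested in the comments. On targets without SSE the code falls back to a scalar loop.

```c
#if defined(__SSE__)
#include <xmmintrin.h>
#endif

/* a[i] += b[i] for four floats, one unaligned 128-bit load per array
   (the movups shape). */
void add4_one_load(float *a, const float *b)
{
#if defined(__SSE__)
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(a, _mm_add_ps(va, vb));
#else
    for (int i = 0; i < 4; i++)
        a[i] += b[i];
#endif
}

/* Same result, but each register is filled from two 64-bit halves
   (the movlps + movhps shape). */
void add4_two_loads(float *a, const float *b)
{
#if defined(__SSE__)
    __m128 va = _mm_setzero_ps();
    __m128 vb = _mm_setzero_ps();
    va = _mm_loadl_pi(va, (const __m64 *)a);        /* low half of a  */
    vb = _mm_loadl_pi(vb, (const __m64 *)b);        /* low half of b  */
    vb = _mm_loadh_pi(vb, (const __m64 *)(b + 2));  /* high half of b */
    va = _mm_loadh_pi(va, (const __m64 *)(a + 2));  /* high half of a */
    va = _mm_add_ps(va, vb);
    _mm_storel_pi((__m64 *)a, va);
    _mm_storeh_pi((__m64 *)(a + 2), va);
#else
    for (int i = 0; i < 4; i++)
        a[i] += b[i];
#endif
}
```

Both functions run on any CPU with SSE and leave identical values in a; which one is faster depends on the micro-architecture, and that choice is exactly what -mtune makes for you.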

So we have to pick one.
In general, GCC picks the first variant because Core2 is an old micro-architecture, but we can override this with -mtune.
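Besides the command-line switch, GCC on x86 also documents a tune= option for the target function attribute, which makes it possible to compare two tunings of the same source in a single file. The sketch below assumes a reasonably recent GCC on x86; the macro and function names are hypothetical, and on other compilers or targets the attribute is simply left out. The restrict qualifiers are my addition to express the "vectors don't overlap" assumption.

```c
#if defined(__GNUC__) && !defined(__clang__) && \
    (defined(__x86_64__) || defined(__i386__))
#define TUNE_CORE2   __attribute__((target("tune=core2")))
#define TUNE_SKYLAKE __attribute__((target("tune=skylake")))
#else
/* Assumption: without GCC on x86 both copies are compiled identically. */
#define TUNE_CORE2
#define TUNE_SKYLAKE
#endif

/* Same source as bar, tuned for Core2. */
TUNE_CORE2
float bar_core2(float *restrict a, const float *restrict b)
{
    for (int i = 0; i < 4; i++)
        a[i] += b[i];

    float r = 0;
    for (int i = 0; i < 4; i++)
        r += a[i];
    return r;
}

/* Same source as bar, tuned for Skylake. */
TUNE_SKYLAKE
float bar_skylake(float *restrict a, const float *restrict b)
{
    for (int i = 0; i < 4; i++)
        a[i] += b[i];

    float r = 0;
    for (int i = 0; i < 4; i++)
        r += a[i];
    return r;
}
```

Compiling this with gcc -O3 -S and comparing the two function bodies is one way to check what the tuning changes with a given compiler version; both functions return the same value and both remain runnable on any CPU of the selected architecture.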

¹ The instruction set is selected with other switches.
