Why is the performance of non-superscalar parts of a superscalar processor affected?

Question

In the second-to-last paragraph of the ILP section of Wikipedia's CPU article:

In the case where a portion of the CPU is superscalar and part is not, the part which is not suffers a performance penalty due to scheduling stalls. The Intel P5 Pentium had two superscalar ALUs which could accept one instruction per clock cycle each, but its FPU could not accept one instruction per clock cycle. Thus the P5 was integer superscalar but not floating point superscalar.

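What is a scheduling stall? Why does the non-superscalar part of the CPU suffer a performance penalty?

Is this saying that the scalar part is slower than it would be if the rest of the CPU were scalar as well?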

Answer 1

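A superscalar processor/core is one that can decode multiple instructions in parallel, and the processor/core is regarded as superscalar as a whole, not in parts. I am not sure whether "scheduling stalls" has a standard definition, but the author's intent is to emphasize stalls in the pipeline caused by the limit on execution units (only two ALUs and one FPU); it would have been clearer had he used the term "pipeline stalls". A stall occurs when the input operands of an execution unit are not available, or when the operands are available but the execution unit is not. For example, in the following code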

int i1=1, i2=2, i3=3, i4=4;
float f1=1.0, f2=2.0;

i1 = i1 + i2;
i3 = i3 + i4;
f1 = f1 / (float) i1;
f2 = f2 / (float) i2;

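The first two statements can execute in parallel, but because there is only one floating-point unit, the last two cannot. The fourth statement therefore has to wait to be scheduled, because the floating-point unit is already occupied.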
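
The stall described above can be made visible with a toy issue model. This is only a sketch under stated assumptions, not a model of the real P5: the function name `issue_cycles`, the `'I'`/`'F'` encoding, and the rule that every unit accepts one operation per cycle are all made up for illustration, and data dependencies and latencies are ignored (in the snippet above, `f1` actually depends on `i1`).

```c
#include <assert.h>
#include <stddef.h>

/* Toy in-order issue model (an assumption for illustration, not the real P5).
 * Each cycle issues up to `width` instructions in program order, limited by
 * `n_alu` integer units and `n_fpu` floating-point units.
 * 'F' = floating-point op, any other character = integer op.
 * Data dependencies and instruction latencies are ignored. */
int issue_cycles(const char *ops, int width, int n_alu, int n_fpu)
{
    assert(ops != NULL && width > 0 && n_alu > 0 && n_fpu > 0);

    int cycles = 0;
    size_t i = 0;
    while (ops[i] != '\0') {
        int issued = 0, alu_used = 0, fpu_used = 0;
        /* Fill this cycle's issue slots until a needed unit is busy. */
        while (ops[i] != '\0' && issued < width) {
            if (ops[i] == 'F') {
                if (fpu_used == n_fpu)
                    break;          /* FPU busy: in-order issue stalls here */
                fpu_used++;
            } else {
                if (alu_used == n_alu)
                    break;          /* no free ALU: stall */
                alu_used++;
            }
            issued++;
            i++;
        }
        cycles++;
    }
    return cycles;
}
```

Under these assumptions, `issue_cycles("IIFF", 2, 2, 1)` needs 3 issue cycles, because the second floating-point op cannot share a cycle with the first, while a hypothetical second FPU (`n_fpu = 2`) would bring it down to 2.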

Answer 2

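I haven't heard the term "scheduling stall" before, but it sounds like it just means the pipeline will bottleneck on the scalar part.

The scalar part still runs at its maximum throughput. So I think the wording in that Wikipedia article is misleading: "the part which is not suffers a performance penalty" makes it sound as if the scalar part does not reach its own maximum throughput.

If the superscalar part of the CPU expects to issue 2 instructions per cycle but can only issue 1 because no execution resource is available, I guess that counts as a "stall".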

Answer 3

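From Wikipedia:

Structural hazards

A structural hazard occurs when a part of the processor's hardware is needed by two or more instructions at the same time. A canonical example is a single memory unit that is accessed both in the fetch stage, where an instruction is retrieved from memory, and in the memory stage, where data is written to and/or read from memory.[3] They can often be resolved by separating the component into orthogonal units (such as separate caches) or by bubbling the pipeline.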

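The P5 is a pipelined, in-order, superscalar (dual-issue) microprocessor. It can therefore start 1 integer operation and 1 operation of any kind in the same cycle, but it cannot reorder them: instructions are issued in pairs, not independently.

The important cases are: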

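The CPU does not know whether an instruction is fp or int until the decode stage (unless it was prefetched), so the 2x fetch+decode capability is of no use when only a single pipeline is available.
Even if multiple fp_add can be executing at the same time, they cannot start at the same time --> performance hit.
If an instruction is not pairable, one pipeline stalls, the stage moves to the other pipeline, and the two instructions are serialized.
If both instructions of a dual issue are fp and one is not pairable, they are at least pipelined, but only a few instructions can be in flight at once (the fp pipeline has 8 stages, and more when stalled, which is worse), so it is less efficient than integer.
Some iterative fp functions occupy all of the FPU's resources, so no other fp instruction can be pipelined behind them (FDIV stalls any new fp instruction). That may be 0.1-0.5 IPC. They reuse registers and cannot be used at the same time as MMX instructions.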

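At most 3x fp_add can be executing at the same time (but, unlike integer instructions, they cannot start at the same time), or 1 fp_mul + 2 fp_add can be executing at the same time (again without starting at the same time). The FPU consists of 8 stages, so with more than 3 adds the later ones start to stall until the first one completes.

Maybe instead of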

int = int + int              // start at cycle-0
int = int + int              // start at cycle-0
float = float + float        // start at cycle-1
float = float + float        // start at cycle-2

you could try

float = float + float     // start at cycle-0
int = int + int           // start at cycle-0
float = float + float     // start at cycle-1
int = int + int           // start at cycle-1

to make pairing work and get better performance.

Pure integer:
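
The reordering suggested above can be sketched as a greedy interleave. This is a minimal illustration, assuming all operations are independent so that any order is legal; the helper name `interleave_ops` and the `'I'`/`'F'` string encoding are hypothetical, and a real compiler or hand scheduler would also have to respect data dependencies and the actual pairing rules.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: reorder a string of independent ops so that
 * floating-point ('F') and integer ops alternate for as long as both
 * kinds remain. Any character other than 'F' is treated as an integer
 * op and written back as 'I'.
 * `out` must have room for strlen(in) + 1 bytes. */
void interleave_ops(const char *in, char *out)
{
    assert(in != NULL && out != NULL);

    size_t n = strlen(in);
    size_t n_fp = 0;
    for (size_t k = 0; k < n; k++)
        if (in[k] == 'F')
            n_fp++;
    size_t n_int = n - n_fp;

    size_t pos = 0;
    while (n_fp > 0 || n_int > 0) {
        /* Emit one fp op, then one int op, so each pair mixes both kinds. */
        if (n_fp > 0)  { out[pos++] = 'F'; n_fp--; }
        if (n_int > 0) { out[pos++] = 'I'; n_int--; }
    }
    out[pos] = '\0';
}
```

For the four-instruction example, the input `"IIFF"` (int, int, float, float) comes out as `"FIFI"`, which is the float/int/float/int order shown above.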

 i0 fetch
 i0 decode i1 fetch
 i0 operands from mem  i1 decode i2 fetch 
 i0 execute i1 operands i2 decode ... i3 something
 i0 store i1 execute i2 operands ...  i4 something

4 pipelines running at the same time = 8 instructions running without stalls = 2 IPC

Pure floating point:

 f0-fetch
 f0-decode f1-fetch
 f0-operand f1-decode f2-fetch
 f0-execute f1-operand f2-decode (3rd fp issued, no more fp additions)
 f0-store ....
 fx-fetch

3 run in parallel, less than 1 IPC. FDIV is even slower.

cpu

processor

cpu-architecture