Why are there two warp schedulers in an SM of a GPU?

I read the NVIDIA Fermi whitepaper and got confused when I counted the SP cores and the schedulers.

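According to the whitepaper, each SM has two warp schedulers and two instruction dispatch units, which allows two warps to be issued and executed concurrently. An SM has 32 SP cores, and each core has a fully pipelined ALU and FPU that execute the instructions of one thread.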

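As we all know, a warp consists of 32 threads. If only one warp were issued per cycle, all threads in that warp would occupy all the SP cores and finish execution in one cycle (assuming there are no stalls).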

However, NVIDIA designed a dual scheduler that selects two warps and issues one instruction from each warp to a group of 16 cores, 16 load/store units, or 4 SFUs.

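NVIDIA says this design achieves peak hardware performance. Perhaps the peak performance comes from interleaving the execution of different instructions, which makes full use of the hardware resources.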

My questions are as follows (assume there are no memory stalls and all operands are available):

Does each warp need two cycles to finish execution, and are the 32 SP cores divided into two groups, one for each warp scheduler?

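Yes. Unlike later generations, Fermi has a "hotclock" (shader clock) that runs at 2x the "core" clock. Each single-precision floating-point instruction (for example) is issued over 2 hotclocks, but to the same group of 16 SP cores. The net effect is one issue per scheduler per "core" clock.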
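The arithmetic behind that answer can be checked with a small back-of-the-envelope model. This is only a sketch: the constant and function names are made up for illustration, and it assumes each SP core accepts one thread's operation per hotclock.

```python
WARP_SIZE = 32                 # threads per warp
CORES_PER_GROUP = 16           # SP cores that receive one warp instruction
SCHEDULERS = 2                 # warp schedulers per SM
HOTCLOCKS_PER_CORE_CLOCK = 2   # the hotclock runs at 2x the core clock


def hotclocks_per_warp_instruction(warp_size=WARP_SIZE, lanes=CORES_PER_GROUP):
    """Hotclocks needed to push one warp instruction through a group of lanes."""
    return -(-warp_size // lanes)  # ceiling division


def warp_instructions_per_core_clock():
    """Warp instructions issued per core clock across both schedulers."""
    per_scheduler = HOTCLOCKS_PER_CORE_CLOCK / hotclocks_per_warp_instruction()
    return SCHEDULERS * per_scheduler


def thread_ops_per_core_clock():
    """Thread-level operations started per core clock."""
    return warp_instructions_per_core_clock() * WARP_SIZE


print(hotclocks_per_warp_instruction())    # hotclocks per warp instruction
print(warp_instructions_per_core_clock())  # warp instructions per core clock
print(thread_ops_per_core_clock())         # thread operations per core clock
```

Under these assumptions the thread-operation count per core clock equals 32 cores times 2 hotclocks, which is another way of saying that both groups of 16 cores are kept busy on every hotclock.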

Are the LD/ST and SFU units shared by all warps (that is, do they look the same to warps from both schedulers)?

I don't quite understand the question. All execution resources are shared/available for instructions coming from either scheduler.
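As a rough illustration of how long one warp instruction occupies each kind of shared unit, the sketch below uses the unit counts quoted in the question (16 cores per group, 16 load/store units, 4 SFUs). The dictionary keys and the function name are hypothetical, and the model assumes every unit accepts one thread per hotclock.

```python
WARP_SIZE = 32  # threads per warp

# Unit counts quoted from the whitepaper in the question above.
UNITS = {"sp_core_group": 16, "load_store": 16, "sfu": 4}


def hotclocks_to_issue(unit, warp_size=WARP_SIZE):
    """Hotclocks one warp instruction occupies the given unit type
    (assumption: one thread per unit per hotclock)."""
    lanes = UNITS[unit]
    return -(-warp_size // lanes)  # ceiling division


issue_cost = {name: hotclocks_to_issue(name) for name in UNITS}
print(issue_cost)
```

In this model either scheduler can target any of the unit types; the only difference between them is how many hotclocks a warp instruction keeps the unit busy.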

If a warp is divided into two parts, which part is scheduled first? Is there a scheduler for that, or is one part simply selected at random?

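Why does it matter? The machine behaves as if two complete warp instructions are scheduled in one core clock, i.e. "dual issue". In any case, you have no visibility into anything that happens at the hotclock level.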

What is the advantage of this design? Is it just to maximize hardware utilization?

Yes, as stated in the Fermi whitepaper:

" Using this elegant model of dual-issue, Fermi achieves near peak hardware performance. "