When does MIO Throttle stall happen?

According to this link, https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html:

Warp was stalled waiting for the MIO (memory input/output) instruction queue to be not full. This stall reason is high in cases of extreme utilization of the MIO pipelines, which include special math instructions, dynamic branches, as well as shared memory instructions.

According to this one, https://docs.nvidia.com/drive/drive_os_5.1.12.0L/nsight-graphics/activities/index.html:

May be triggered by local, global, shared, attribute, IPA, indexed constant loads (LDC), and decoupled math.

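My understanding is that all memory operations are executed on the LSU, so I would imagine that they are stored together in the same instruction queue and then executed by the LSU. Since they would all be queued together, the second description (which includes global memory accesses) makes more sense to me. The problem is that, if this were the case, LG Throttle would be unnecessary.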

What does MIO Throttle actually mean? Are all memory instructions stored in the same queue?

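The MIO is a partition of the NVIDIA SM (starting with Maxwell) that contains the execution units shared among the 4 warp schedulers, as well as slower math execution units (e.g. the XU pipe).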

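Instructions for these execution units are first issued into instruction queues, which allows the warp schedulers to keep issuing independent instructions from the warp. If a warp's next instruction targets an instruction queue that is full, the warp stalls until the queue is no longer full and the instruction can be enqueued. When this stall occurs, the warp reports a throttle reason based on the instruction queue type. The mapping of instruction queues to pipes differs between chips. This is the general mapping: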
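The enqueue-or-stall behavior described above can be illustrated with a toy model. This is only a sketch under simplifying assumptions, not a description of the real hardware: the names `InstructionQueue` and `simulate`, the queue capacity, and the fixed issue and drain rates are all hypothetical, and a single issuer stands in for the warp schedulers.

```python
from collections import deque


class InstructionQueue:
    """Toy bounded queue standing in for one MIO instruction queue (hypothetical)."""

    def __init__(self, capacity, drain_per_cycle=1):
        self.capacity = capacity
        self.drain_per_cycle = drain_per_cycle
        self.entries = deque()

    def try_enqueue(self, instr):
        # A full queue rejects the instruction; the issuing warp must stall.
        if len(self.entries) >= self.capacity:
            return False
        self.entries.append(instr)
        return True

    def drain(self):
        # The execution unit retires a fixed number of entries per cycle.
        for _ in range(min(self.drain_per_cycle, len(self.entries))):
            self.entries.popleft()


def simulate(num_instructions, capacity, issue_per_cycle, drain_per_cycle=1):
    """Return (cycles, stall_cycles) for a stream of queue-bound instructions."""
    queue = InstructionQueue(capacity, drain_per_cycle)
    issued = 0
    cycles = 0
    stall_cycles = 0
    while issued < num_instructions:
        stalled = False
        for _ in range(issue_per_cycle):
            if issued == num_instructions:
                break
            if queue.try_enqueue(issued):
                issued += 1
            else:
                stalled = True  # queue full: this is the "throttle" stall
                break
        if stalled:
            stall_cycles += 1
        queue.drain()
        cycles += 1
    return cycles, stall_cycles
```

In this model, `simulate(8, capacity=4, issue_per_cycle=1)` never fills the queue, while `simulate(8, capacity=2, issue_per_cycle=2)` stalls on most cycles because instructions arrive faster than the queue drains. That is the situation the throttle stall reasons report.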

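mio_throttle (ADU, CBU, LSU, XU)
lg_throttle (LSU)
- lg_throttle: throttling local/global instructions early allows the SM to continue issuing shared memory instructions when there is L1 backpressure due to local/global L1 misses.
tex_throttle (TEX, FP64 on non-*100 chips, Tensor on TU11x)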

If the warp's next instruction targets a sub-partition-specific execution unit (FMA, ALU, Tensor, FP64 on *100 GPUs), the stall reason is math_throttle.
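As a quick way to read the general mapping above, here is a small lookup sketch. The table name `QUEUE_PIPES` and the function `candidate_stall_reasons` are hypothetical, the chip-dependent entries (FP64, and Tensor on TU11x) are deliberately left out, and the real mapping varies by chip.

```python
# General pipe-to-stall-reason mapping taken from the list above.
# Assumption: chip-dependent entries (FP64, Tensor on TU11x) are omitted.
QUEUE_PIPES = {
    "mio_throttle": ("ADU", "CBU", "LSU", "XU"),
    "lg_throttle": ("LSU",),
    "tex_throttle": ("TEX",),
    "math_throttle": ("FMA", "ALU", "Tensor"),
}


def candidate_stall_reasons(pipe):
    """Return the throttle reasons a stalled instruction for `pipe` may report."""
    return sorted(reason for reason, pipes in QUEUE_PIPES.items() if pipe in pipes)
```

For example, `candidate_stall_reasons("LSU")` returns both `lg_throttle` and `mio_throttle`, which matches the point that LSU instructions share the MIO queue while local/global instructions are throttled earlier.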

cuda

gpu

nvidia

nsight-compute