How do I monitor the amount of SIMD instruction usage

Question

How can I monitor a process's usage of SIMD (SSE, AVX, AVX2, AVX-512) instructions? For example, htop can monitor general CPU usage, but not the usage of specific SIMD instructions.

Answer 1

I think the only reliable way to count all SIMD instructions (not just FP math) is dynamic instrumentation (e.g. via something like Intel PIN / SDE).

See How do I determine the number of x86 machine instructions executed in a C program?, specifically sde64 -mix -- ./my_program, which prints the instruction mix for your program for that run.

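I don't think there's a good way to make this work like top / htop, if it's even possible to safely attach to already-running processes, especially multi-threaded ones.

It might also be possible to get dynamic instruction counts by using the last-branch-record facility to record and reconstruct the path of execution and count everything, but I don't know of any tools for that. In theory this could attach to already-running programs without much danger, but it would take a lot of computation (disassembling and counting instructions) to do it on the fly for all running processes. That is not like just asking the kernel for the CPU usage statistics it already tracks on context switches.

You'd need hardware instruction-counting support for this to be really efficient the way top is.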

For SIMD floating-point math specifically (not FP shuffles, just real FP math like vaddps), there are performance counter events.

For example, from perf list output:

fp_arith_inst_retired.128b_packed_single
[Number of SSE/AVX computational 128-bit packed single precision floating-point instructions retired. Each count represents 4 computations. Applies to SSE* and AVX* packed single precision floating-point instructions: ADD SUB MUL DIV MIN MAX RCP RSQRT SQRT DPP FM(N)ADD/SUB. DPP and FM(N)ADD/SUB instructions count twice as they perform multiple calculations per element]

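So it's not even counting uops, it's counting FLOPS. There are other events for ...pd packed double, and for 256-bit versions of each. (I assume that on CPUs with AVX512 there are also 512-bit vector versions of these events.)

You can use perf to count their execution globally, across processes and on all cores. Or for a single process: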
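To make the "counting FLOPS" point concrete, here is a minimal sketch that turns raw event counts into an estimated packed-FLOP total. It assumes the CSV layout of perf stat -x, (count in field 1, event name in field 3), and it assumes each count stands for one operation per vector element (vector width divided by 32 or 64 bits), which matches the "4 computations" quoted above for 128-bit packed single. The function name flops_from_perf_csv is hypothetical.

```shell
# Hypothetical helper: read `perf stat -x,`-style CSV lines on stdin and
# print an estimated number of packed floating-point operations.
# Assumption: field 1 is the count and field 3 is the event name.
flops_from_perf_csv() {
    awk -F, '
        $3 ~ /^fp_arith_inst_retired\.[0-9]+b_packed_(single|double)/ {
            split($3, part, ".")           # part[2] is e.g. "128b_packed_single:u"
            width = part[2] + 0            # leading digits: vector width in bits
            bits  = ($3 ~ /single/) ? 32 : 64
            # FMA/DPP are already counted twice by the hardware event itself,
            # so no extra factor is applied here.
            flops += $1 * (width / bits)
        }
        END { printf "%.0f\n", flops }
    '
}
```

perf stat normally writes its results to stderr, so something like perf stat -x, -e EVENTS ./my_program 2>&1 >/dev/null | flops_from_perf_csv should feed the counts into the helper; treat that pipeline as an untested sketch.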

## count math instructions only, not SIMD integer, load/store, or anything else
perf stat -e cycles:u,instructions:u,fp_arith_inst_retired.128b_packed_double:u,fp_arith_inst_retired.128b_packed_single:u,fp_arith_inst_retired.256b_packed_double:u,fp_arith_inst_retired.256b_packed_single:u  ./my_program

(I intentionally omitted fp_arith_inst_retired.scalar_{double,single} because you only asked about SIMD, and scalar instructions on XMM registers don't count, IMO.)
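Shell brace expansion produces space-separated words, while perf's -e option expects a single comma-separated list. One way around that is to build the list in a loop. This is a minimal sketch; the function name build_fp_events is hypothetical, and the event names are the ones discussed above, so they are only valid on CPUs whose perf list shows them.

```shell
# Hypothetical helper: print a comma-separated list of the packed FP
# arithmetic events for 128-bit and 256-bit vectors.
# Optional argument: event modifier to append (default ":u", user-space only).
build_fp_events() {
    suffix="${1-:u}"
    list=""
    for width in 128 256; do
        for prec in double single; do
            # ${list:+$list,} adds a comma only when the list is non-empty.
            list="${list:+$list,}fp_arith_inst_retired.${width}b_packed_${prec}${suffix}"
        done
    done
    printf '%s\n' "$list"
}

# Usage (needs perf and a CPU that exposes these events):
#   perf stat -e "cycles:u,instructions:u,$(build_fp_events)" ./my_program
```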

(You can attach perf to a running process by using -p PID instead of a command, or use perf top.)

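You can run perf stat -a to monitor globally across all cores, regardless of which process is executing. But again, this only counts FP math, not SIMD in general.

Still, it is hardware-supported, so it could be cheap enough for something like htop to use without wasting a lot of CPU time if you leave it running long-term.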
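As a rough illustration of htop-style periodic sampling, the sketch below only assembles a perf stat command line that prints counts once per interval (perf stat's -I option, in milliseconds), either for one PID or system-wide. The function name perf_simd_cmd and its "all" keyword are hypothetical, and the helper prints the command rather than running it, so it can be inspected first.

```shell
# Hypothetical helper: print (do not run) a perf stat command that reports
# packed-FP event counts every interval.
#   $1: a PID, or "all" for system-wide monitoring (-a)
#   $2: interval in milliseconds (default 1000)
perf_simd_cmd() {
    target="$1"
    interval_ms="${2-1000}"
    events="fp_arith_inst_retired.128b_packed_single,fp_arith_inst_retired.128b_packed_double,fp_arith_inst_retired.256b_packed_single,fp_arith_inst_retired.256b_packed_double"
    if [ "$target" = "all" ]; then
        scope="-a"
    else
        scope="-p $target"
    fi
    printf 'perf stat %s -I %s -e %s\n' "$scope" "$interval_ms" "$events"
}
```

For example, perf_simd_cmd 1234 prints a command that attaches to PID 1234 with a one-second interval. System-wide counting with -a usually needs elevated privileges or a permissive perf_event_paranoid setting.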

linux

intel

cpu-usage