Meaning of the "flop_count_sp" and "inst_fp_32" metrics in CUDA Profiler

Question

flop_count_sp: Number of single-precision floating-point operations executed by non-predicated threads (add, multiply and multiply-accumulate). Each multiply-accumulate operation contributes 2 to the count. The count does not include special operations.

inst_fp_32: Number of single-precision floating-point instructions executed by non-predicated threads (arithmetic, compare, etc.)

I have a kernel whose profiler output adds up as follows:

flop_count_sp = flop_count_sp_add + flop_count_sp_mul + 2 * flop_count_sp_fma
inst_fp_32 = flop_count_sp_add + flop_count_sp_mul + flop_count_sp_fma

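Given the numbers in these metrics, I am wondering what is an operation and what is an instruction here? It seems like a fma is one instruction, but two operations. Whereas add and mul is one instruction and one operation. Since the profiler counts SASS assembly: Are there any instructions that are not counted as operations? Or vice versa. I only want to know this in the context of the nvprof and nvvp metrics.

Also, when we talk about peak performance in TFLOP/s, the OP here corresponds to Operations i guess? If I want to estimate something like compute to global memory access (CGMA), should I use flop_count_sp instead of the inst_fp_32 for the compute part? Thanks in advance.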

Answer 1

I am wondering what is an operation and what is an instruction here? It seems like a fma is one instruction, but two operations. Whereas add and mul is one instruction and one operation.

Yes, correct. A Fused-Multiply-Add instruction counts as 2 operations (a multiply plus an add). A single multiply or add instruction counts as one operation.
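As a minimal sketch of this counting rule, the Python snippet below derives both totals from the three sub-counters. The counter values are made up for illustration, and the instruction total assumes the kernel issues no other single-precision instructions (such as compares), which inst_fp_32 would also count.

```python
def flop_count_sp(add, mul, fma):
    """Single-precision operations: each FMA contributes 2."""
    return add + mul + 2 * fma


def fp32_arith_instructions(add, mul, fma):
    """Single-precision add/mul/fma instructions: each counts once."""
    return add + mul + fma


# Hypothetical sub-counter values (not taken from a real profile).
add, mul, fma = 1000, 500, 2000

print(flop_count_sp(add, mul, fma))            # 5500
print(fp32_arith_instructions(add, mul, fma))  # 3500
```

The two totals differ by exactly the FMA count, so they coincide only for a kernel that issues no fused multiply-adds.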

Are there any instructions that are not counted as operations?

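Yes. Any instruction that does not use the single-precision (or double-precision, e.g. flop_count_dp) functional units makes no contribution to these metrics (either instructions or operations). For example, integer instructions and load or store instructions do not affect these metrics. I also do not believe that instructions which are floating-point in nature (e.g. conversion to/from floating point) but do not involve an add or multiply operation would affect the operation metrics.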

Also, when we talk about peak performance in TFLOP/s, the OP here corresponds to Operations i guess?

Yes.
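For illustration, peak single-precision throughput is commonly quoted by assuming every FP32 lane retires one fused multiply-add per clock, so each lane contributes 2 operations per cycle. The sketch below applies that assumption; the lane count and clock frequency are hypothetical and do not describe any particular GPU.

```python
def peak_sp_tflops(fp32_lanes, clock_ghz, ops_per_fma=2):
    """Peak single-precision TFLOP/s, assuming one FMA per lane per cycle."""
    ops_per_second = fp32_lanes * clock_ghz * 1e9 * ops_per_fma
    return ops_per_second / 1e12


# Hypothetical device: 2560 FP32 lanes at 1.5 GHz.
print(peak_sp_tflops(2560, 1.5))
```

A kernel made up only of plain adds or multiplies could reach at most half of that figure under the same assumption, since each instruction then contributes one operation rather than two.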

If I want to estimate something like compute to global memory access (CGMA), should I use flop_count_sp instead of the inst_fp_32 for the compute part?

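I think this is probably a matter of opinion. I would use instructions. As already mentioned, a fused-multiply-add instruction counts as 2 operations, but it does not "double" the pressure on the floating-point units. So when comparing codes to see the balance between global memory load/store activity and compute "pressure", I would use instructions. Again, probably a matter of opinion.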
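To make the difference concrete, here is a sketch that computes the ratio both ways. All numbers are hypothetical, and the global memory access count stands in for whichever load/store metric you decide to use, which is not specified here.

```python
def compute_to_memory_ratio(compute_count, global_mem_accesses):
    """Ratio of compute work to global memory accesses (CGMA-style)."""
    if global_mem_accesses <= 0:
        raise ValueError("global_mem_accesses must be positive")
    return compute_count / global_mem_accesses


# Hypothetical profile values (not from a real kernel).
add, mul, fma = 1000, 500, 2000
global_mem_accesses = 500

operations = add + mul + 2 * fma   # flop_count_sp-style count
instructions = add + mul + fma     # instruction-style count

print(compute_to_memory_ratio(operations, global_mem_accesses))    # 11.0
print(compute_to_memory_ratio(instructions, global_mem_accesses))  # 7.0
```

The operation-based ratio is inflated by the FMA count, while the instruction-based ratio tracks how many issue slots the floating-point units actually consumed per memory access.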
