Meaning of the "flop_count_sp" and "inst_fp_32" metrics in CUDA Profiler

According to the profiler user guide:

flop_count_sp: Number of single-precision floating-point operations executed by non-predicated threads (add, multiply and multiply-accumulate). Each multiply-accumulate operation contributes 2 to the count. The count does not include special operations.

inst_fp_32: Number of single-precision floating-point instructions executed by non-predicated threads (arithmetic, compare, etc.)

I have a kernel whose profiler output adds up as follows:

flop_count_sp = flop_count_sp_add + flop_count_sp_mul + 2 * flop_count_sp_fma
inst_fp_32 = flop_count_sp_add + flop_count_sp_mul + flop_count_sp_fma

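Given the numbers in these metrics, I am wondering what is an operation and what is an instruction here? It seems like an fma is one instruction but two operations, whereas add and mul are each one instruction and one operation. Since the profiler counts at the SASS assembly level, are there any instructions that are not counted as operations? Or vice versa? I only want to know in the context of the nvprof and nvvp metrics.

Also, when we talk about peak performance in TFLOP/s, I guess the OP here corresponds to operations? If I want to estimate something like compute to global memory access (CGMA), should I use flop_count_sp instead of inst_fp_32 for the compute part? Thanks in advance.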

I am wondering what is an operation and what is an instruction here? It seems like an fma is one instruction but two operations, whereas add and mul are each one instruction and one operation.

Yes, correct. A fused multiply-add instruction counts as 2 operations (a multiply plus an add). A single multiply or add instruction counts as one operation.
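To make the counting rule concrete, here is a minimal Python sketch. The helper name `sp_metrics` and the sample counts are made up for illustration; only the weighting (1 per add, 1 per mul, 2 per FMA) comes from the formulas in the question, and the sketch assumes the kernel issues no other single-precision instructions such as compares.

```python
def sp_metrics(n_add, n_mul, n_fma):
    """Return (operation count, instruction count) from per-type counts.

    Hypothetical helper: mirrors the two formulas in the question and
    assumes add, mul and FMA are the only FP32 instructions executed.
    """
    flop_count_sp = n_add + n_mul + 2 * n_fma  # an FMA is two operations
    inst_count = n_add + n_mul + n_fma         # but only one instruction
    return flop_count_sp, inst_count


# Made-up counts, not taken from a real profile.
flops, insts = sp_metrics(n_add=100, n_mul=50, n_fma=200)
print(flops, insts)  # 550 350
```

The gap between the two numbers is exactly the FMA count, so a kernel dominated by FMAs will report close to twice as many operations as instructions.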

Are there any instructions that are not counted as operations?

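Yes. Any instruction that does not use the single-precision (or double-precision, e.g. for flop_count_dp) functional units contributes nothing to these metrics (neither the instruction count nor the operation count). For example, integer instructions and load or store instructions do not affect these metrics. I also don't believe that instructions which are floating-point in nature (such as conversions to/from floating point) but contain no add or multiply operation contribute to the operation metrics.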

Also, when we talk about peak performance in TFLOP/s, the OP here corresponds to Operations, I guess?

If I want to estimate something like compute to global memory access (CGMA), should I use flop_count_sp instead of the inst_fp_32 for the compute part?

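I think this may be a matter of opinion. I would use instructions. As already mentioned, a fused multiply-add instruction counts as 2 operations, but it does not "double" the pressure on the floating-point unit. So when comparing codes to see the balance between global memory load/store activity and compute "pressure", I would use instructions. Again, this is probably a matter of opinion.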
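As a rough illustration of how much the choice matters, the sketch below computes a CGMA-style ratio both ways. All counts are invented, `cgma` is a hypothetical helper, and the global access count stands in for whatever load/store measure you take from the profiler.

```python
def cgma(compute_count, global_mem_accesses):
    """Ratio of compute work to global memory accesses (hypothetical helper)."""
    if global_mem_accesses == 0:
        raise ValueError("no global memory accesses recorded")
    return compute_count / global_mem_accesses


# Made-up profile numbers for illustration only.
n_add, n_mul, n_fma = 100, 50, 200
n_global_accesses = 50  # assumed count of global loads + stores

flop_count_sp = n_add + n_mul + 2 * n_fma  # operations: FMA counted twice
inst_fp_32 = n_add + n_mul + n_fma         # instructions: FMA counted once

cgma_ops = cgma(flop_count_sp, n_global_accesses)
cgma_insts = cgma(inst_fp_32, n_global_accesses)
print(cgma_ops, cgma_insts)  # 11.0 7.0
```

With these numbers the operation-based ratio is noticeably higher than the instruction-based one, purely because of the FMAs; whichever you pick, use the same one consistently when comparing kernels.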