How to determine if my GPU does 16/32/64 bit arithmetic operations?

Question

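I am trying to find the throughput of native arithmetic operations on my Nvidia card. On this page, Nvidia has documented the throughput values for various arithmetic operations. The problem is: how do I determine if my card does 16 or 32 or 64 bit operations, since the values are different for each? Further, I also want to calculate the latency values of these instructions for my card. Is there some way to do it? As far as my research goes, they are not documented like throughput. Is there some benchmark suite for this purpose?

Thanks!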

Answer 1

how do I determine if my card does 16 or 32 or 64 bit operations, since the values are different for each?

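On the page you linked, the compute capabilities are listed across the top of the table, one per column. Your GPU has a compute capability. You can use the deviceQuery CUDA sample app to figure out what it is, or look it up here.

For example, suppose I have a GTX 1060 GPU. If you run deviceQuery on it, it will report a compute capability major version of 6 and a minor version of 1, so it is a compute capability 6.1 GPU. You can also see that here.

Now, going back to the table you linked, that means the column labeled 6.1 is the one of interest. It looks like this: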

                                            Compute Capability
                                                    6.1 
16-bit floating-point add, multiply, multiply-add   2     ops/SM/clock
32-bit floating-point add, multiply, multiply-add   128   ops/SM/clock
64-bit floating-point add, multiply, multiply-add   4     ops/SM/clock
...

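This means a GTX 1060 is capable of all 3 types of operations (floating-point add, multiply, or multiply-add) at all 3 precisions (16-bit, 32-bit, 64-bit), but at a different rate, or throughput, for each precision. With respect to the table, these numbers are per clock and per SM.

To determine the aggregate peak theoretical throughput for the whole GPU, we must multiply the above numbers by the clock rate of the GPU and by the number of SMs (streaming multiprocessors) in the GPU. The CUDA deviceQuery app can also tell you this information, or you can look it up online.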

Further, I also want to calculate the latency values of these instructions for my card. Is there some way to do it? As far as my research goes, they are not documented like throughput.

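As I already mentioned on your previous question, these latency values are not published or specified. In fact they may (and do) change from GPU to GPU and from one instruction type to another (e.g. floating-point multiply and floating-point add may have different latencies). For certain operation types that are emulated via a sequence of multiple SASS instructions, they may even change from CUDA version to CUDA version.

To discover this latency data, it is necessary to do some form of microbenchmarking. An early and oft-cited paper demonstrating how this may be done for CUDA GPUs is here. There is no single canonical reference for latency microbenchmark data for GPUs, nor is there a canonical reference for the benchmark programs to do it. It is a fairly difficult undertaking.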

Is there some benchmark suite for this purpose?

This sort of question is explicitly off-topic for SO. Please read here, where it states that:

"Questions asking us to recommend or find a book, tool, software library, tutorial or other off-site resource are off-topic for Stack Overflow..."


c++

cuda

latency

nvidia