Excessive profiler overhead with NVidia Visual Profiler

Question

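When I try to profile my code with nvvp (or nvprof), the profiling overhead is very large:

The total time is 98 ms, of which 85 ms is reported as "Instrumentation" at the first kernel launch.

How can I reduce this profiling overhead, or otherwise zoom in on the part I am interested in?

Background

I run with "Start execution with profiling enabled" unchecked, and I have limited profiling with cudaProfilerStart/cudaProfilerStop as follows: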

/* --- generate data etc --- */
// Call the function once to warm up the FFT plan cache
applyConvolution( T, N, stride, plans, yData, phiW, fData, y_dwt );
gpuErrchk( cudaDeviceSynchronize() );
// Call it once for profiling
cudaProfilerStart();
applyConvolution( T, N, stride, plans, yData, phiW, fData, y_dwt );
gpuErrchk( cudaDeviceSynchronize() );
cudaProfilerStop();

where applyConvolution() is the function I am profiling.

I am using CUDA Toolkit 8.0 on Ubuntu 16.04 with a GTX 1080.

Answer 1

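While writing this question, I thought I would try fiddling with the profiler settings to pre-empt some of the likely comments and answers.

To my surprise, disabling "Enable concurrent kernel profiling" eliminated the profiler overhead completely.

But perhaps this should not have been so surprising: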

Enable concurrent kernel profiling - This option should be selected for an application that uses CUDA streams to launch kernels that can execute concurrently. If the application uses only a single stream (and therefore cannot have concurrent kernel execution), deselecting this option may decrease profiling overhead.

(Taken from http://docs.nvidia.com/cuda/profiler-users-guide/)

An earlier version of the CUDA Profiler User's Guide also noted in its "Profiling Limitations" section:

Concurrent kernel mode can add significant overhead if used on kernels that execute a large number of blocks and that have short execution durations.

Oh well. Posting this question and answer anyway in case it saves someone else the same annoyance.

Answer 2

I have seen something similar, though it may be only loosely related. Since the answer above helped, I will add my observations.

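When profiling on a Quadro GV100, the apparent performance of fairly simple kernels changed drastically compared with Pascal-generation cards (e.g. the 1080). I was also running nvvp with profiling disabled at start and activating it only around the part of the code I was interested in. Then I accidentally forgot to turn it on, so all I got was our manual event markers (using nvtxRangePush and nvtxRangePop). And what do you know, a tenfold speed-up. In other words, there is substantial profiling overhead on the Quadro GV100 that was not present on earlier GPUs.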
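For readers unfamiliar with such manual markers, here is a minimal sketch of how they might be wrapped. It assumes the NVTX header nvToolsExt.h that ships with the CUDA toolkit (linked with -lnvToolsExt). ScopedRange and runMarked are hypothetical names chosen for illustration and are not taken from RELION or from the code above. The no-op fallback is also an assumption, added only so the sketch compiles on a machine without the toolkit.

```cpp
#include <cassert>

// Use the real NVTX API when the CUDA toolkit header is available
// (link with -lnvToolsExt); otherwise fall back to no-op stubs.
#if defined(__has_include)
#  if __has_include(<nvToolsExt.h>)
#    include <nvToolsExt.h>
#    define SKETCH_HAVE_NVTX 1
#  endif
#endif

#ifndef SKETCH_HAVE_NVTX
// Assumption: stand-ins with the same shape as the NVTX calls; they do nothing.
inline int nvtxRangePush(const char*) { return 0; }
inline int nvtxRangePop() { return 0; }
#endif

// Hypothetical RAII helper: pushes a named range on construction and pops it
// on destruction, so every push is matched even on an early return.
// The counter is plain bookkeeping for this sketch and is not thread-safe.
class ScopedRange {
public:
    explicit ScopedRange(const char* name) { nvtxRangePush(name); ++open_; }
    ~ScopedRange() { assert(open_ > 0); nvtxRangePop(); --open_; }
    ScopedRange(const ScopedRange&) = delete;
    ScopedRange& operator=(const ScopedRange&) = delete;

    // Number of ranges currently open through this helper.
    static int open() { return open_; }

private:
    static int open_;
};
int ScopedRange::open_ = 0;

// Hypothetical convenience wrapper: runs `work` inside a named range and
// returns how many ranges were open while it ran.
template <typename F>
int runMarked(const char* name, F&& work) {
    ScopedRange range(name);
    work();
    return ScopedRange::open();  // evaluated before `range` is destroyed
}
```

In code like the question's, the callable passed to runMarked would presumably contain both the applyConvolution(...) call and the following cudaDeviceSynchronize(), so that the range spans the kernels' execution rather than only their asynchronous launch.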

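Disabling concurrent kernel profiling as you did made no difference, but disabling API tracing did help.

There is still considerable overhead compared with using manual NVTX markers alone, but at least this gives some insight into kernel performance on the GV100. Larger kernels seem less affected, which is natural if the cost is a fixed overhead or related to API tracing. The remaining unknown is why API tracing is so expensive on the GV100, but I cannot speculate on that, at least not yet.

For the tests above I compiled sm-specific binaries with gcc/5.4 and cuda/9.0, and ran RELION single-threaded as a simple test case.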
