nvprof not picking up any API calls or kernels

Question

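I am trying to get some benchmark timings for my CUDA program with nvprof, but unfortunately it does not seem to profile any API calls or kernels. To make sure I was doing it right, I looked for a simple beginner example and found one on the Nvidia dev blog here: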

https://devblogs.nvidia.com/parallelforall/how-optimize-data-transfers-cuda-cc/

Code:

int main()
{
    const unsigned int N = 1048576;
    const unsigned int bytes = N * sizeof(int);
    int *h_a = (int*)malloc(bytes);
    int *d_a;
    cudaMalloc((int**)&d_a, bytes);

    memset(h_a, 0, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(h_a, d_a, bytes, cudaMemcpyDeviceToHost);

    return 0;
}

Command line:

-bash-4.2$ nvcc profile.cu -o profile_test
-bash-4.2$ nvprof ./profile_test

So I copied it word for word, line for line, and ran it with identical command line arguments. Unfortunately, I got the same result:

-bash-4.2$ nvprof ./profile_test
==85454== NVPROF is profiling process 85454, command: ./profile_test
==85454== Profiling application: ./profile_test
==85454== Profiling result:
No kernels were profiled.

==85454== API calls:
No API activities were profiled.

I am running Nvidia toolkit 7.5.

If anyone knows what I am doing wrong, I would be grateful for an answer.

-----EDIT-----

So I modified the code to:

#include<cuda_profiler_api.h>

int main()
{
    cudaProfilerStart();
    const unsigned int N = 1048576;
    const unsigned int bytes = N * sizeof(int);
    int *h_a = (int*)malloc(bytes);
    int *d_a;
    cudaMalloc((int**)&d_a, bytes);

    memset(h_a, 0, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(h_a, d_a, bytes, cudaMemcpyDeviceToHost);

    cudaProfilerStop();
    return 0;
}

Unfortunately, it did not change anything.

Answer 1

You need to call cudaProfilerStop() (for the runtime API) before the thread exits. This allows nvprof to collect all the necessary data.

According to the CUDA docs:

To avoid losing profile information that has not yet been flushed, the application being profiled should make sure, before exiting, that all GPU work is done (using CUDA synchronization calls), and then call cudaProfilerStop() or cuProfilerStop(). Doing so forces buffered profile information on corresponding context(s) to be flushed.
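As a rough sketch of what the quoted passage describes, the program from the question could synchronize first and call cudaProfilerStop() as its last CUDA call. The helper `ok` and the wrapper `run_profiled_copies` are hypothetical names introduced here for illustration, and the error checks are an addition that is not part of the original blog code. They are included only because a silently failing runtime call would also leave nvprof with little to report.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>
#include <cuda_profiler_api.h>

// Hypothetical helper (not from the question): print a message when a
// CUDA runtime call fails and tell the caller whether it succeeded.
static bool ok(cudaError_t err, const char *what)
{
    if (err != cudaSuccess) {
        fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        return false;
    }
    return true;
}

// Hypothetical wrapper around the two copies from the question.
// Returns 0 when every call succeeded, 1 otherwise.
int run_profiled_copies()
{
    const unsigned int N = 1048576;
    const unsigned int bytes = N * sizeof(int);

    int *h_a = (int*)malloc(bytes);
    if (h_a == NULL) return 1;
    memset(h_a, 0, bytes);

    int *d_a = NULL;
    int status = 1;
    if (ok(cudaMalloc((void**)&d_a, bytes), "cudaMalloc") &&
        ok(cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice), "cudaMemcpy H2D") &&
        ok(cudaMemcpy(h_a, d_a, bytes, cudaMemcpyDeviceToHost), "cudaMemcpy D2H") &&
        // Make sure all GPU work is done before the profiler is stopped.
        ok(cudaDeviceSynchronize(), "cudaDeviceSynchronize")) {
        status = 0;
    }

    cudaFree(d_a);  // freeing a null pointer is a no-op
    free(h_a);
    return status;
}

int main()
{
    int status = run_profiled_copies();

    // Last CUDA call before exit: ask the profiler to flush buffered data.
    if (!ok(cudaProfilerStop(), "cudaProfilerStop")) status = 1;
    return status;
}
```

It builds and runs the same way as in the question (nvcc profile.cu -o profile_test, then nvprof ./profile_test). If any call prints an error, the problem is probably in the CUDA setup and not in nvprof.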

Answer 2

This is a bug in unified memory profiling. Running nvprof with the flag

nvprof --unified-memory-profiling off ./profile_test

resolved all the problems for me.
