nvprof not picking up any API calls or kernels
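I am trying to get some benchmark timings for my CUDA program using nvprof, but unfortunately it doesn't seem to be profiling any API calls or kernels. I looked for a simple beginner's example to make sure I was doing it right, and found one on the Nvidia dev blog here: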
https://devblogs.nvidia.com/parallelforall/how-optimize-data-transfers-cuda-cc/
Code:
int main()
{
    const unsigned int N = 1048576;
    const unsigned int bytes = N * sizeof(int);
    int *h_a = (int*)malloc(bytes);
    int *d_a;
    cudaMalloc((int**)&d_a, bytes);
    memset(h_a, 0, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(h_a, d_a, bytes, cudaMemcpyDeviceToHost);
    return 0;
}
Command line:
-bash-4.2$ nvcc profile.cu -o profile_test
-bash-4.2$ nvprof ./profile_test
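So I copied it word for word, line for line, and ran it with identical command-line arguments. Unfortunately, the result was the same as with my own program: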
-bash-4.2$ nvprof ./profile_test
==85454== NVPROF is profiling process 85454, command: ./profile_test
==85454== Profiling application: ./profile_test
==85454== Profiling result:
No kernels were profiled.
==85454== API calls:
No API activities were profiled.
I'm running version 7.5 of the Nvidia CUDA toolkit.
If anyone knows what I'm doing wrong, I'd be grateful for an answer.
-----EDIT-----
So I modified the code to:
#include <cuda_profiler_api.h>
int main()
{
    cudaProfilerStart();
    const unsigned int N = 1048576;
    const unsigned int bytes = N * sizeof(int);
    int *h_a = (int*)malloc(bytes);
    int *d_a;
    cudaMalloc((int**)&d_a, bytes);
    memset(h_a, 0, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(h_a, d_a, bytes, cudaMemcpyDeviceToHost);
    cudaProfilerStop();
    return 0;
}
Unfortunately, it didn't change anything.
You need to call cudaProfilerStop() (for the runtime API) before the thread exits. This allows nvprof to collect all the necessary data.
According to the CUDA documentation:
To avoid losing profile information that has not yet been flushed, the application being profiled should make sure, before exiting, that all GPU work is done (using CUDA synchronization calls), and then call cudaProfilerStop() or cuProfilerStop(). Doing so forces buffered profile information on corresponding context(s) to be flushed.
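As a minimal sketch of what the quoted passage describes, the program below synchronizes the device and then calls cudaProfilerStop() before exiting, checking the return code of every CUDA call along the way. The helpers ok() and roundTrip() are hypothetical names introduced only for this illustration; they are not part of the original post or of the CUDA API. Note that the asker reported that adding cudaProfilerStop() alone did not change the nvprof output on their machine, so this may not fix the symptom by itself. The error checks at least show whether the CUDA calls are succeeding at all.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>
#include <cuda_profiler_api.h>

// Hypothetical helper (not from the original post): report a failed CUDA call.
bool ok(cudaError_t err, const char *what)
{
    if (err != cudaSuccess) {
        std::fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        return false;
    }
    return true;
}

// Hypothetical helper: copy n ints host -> device -> host, then flush
// buffered profile data as the quoted documentation recommends.
bool roundTrip(size_t n)
{
    const size_t bytes = n * sizeof(int);
    int *h_a = static_cast<int *>(std::malloc(bytes));
    if (h_a == nullptr) {
        return false;
    }
    std::memset(h_a, 0, bytes);

    int *d_a = nullptr;
    bool good = ok(cudaMalloc((void **)&d_a, bytes), "cudaMalloc");
    good = good && ok(cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice),
                      "cudaMemcpy (host to device)");
    good = good && ok(cudaMemcpy(h_a, d_a, bytes, cudaMemcpyDeviceToHost),
                      "cudaMemcpy (device to host)");

    // Make sure all GPU work is done, then ask the profiler to flush.
    good = good && ok(cudaDeviceSynchronize(), "cudaDeviceSynchronize");
    good = ok(cudaProfilerStop(), "cudaProfilerStop") && good;

    cudaFree(d_a);   // harmless if d_a is still a null pointer
    std::free(h_a);
    return good;
}

int main()
{
    // Same element count as the example in the question.
    return roundTrip(1048576) ? EXIT_SUCCESS : EXIT_FAILURE;
}
```

It can be built and profiled with the same commands as in the question (nvcc profile.cu -o profile_test, then nvprof ./profile_test). If any call prints an error, the problem lies with the CUDA setup rather than with nvprof.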
This is a bug in unified memory profiling. Running nvprof with the flag
nvprof --unified-memory-profiling off ./profile_test
resolved all the problems for me.