No GPU activities in profiling with nvprof
I ran nvprof.exe on a function that initializes the data, calls three kernels, and frees the data. Everything was profiled as it should be, and I got a result like this:
==7956== Profiling application: .\a.exe
==7956== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 52.34% 25.375us 1 25.375us 25.375us 25.375us th_single_row_add(float*, float*, float*)
43.57% 21.120us 1 21.120us 21.120us 21.120us th_single_col_add(float*, float*, float*)
4.09% 1.9840us 1 1.9840us 1.9840us 1.9840us th_single_elem_add(float*, float*, float*)
API calls: 86.77% 238.31ms 9 26.479ms 14.600us 210.39ms cudaMallocManaged
12.24% 33.621ms 1 33.621ms 33.621ms 33.621ms cuDevicePrimaryCtxRelease
0.27% 730.80us 3 243.60us 242.10us 245.60us cudaLaunchKernel
0.15% 406.90us 3 135.63us 65.400us 170.80us cudaDeviceSynchronize
0.08% 229.70us 97 2.3680us 100ns 112.10us cuDeviceGetAttribute
0.08% 206.60us 1 206.60us 206.60us 206.60us cuModuleUnload
0.01% 19.700us 1 19.700us 19.700us 19.700us cuDeviceTotalMem
0.00% 6.8000us 1 6.8000us 6.8000us 6.8000us cuDeviceGetPCIBusId
0.00% 1.9000us 2 950ns 400ns 1.5000us cuDeviceGet
0.00% 1.8000us 3 600ns 400ns 800ns cuDeviceGetCount
0.00% 700ns 1 700ns 700ns 700ns cuDeviceGetName
0.00% 200ns 1 200ns 200ns 200ns cuDeviceGetUuid
0.00% 200ns 1 200ns 200ns 200ns cuDeviceGetLuid
==7956== Unified Memory profiling result:
Device "GeForce RTX 2060 SUPER (0)"
Count Avg Size Min Size Max Size Total Size Total Time Name
18 20.000KB 8.0000KB 32.000KB 360.0000KB 300.7000us Host To Device
24 20.000KB 8.0000KB 32.000KB 480.0000KB 2.647400ms Device To Host
As you can see, there are three kernels listed under GPU activities.
Here is the source code:
void add_elem(int n) {
    float *a, *b, *c1, *c2, *c3;
    cudaMallocManaged(&a, n * n * sizeof(float));
    cudaMallocManaged(&b, n * n * sizeof(float));
    cudaMallocManaged(&c1, n * n * sizeof(float));
    cudaMallocManaged(&c2, n * n * sizeof(float));
    cudaMallocManaged(&c3, n * n * sizeof(float));

    for (int i = 0; i < n*n; i++) {
        a[i] = 1.0f;
        b[i] = 2.0f;
        c1[i] = 0.0f;
        c2[i] = 0.0f;
        c3[i] = 0.0f;
    }

    int blockSize = 256;
    int numBlocks = (n*n + blockSize - 1) / blockSize;

    th_single_elem_add<<<numBlocks, blockSize>>>(a, b, c1);
    th_single_row_add<<<numBlocks, blockSize>>>(a, b, c2);
    th_single_col_add<<<numBlocks, blockSize>>>(a, b, c3);
    cudaDeviceSynchronize();

    cudaFree(a);
    cudaFree(b);
    cudaFree(c1);
    cudaFree(c2);
    cudaFree(c3);
}
After that, I extracted the data initialization, the kernel call, and the freeing of the data into separate host functions and ran nvprof again. This time I only got information about the API calls, like this:
==18460== Profiling application: .\a.exe
==18460== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
API calls: 81.86% 158.78ms 9 17.643ms 1.4000us 158.76ms cudaMallocManaged
0.17% 322.80us 97 3.3270us 100ns 158.00us cuDeviceGetAttribute
0.11% 214.50us 1 214.50us 214.50us 214.50us cuModuleUnload
0.04% 68.600us 3 22.866us 7.3000us 39.400us cudaDeviceSynchronize
0.01% 12.100us 9 1.3440us 400ns 7.9000us cudaFree
0.00% 7.7000us 1 7.7000us 7.7000us 7.7000us cuDeviceGetPCIBusId
0.00% 2.1000us 3 700ns 300ns 1.0000us cuDeviceGetCount
0.00% 2.0000us 2 1.0000us 300ns 1.7000us cuDeviceGet
0.00% 1.2000us 3 400ns 300ns 500ns cudaLaunchKernel
0.00% 700ns 1 700ns 700ns 700ns cuDeviceGetName
0.00% 300ns 1 300ns 300ns 300ns cuDeviceGetUuid
0.00% 300ns 1 300ns 300ns 300ns cuDeviceGetLuid
As you can see, there is also no Unified Memory profiling result section, so I tried running nvprof as nvprof.exe --unified-memory-profiling off .\a.exe, but got the same result.
Source code:
void add_elem(int n) {
    float *a, *b, *c1;
    cudaMallocManaged(&a, n * n * sizeof(float));
    cudaMallocManaged(&b, n * n * sizeof(float));
    cudaMallocManaged(&c1, n * n * sizeof(float));

    for (int i = 0; i < n*n; i++) {
        a[i] = 1.0f;
        b[i] = 2.0f;
        c1[i] = 0.0f;
    }

    int blockSize = 256;
    int numBlocks = (n*n + blockSize - 1) / blockSize;

    th_single_elem_add<<<numBlocks, blockSize>>>(a, b, c1);
    cudaDeviceSynchronize();

    cudaFree(a);
    cudaFree(b);
    cudaFree(c1);
}

void add_row(int n) {
    float *a, *b, *c1;
    cudaMallocManaged(&a, n * n * sizeof(float));
    cudaMallocManaged(&b, n * n * sizeof(float));
    cudaMallocManaged(&c1, n * n * sizeof(float));

    for (int i = 0; i < n*n; i++) {
        a[i] = 1.0f;
        b[i] = 2.0f;
        c1[i] = 0.0f;
    }

    int blockSize = 256;
    int numBlocks = (n + blockSize - 1) / blockSize;

    th_single_row_add<<<numBlocks, blockSize>>>(a, b, c1, n);
    cudaDeviceSynchronize();

    cudaFree(a);
    cudaFree(b);
    cudaFree(c1);
}

void add_col(int n) {
    float *a, *b, *c1;
    cudaMallocManaged(&a, n * n * sizeof(float));
    cudaMallocManaged(&b, n * n * sizeof(float));
    cudaMallocManaged(&c1, n * n * sizeof(float));

    for (int i = 0; i < n*n; i++) {
        a[i] = 1.0f;
        b[i] = 2.0f;
        c1[i] = 0.0f;
    }

    int blockSize = 256;
    int numBlocks = (n + blockSize - 1) / blockSize;

    th_single_col_add<<<numBlocks, blockSize>>>(a, b, c1, n);
    cudaDeviceSynchronize();

    cudaFree(a);
    cudaFree(b);
    cudaFree(c1);
}
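The kernel bodies are omitted from this post. Purely for illustration, definitions consistent with the launch configurations above (one thread per row or per column, with n passed explicitly) might look like this sketch; the bodies are an assumption, only the names and launch signatures come from the post:

// Hypothetical sketch: one thread adds one whole row of the n x n matrices.
__global__ void th_single_row_add(float *a, float *b, float *c, int n) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= n) return;
    for (int col = 0; col < n; col++) {
        int idx = row * n + col;
        c[idx] = a[idx] + b[idx];
    }
}

// Hypothetical sketch: one thread adds one whole column.
__global__ void th_single_col_add(float *a, float *b, float *c, int n) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (col >= n) return;
    for (int row = 0; row < n; row++) {
        int idx = row * n + col;
        c[idx] = a[idx] + b[idx];
    }
}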
Update:
I found the problem: I was running the code with 10000000000 elements in the arrays, and it seems the kernels were never even launched. When I ran them with 10000000 (10^8) elements it took almost 3 seconds to finish, whereas with 10000000000 (10^10) elements it finished immediately, without any error.
How should I catch this situation?
The reason is that the kernels were being launched with an unsupported <<<numBlocks, blockSize>>> configuration.
After adding gpuErrchk( cudaPeekAtLastError() ); after each kernel call, I got GPUassert: invalid configuration argument, which means that the numBlocks or blockSize argument is not supported by my GPU. Without the error check, the program simply ended silently.
As Robert Crovella suggested in the comments, here is the link for proper error handling:
proper CUDA error checking
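For reference, the macro used above follows the pattern from that answer; a minimal sketch of the gpuErrchk / gpuAssert pair:

#include <cstdio>
#include <cstdlib>

// Report the file and line where a CUDA call failed, then abort.
#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort = true) {
    if (code != cudaSuccess) {
        fprintf(stderr, "GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
        if (abort) exit(code);
    }
}

Usage after each launch: check the launch itself, then check for asynchronous errors at the synchronization point:

th_single_elem_add<<<numBlocks, blockSize>>>(a, b, c1);
gpuErrchk( cudaPeekAtLastError() );
gpuErrchk( cudaDeviceSynchronize() );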
Also, running cuda-memcheck helps.
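To catch an oversized launch up front, the computed configuration can also be compared against the device limits before the kernel call. A rough sketch of that idea (my own addition, not part of the original code):

#include <cstdio>

// Validate a 1-D launch configuration against the device limits before launching.
// A non-positive numBlocks (for example from integer overflow of the element count)
// is rejected as well, since it also triggers "invalid configuration argument".
bool launch_config_ok(long long numBlocks, int blockSize) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    if (blockSize <= 0 || blockSize > prop.maxThreadsPerBlock) {
        fprintf(stderr, "blockSize %d outside [1, %d]\n", blockSize, prop.maxThreadsPerBlock);
        return false;
    }
    if (numBlocks <= 0 || numBlocks > prop.maxGridSize[0]) {
        fprintf(stderr, "numBlocks %lld outside [1, %d]\n", numBlocks, prop.maxGridSize[0]);
        return false;
    }
    return true;
}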