Achieved Occupancy column is not shown in Nsight profiling result

Question

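I have run into a problem that seems very strange to me: I cannot see the Achieved Occupancy column in the Nsight performance analysis output. I am using a GeForce 920M GPU, NVIDIA driver version 425.31, Nsight version 6.0.0.18296, and Visual Studio 2017. This Nsight version is compatible with the driver. Can anyone help? I have no idea why this is happening.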

I ran Nsight Performance Analysis with CUDA trace checked, as follows:

I also tried the Visual Profiler, but Achieved Occupancy does not appear there either. The GPU examination also reported an error:

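Note that, as talonmies pointed out, the error above occurred because the profiler was not running in administrator mode. That error is now resolved, but Achieved Occupancy is still not shown.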

Here is my code:

#include "cuda_runtime.h"
#include "device_launch_parameters.h"

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
#include <math.h>
#include <iostream>
#define MAX_HISTORGRAM_NUMBER 10000
#define ARRAY_SIZE 102400000

#define CHUNK_SIZE 100
#define THREAD_COUNT 8
#define SCALER 80
cudaError_t histogramWithCuda(int *a, unsigned long long int *c);

__global__ void histogramKernelSingle(unsigned long long int *c, int *a)
{
    unsigned long long int worker =  blockIdx.x*blockDim.x + threadIdx.x;
    unsigned long long int start = worker * CHUNK_SIZE;
    unsigned long long int end = start + CHUNK_SIZE;
    for (int ex = 0; ex < SCALER; ex++)
        for (long long int i = start; i < end; i++)
        {
            if (i < ARRAY_SIZE)
                atomicAdd(&c[a[i]], 1);
            else
            {
                break;
            }
        }
}

int main()
{
        int* a = (int*)malloc(sizeof(int)*ARRAY_SIZE);
        unsigned long long int* c = (unsigned long long int*)malloc(sizeof(unsigned long long int)*MAX_HISTORGRAM_NUMBER);
        for (unsigned long long i = 0; i < ARRAY_SIZE;i++)
            a[i] = rand() % MAX_HISTORGRAM_NUMBER;
        for (unsigned long long i = 0; i < MAX_HISTORGRAM_NUMBER; i++)
            c[i] = 0;

    // Add vectors in parallel.
        double start_time = omp_get_wtime();
        cudaError_t cudaStatus=histogramWithCuda(a,c);
        double end_time = omp_get_wtime();
        std::cout << end_time - start_time;
   // = 
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "addWithCuda failed!");
        return 1;
    }
    
    // cudaDeviceReset must be called before exiting in order for profiling and
    // tracing tools such as Nsight and Visual Profiler to show complete traces.
    cudaStatus = cudaDeviceReset();
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaDeviceReset failed!");
        return 1;
    }
    unsigned long long int R = 0;
    for (int i = 0; i < MAX_HISTORGRAM_NUMBER; i++)
    {
        R += c[i];
        //printf("%d    ", c[i]);
    }
    // R is unsigned long long, so %llu is the matching format specifier.
    printf("\nCORRECT:%llu   ", R/(SCALER));
    return 0;
}

// Helper function for using CUDA to add vectors in parallel.
cudaError_t histogramWithCuda(int *a, unsigned long long int *c)
{
    int *dev_a = 0;
    unsigned long long int *dev_c = 0;
    cudaError_t cudaStatus;

    // Choose which GPU to run on, change this on a multi-GPU system.
    cudaStatus = cudaSetDevice(0);
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaSetDevice failed!  Do you have a CUDA-capable GPU installed?");
        goto Error;
    }

    // Allocate GPU buffers for three vectors (two input, one output)    .
    cudaStatus = cudaMalloc((void**)&dev_c, MAX_HISTORGRAM_NUMBER * sizeof(unsigned long long int));
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed!");
        goto Error;
    }

    cudaStatus = cudaMalloc((void**)&dev_a, ARRAY_SIZE * sizeof(int));
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed!");
        goto Error;
    }


    // Copy input vectors from host memory to GPU buffers.
    cudaStatus = cudaMemcpy(dev_a, a, ARRAY_SIZE * sizeof(int), cudaMemcpyHostToDevice);
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaMemcpy failed!");
        goto Error;
    }
    // Launch a kernel on the GPU with one thread for each element.
    //// BLOCK CALCULATOR HERE
    

    ////BLOCK CALCULATOR HERE
    
    histogramKernelSingle << < ARRAY_SIZE / (THREAD_COUNT*CHUNK_SIZE), THREAD_COUNT>> > (dev_c, dev_a);
    // Check for any errors launching the kernel
    cudaStatus = cudaGetLastError();
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "addKernel launch failed: %s\n", cudaGetErrorString(cudaStatus));
        goto Error;
    }
    
    // cudaDeviceSynchronize waits for the kernel to finish, and returns
    // any errors encountered during the launch.
    cudaStatus = cudaDeviceSynchronize();
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaDeviceSynchronize returned error code %d after launching addKernel!\n", cudaStatus);
        goto Error;
    }

    // Copy output vector from GPU buffer to host memory.
    cudaStatus = cudaMemcpy(c, dev_c, MAX_HISTORGRAM_NUMBER * sizeof(unsigned long long int), cudaMemcpyDeviceToHost);
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaMemcpy failed!");
        goto Error;
    }
    
Error:
    cudaFree(dev_c);
    cudaFree(dev_a);
    return cudaStatus;
}

Thanks in advance.

Answer 1

Achieved Occupancy is only captured in the Profile activity. The Trace activity does not support capturing GPU performance counters. Achieved occupancy is sm__active_warps_sum / sm__active_cycles_sum / SM__MAX_WARPS * 100.

Nsight Visual Studio Edition

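The Trace activity cannot collect Achieved Occupancy. Run the command Nsight | Start Performance Analysis... and, in the Activity window, select Profile CUDA Application (not Trace Application). The default Profile CUDA Application configuration includes the Achieved Occupancy experiment.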

NVIDIA Visual Profiler

In NVVP, make sure that you are collecting GPU performance counters. The default activity collects a timeline but does not collect GPU events.

Run | Generate Timeline will not collect Achieved Occupancy.
Run | Analyze Application will collect Achieved Occupancy.

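If you still have problems, there may be a permissions issue on your system. Try collecting another set of performance counters using Nsight Profile CUDA Application or NVVP | Collect Metrics and Events...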
