CUDA access to a matrix stored in RAM and the possibility of implementing it

Question

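I recently started working on numerical computing, solving mathematical problems numerically with C++ and OpenMP. My problem is now so large that it takes days to solve even when parallelized. I would therefore like to learn CUDA to reduce the run time, but I have some doubts.

The core of my code is the function below. Its arguments are two pointers to vectors. N_mesh_points_x,y,z are predefined integers, weights_x,y,z are column matrices, kern_1 is an exponential function, and table_kernel is a function that accesses a precomputed 50 GB matrix stored in RAM.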

void Kernel::paralel_iterate(std::vector<double>* K1, std::vector<double>* K2)
{
  double sum_1 = 0, sum_2 = 0;
  double phir;

  // Outer loops: one target point (l, m, p) per iteration.
  for (int l = 0; l < N_mesh_points_x; l++) {
    for (int m = 0; m < N_mesh_points_y; m++) {
      for (int p = 0; p < N_mesh_points_z; p++) {
        sum_1 = 0;
        sum_2 = 0;

        // Inner loops: sum over every source point (i, j, k) except the target itself.
        #pragma omp parallel for schedule(dynamic) private(phir) reduction(+: sum_1,sum_2)
        for (int i = 0; i < N_mesh_points_x; i++) {
          for (int j = 0; j < N_mesh_points_y; j++) {
            for (int k = 0; k < N_mesh_points_z; k++) {
              if (!(i==l) || !(j==m) || !(k==p)) {
                phir = weights_x[i]*weights_y[j]*weights_z[k]*kern_1(i,j,k,l,m,p);
                sum_1 += phir * (*K1)[position(i,j,k)];
                sum_2 += phir;
              }
            }
          }
        }

        // table_kernel is read only here, outside the OpenMP region.
        (*K2)[position(l,m,p)] = sum_1 + (table_kernel[position(l,m,p)] - sum_2) * (*K1)[position(l,m,p)];
      }
    }
  }

  return;
}

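My questions are:

Can I program at least the central part of this function in CUDA? I parallelized only the inner loops with OpenMP, because parallelizing all the loops gave the wrong answer.
The function table_kernel accesses a big matrix that is too large to fit in my video card's memory, so the data will stay in RAM. Is this a problem? Can CUDA easily access data in RAM, or is that impossible, so that everything needs to be stored on the video card?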

Answer 1

Can I program at least the central part of this function in CUDA? I parallelized only the inner loops with OpenMP, because parallelizing all the loops gave the wrong answer.

Yes, you should be able to program the portion that you currently have in the OpenMP scope as a CUDA kernel.
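As a rough illustration of what that could look like, here is a minimal sketch of the inner triple loop as a CUDA reduction kernel for a single target point (l, m, p). Several things in it are assumptions rather than anything stated in the question: kern_1_dev is a hypothetical stand-in for kern_1 (which is only described as an exponential), position(i,j,k) is assumed to be the row-major index (i*Ny + j)*Nz + k, and the names inner_sums and inner_sums_host are invented for this example. Error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical stand-in for kern_1; the question only says it is an exponential.
__device__ double kern_1_dev(int i, int j, int k, int l, int m, int p)
{
    const double dx = i - l, dy = j - m, dz = k - p;
    return exp(-(dx * dx + dy * dy + dz * dz));
}

// Writes one partial value of sum_1 and sum_2 per block for the target (l, m, p).
// Assumes position(i,j,k) == (i*Ny + j)*Nz + k and a power-of-two blockDim.x.
__global__ void inner_sums(const double* K1, const double* wx, const double* wy,
                           const double* wz, int Nx, int Ny, int Nz,
                           int l, int m, int p, double* partial_1, double* partial_2)
{
    extern __shared__ double cache[];   // holds 2 * blockDim.x doubles
    double* s1 = cache;
    double* s2 = cache + blockDim.x;

    const long long total  = (long long)Nx * Ny * Nz;
    const long long target = ((long long)l * Ny + m) * Nz + p;
    const long long stride = (long long)gridDim.x * blockDim.x;
    double a = 0.0, b = 0.0;

    // Grid-stride loop over the flattened (i, j, k) index.
    for (long long n = (long long)blockIdx.x * blockDim.x + threadIdx.x; n < total; n += stride) {
        if (n == target) continue;      // skip the point (i,j,k) == (l,m,p)
        const int k = (int)(n % Nz);
        const int j = (int)((n / Nz) % Ny);
        const int i = (int)(n / ((long long)Nz * Ny));
        const double phir = wx[i] * wy[j] * wz[k] * kern_1_dev(i, j, k, l, m, p);
        a += phir * K1[n];
        b += phir;
    }
    s1[threadIdx.x] = a;
    s2[threadIdx.x] = b;
    __syncthreads();

    // Tree reduction in shared memory.
    for (unsigned s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) {
            s1[threadIdx.x] += s1[threadIdx.x + s];
            s2[threadIdx.x] += s2[threadIdx.x + s];
        }
        __syncthreads();
    }
    if (threadIdx.x == 0) {
        partial_1[blockIdx.x] = s1[0];
        partial_2[blockIdx.x] = s2[0];
    }
}

// Host wrapper: launches the kernel for one target point and finishes the
// reduction on the CPU. All pointer inputs are device pointers.
void inner_sums_host(const double* d_K1, const double* d_wx, const double* d_wy,
                     const double* d_wz, int Nx, int Ny, int Nz,
                     int l, int m, int p, double* sum_1, double* sum_2)
{
    const int threads = 256, blocks = 64;   // threads must be a power of two
    double *d_p1 = nullptr, *d_p2 = nullptr;
    cudaMalloc(&d_p1, blocks * sizeof(double));
    cudaMalloc(&d_p2, blocks * sizeof(double));
    inner_sums<<<blocks, threads, 2 * threads * sizeof(double)>>>(
        d_K1, d_wx, d_wy, d_wz, Nx, Ny, Nz, l, m, p, d_p1, d_p2);
    std::vector<double> p1(blocks), p2(blocks);
    cudaMemcpy(p1.data(), d_p1, blocks * sizeof(double), cudaMemcpyDeviceToHost);
    cudaMemcpy(p2.data(), d_p2, blocks * sizeof(double), cudaMemcpyDeviceToHost);
    cudaFree(d_p1);
    cudaFree(d_p2);
    *sum_1 = 0.0;
    *sum_2 = 0.0;
    for (int b = 0; b < blocks; ++b) { *sum_1 += p1[b]; *sum_2 += p2[b]; }
}
```

With this arrangement the host would still run the outer l, m, p loops and apply the final update K2[position(l,m,p)] = sum_1 + (table_kernel[position(l,m,p)] - sum_2) * K1[position(l,m,p)] itself, so table_kernel never has to leave RAM. Allocating the partial-sum buffers on every call is wasteful; a real implementation would probably allocate them once and reuse them.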

The function table_kernel accesses a big matrix that is too large to fit in my video card's memory, so the data will stay in RAM. Is this a problem? Can CUDA easily access data in RAM, or is that impossible, so that everything needs to be stored on the video card?

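Since you access it only outside the OpenMP scope, if you use a CUDA kernel only for the work that you currently do with OpenMP, you will not need to access table_kernel from the GPU, and so this should not be a problem. If you try to add further loops to be parallelized on the GPU, it may become an issue. Since the access is relatively infrequent (compared with the processing in the inner loops), if you want to pursue this you could try making the table_kernel data available to the GPU via cudaHostAlloc, which basically maps host memory into the GPU address space. This is normally a significant performance hazard, but if the data is accessed as infrequently as described, it may or may not be a serious performance issue.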
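To make the cudaHostAlloc suggestion more concrete, here is a minimal sketch of mapped (zero-copy) host memory. The kernel mirrors the final update line from the question; the names combine, alloc_mapped and mapped_demo are invented for this example, and for brevity the demo maps every array, whereas in the real problem only the table_kernel data would need to stay in host memory. It assumes a system where mapped host memory is enabled; on some older setups cudaSetDeviceFlags(cudaDeviceMapHost) may have to be called before any other CUDA call. Pinning tens of gigabytes of host memory may also fail or put pressure on the operating system, so that part would need testing.

```cuda
#include <cuda_runtime.h>
#include <cstddef>

// One thread per entry: K2 = sum_1 + (table - sum_2) * K1.
// 'table' may point at mapped host memory, in which case each read travels over the bus.
__global__ void combine(const double* table, const double* sum_1, const double* sum_2,
                        const double* K1, double* K2, size_t n)
{
    const size_t idx = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        K2[idx] = sum_1[idx] + (table[idx] - sum_2[idx]) * K1[idx];
}

// Allocates n doubles of pinned host memory mapped into the GPU address space.
// Returns the host pointer (nullptr on failure); *d_ptr receives the device alias.
double* alloc_mapped(size_t n, double** d_ptr)
{
    double* h_ptr = nullptr;
    if (cudaHostAlloc((void**)&h_ptr, n * sizeof(double), cudaHostAllocMapped) != cudaSuccess)
        return nullptr;
    if (cudaHostGetDevicePointer((void**)d_ptr, h_ptr, 0) != cudaSuccess) {
        cudaFreeHost(h_ptr);
        return nullptr;
    }
    return h_ptr;
}

// Demo with made-up data: returns K2[n-1], or -1.0 if an allocation failed.
double mapped_demo(size_t n)
{
    // Order: table, sum_1, sum_2, K1, K2.
    double* h[5] = {nullptr, nullptr, nullptr, nullptr, nullptr};
    double* d[5] = {nullptr, nullptr, nullptr, nullptr, nullptr};
    bool ok = true;
    for (int a = 0; a < 5 && ok; ++a) {
        h[a] = alloc_mapped(n, &d[a]);
        ok = (h[a] != nullptr);
    }
    double last = -1.0;
    if (ok) {
        for (size_t i = 0; i < n; ++i) {
            h[0][i] = (double)i;   // table
            h[1][i] = 1.0;         // sum_1
            h[2][i] = 0.5;         // sum_2
            h[3][i] = 2.0;         // K1
            h[4][i] = 0.0;         // K2
        }
        const unsigned threads = 256;
        const unsigned blocks = (unsigned)((n + threads - 1) / threads);
        combine<<<blocks, threads>>>(d[0], d[1], d[2], d[3], d[4], n);
        cudaDeviceSynchronize();   // results are visible on the host after this
        last = h[4][n - 1];
    }
    for (int a = 0; a < 5; ++a)
        if (h[a]) cudaFreeHost(h[a]);
    return last;
}
```

No cudaMemcpy appears here because the GPU reads and writes the host allocation directly, which is also why it tends to be slow when the data is touched often.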

Note that you will not be able to use or access std::vector in device code, so those data containers would probably have to be realized as ordinary double arrays.
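One common way to handle this, sketched here under the assumption that the vectors fit in GPU memory, is to keep std::vector on the host and hand the kernel a plain double pointer obtained by copying the vector's contiguous storage to the device. The helper names to_device and to_host and the scale kernel are invented for this example.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Device code sees only a raw pointer and a length, never the std::vector itself.
__global__ void scale(double* data, size_t n, double factor)
{
    const size_t idx = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) data[idx] *= factor;
}

// Copies a host vector into a newly allocated device array (nullptr on failure).
double* to_device(const std::vector<double>& v)
{
    double* d = nullptr;
    const size_t bytes = v.size() * sizeof(double);
    if (cudaMalloc(&d, bytes) != cudaSuccess) return nullptr;
    if (cudaMemcpy(d, v.data(), bytes, cudaMemcpyHostToDevice) != cudaSuccess) {
        cudaFree(d);
        return nullptr;
    }
    return d;
}

// Copies v.size() doubles from the device array back into the host vector.
bool to_host(std::vector<double>& v, const double* d)
{
    return cudaMemcpy(v.data(), d, v.size() * sizeof(double),
                      cudaMemcpyDeviceToHost) == cudaSuccess;
}
```

In the question's function, K1 and K2 would be uploaded once with a helper like to_device before the loops and K2 copied back with to_host afterwards, rather than being transferred for every target point.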

c++

cuda

numerical-computing