PGI OpenACC throwing segmentation fault on copyin and copyout

I have narrowed the segmentation fault my code is throwing down to

#pragma acc data copyout(result_mat[0:MAT1_X][0:MAT2_Y]), copyin(mat1[0:MAT1_X][0:MAT1_Y],mat2[0:MAT2_X][0:MAT2_Y])

in the following code:

// https://github.com/wrembish/MatMul_Parallel.git
#include <iostream>

#include <omp.h>
#include <cstdlib>
#include <ctime>
#include <chrono>
using namespace std;
using namespace std::chrono;

// constant variables for the desired size of matrix 1
const size_t MAT1_X = 835;
const size_t MAT1_Y = 835;

// constant variables for the desired size of matrix 2
const size_t MAT2_X = 835;
const size_t MAT2_Y = 835;

int main() 
{
    // take start time of whole program
    auto prog_start = high_resolution_clock::now();
    // seed rand for randomly filling the matrices
    srand(time(NULL));

    // define the matrices to the variables mat1 and mat2
    int mat1[MAT1_X][MAT1_Y];
    int mat2[MAT2_X][MAT2_Y];

    // define the result matrix
    int result_mat[MAT1_X][MAT2_Y];

    // zero result matrix
    #pragma acc loop
    for(int unsigned i = 0; i < MAT1_X; i++)
    {
        for(int unsigned j = 0; j < MAT2_Y; j++)
        {
            result_mat[i][j] = 0;
        }
    }

    // fill in mat1 with random positive integers <= 100
    #pragma acc loop
    for(int unsigned i = 0; i < MAT1_X; i++)
    {
        for(int unsigned j = 0; j < MAT1_Y; j++)
        {
            mat1[i][j] = (rand() % 100) + 1;
        }
    }

    // fill in mat2 with random positive integers <= 100
    #pragma acc loop    
    for(int unsigned i = 0; i < MAT2_X; i++)
    {
        for(int unsigned j = 0; j < MAT2_Y; j++)
        {
            mat2[i][j] = (rand() % 100) + 1;
        }
    }

    // if the matrices can be multiplied, do it
    if(MAT1_Y == MAT2_X)
    {
        //#pragma omp parallel for ordered schedule(auto) collapse(3)
        #pragma acc data copyout(result_mat[0:MAT1_X][0:MAT2_Y]), copyin(mat1[0:MAT1_X][0:MAT1_Y],mat2[0:MAT2_X][0:MAT2_Y])
        #pragma kernels
        for(int unsigned i = 0; i < MAT1_X; i++)
        {
            //#pragma acc loop
            for(int unsigned j = 0; j < MAT2_Y; j++)
            {
                //#pragma acc loop seq
                for(int unsigned k = 0; k < MAT1_Y; k++)
                {
                    result_mat[i][j] += mat1[i][k] * mat2[k][j];
                }
            }
        }

    } else
    {
        cout << "the dimensions of the two matrices don't allow multiplication" << endl;
    }

    // take end time of whole program
    auto prog_stop = high_resolution_clock::now();

    // get the difference in time between program start and finish
    auto prog_duration = duration_cast<microseconds>(prog_stop - prog_start);
    cout << "time taken(program): " << prog_duration.count() << " microseconds." << endl;
}

I am using pgi/19.4 on my class's VM, and I am compiling and running the code with
pgc++ -ta=tesla -Minfo=accel matmul_acc.cpp
srun -p cisc372 --gres=gpu:1 ./a.out

and I receive the following message

srun: error: beowulf: task 0: Segmentation fault (core dumped)

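I am new to OpenACC and PGI, and I have spent the past 3 hours searching the internet for a fix. If anyone knows what is wrong with my code, I would greatly appreciate any advice or fixes. Sorry if a similar question already exists, but I couldn't find one that matched my problem.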

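It looks to me like your segfault is caused by the stack variables you're using being too large. Reducing their size made the segfault go away for me (I tried 256), but the real solution is to allocate them dynamically. The indexing gets a bit more complicated, but you will be able to run larger matrices. There are a few other OpenACC issues in the code that you'll want to address next: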

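1) The #pragma acc loop directives on the initialization loops don't do anything. They need to be #pragma acc parallel loop or #pragma acc kernels loop before the loops will be parallelized. Since your data region comes after these loops, you probably just want to remove the loop pragmas entirely.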

2) #pragma kernels doesn't do anything either; it needs to be #pragma acc kernels. If I make that change, then it builds for the GPU.
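
To make the dynamic-allocation suggestion concrete, here is a minimal sketch, not a drop-in replacement for your program. For scale: assuming a 4-byte int, three 835 x 835 arrays need roughly 8.4 MB, which is right around the 8 MiB default stack limit common on Linux, whereas three 256 x 256 arrays need under 1 MB. The function name matmul_flat and its parameters n, m and p are my own hypothetical names, and storing each matrix row-major in a flat std::vector is just one possible layout.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the original post): multiplies an
// n x m matrix `a` by an m x p matrix `b`, both stored row-major in
// flat heap buffers, and returns the n x p product.
std::vector<int> matmul_flat(const std::vector<int>& a,
                             const std::vector<int>& b,
                             std::size_t n, std::size_t m, std::size_t p)
{
    std::vector<int> c(n * p, 0);   // heap storage instead of the stack

    // Raw pointers so each data clause can describe its buffer as a
    // simple 1-D array section [start:length].
    const int* pa = a.data();
    const int* pb = b.data();
    int* pc = c.data();

    #pragma acc data copyin(pa[0:n*m], pb[0:m*p]) copyout(pc[0:n*p])
    {
        // Note the `acc` sentinel: without it the directive is ignored.
        #pragma acc parallel loop collapse(2)
        for (std::size_t i = 0; i < n; i++) {
            for (std::size_t j = 0; j < p; j++) {
                int sum = 0;   // accumulate locally, write the result once
                #pragma acc loop seq
                for (std::size_t k = 0; k < m; k++) {
                    // Element (i, k) of a lives at i * m + k, and so on.
                    sum += pa[i * m + k] * pb[k * p + j];
                }
                pc[i * p + j] = sum;
            }
        }
    }
    return c;
}
```

You would fill two std::vector<int> buffers on the host (calling rand() there, as before), pass them in with MAT1_X, MAT1_Y and MAT2_Y, and build with the same pgc++ -ta=tesla -Minfo=accel line. Accumulating into a local sum also avoids reading result elements on the device before they have been written, which matters because copyout does not transfer the host's zeros to the device. Without an OpenACC compiler the pragmas are ignored and the function runs serially.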