MPI memory corruption on specific core counts only

For some background, I'm writing a basic PDE solver parallelized with MPI. The program takes a domain and assigns each processor a grid covering a portion of it. If I run with a single core or with four cores, the program runs just fine. However, if I run with two or three cores, I get a core dump like the following:

*** Error in `MeshTest': corrupted size vs. prev_size: 0x00000000018bd540 ***
======= Backtrace: =========
*** Error in `MeshTest': corrupted size vs. prev_size: 0x00000000022126e0 ***
/lib/x86_64-linux-gnu/libc.so.6(+0x777e5)[0x7fc1a63f77e5]
======= Backtrace: =========
/lib/x86_64-linux-gnu/libc.so.6(+0x80dfb)[0x7fc1a6400dfb]
/lib/x86_64-linux-gnu/libc.so.6(+0x777e5)[0x7fca753f77e5]
/lib/x86_64-linux-gnu/libc.so.6(cfree+0x4c)[0x7fc1a640453c]
/lib/x86_64-linux-gnu/libc.so.6(+0x7e9dc)[0x7fca753fe9dc]
/usr/lib/libmpi.so.12(+0x25919)[0x7fc1a6d25919]
/lib/x86_64-linux-gnu/libc.so.6(+0x80678)[0x7fca75400678]
/usr/lib/openmpi/lib/openmpi/mca_pml_ob1.so(+0x52a9)[0x7fc198fe52a9]
/lib/x86_64-linux-gnu/libc.so.6(cfree+0x4c)[0x7fca7540453c]
/usr/lib/libmpi.so.12(ompi_mpi_finalize+0x412)[0x7fc1a6d41a22]
/usr/lib/libmpi.so.12(+0x25919)[0x7fca75d25919]
MeshTest(_ZN15MPICommunicator7cleanupEv+0x26)[0x422e70]
/usr/lib/openmpi/lib/openmpi/mca_btl_tcp.so(+0x4381)[0x7fca68844381]
MeshTest(main+0x364)[0x41af2a]
/usr/lib/libopen-pal.so.13(mca_base_component_close+0x19)[0x7fca74c88fe9]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0)[0x7fc1a63a0830]
/usr/lib/libopen-pal.so.13(mca_base_components_close+0x42)[0x7fca74c89062]
MeshTest(_start+0x29)[0x41aaf9]
/usr/lib/libmpi.so.12(+0x7d3b4)[0x7fca75d7d3b4]
======= Memory map: ========
<insert core dump>

I've been able to trace the error back to the point where a new grid is created:

Result Domain::buildGrid(unsigned int shp[2], pair2<double> &bounds){
  // ... Unrelated code ...

  // grid is already allocated and needs to be cleared.
  delete grid;
  grid = new Grid(bounds, shp, nghosts);
  return SUCCESS;
}
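
As an aside on the delete/new pattern here: below is a sketch of the same function with the grid held in a smart pointer. This assumes a `std::unique_ptr<Grid> grid;` member, which is my change rather than the original layout; the previous grid is released automatically and a double delete cannot happen.

#include <memory>

// Sketch only: assumes Domain declares `std::unique_ptr<Grid> grid;`.
Result Domain::buildGrid(unsigned int shp[2], pair2<double> &bounds){
  // ... unrelated code ...

  // Replaces any existing grid; the previous Grid is destroyed automatically.
  grid = std::make_unique<Grid>(bounds, shp, nghosts);
  return SUCCESS;
}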

Grid::Grid(const pair2<double>& bounds, unsigned int sz[2], unsigned int nghosts){
  // ... Code unrelated to memory allocation ...

  // Construct the grid. Start by adding ghost points.
  shp[0] = sz[0] + 2*nghosts;
  shp[1] = sz[1] + 2*nghosts;
  try{
    points[0] = new double[shp[0]];
    points[1] = new double[shp[1]];
    for(int i = 0; i < shp[0]; i++){
      points[0][i] = grid_bounds[0][0] + (i - (int)nghosts)*dx;
    }
    for(int j = 0; j < shp[1]; j++){
      points[1][j] = grid_bounds[1][0] + (j - (int)nghosts)*dx;
    }
  }
  catch(std::bad_alloc& ba){
    std::cout << "Failed to allocate memory for grid.\n";
    shp[0] = 0;
    shp[1] = 0;
    dx = 0;
    points[0] = NULL;
    points[1] = NULL;
  }
}

Grid::~Grid(){
  delete[] points[0];
  delete[] points[1];
}
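
One incidental note while chasing heap corruption: Grid owns two raw new[] arrays and defines a destructor, so an accidental copy of a Grid object would end in a double delete[], which produces exactly this kind of glibc abort. A minimal sketch of how that can be ruled out (my addition, not code from the post; member layout inferred from the constructor above):

// Inside the Grid class definition (sketch). Deleting the copy operations
// turns any accidental copy, and the double delete[] it would cause, into
// a compile-time error.
class Grid {
public:
  Grid(const pair2<double>& bounds, unsigned int sz[2], unsigned int nghosts);
  ~Grid();
  Grid(const Grid&) = delete;
  Grid& operator=(const Grid&) = delete;
  // ... data members (points, shp, dx, grid_bounds) as implied above ...
};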

As far as I can tell, my MPI implementation is fine, and all of the MPI-dependent functionality in the Domain class seems to work correctly. I assume something is illegally accessing memory outside its bounds somewhere, but I have no idea where; at this point the code literally just initializes MPI, loads some parameters, sets up the grid (the only memory access happens during its construction), then calls MPI_Finalize() and returns.
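
For heap corruption that only surfaces later (here, inside free() during MPI_Finalize()), one general way to catch the bad write where it happens is to run the failing two-core case under a checker (e.g. `mpirun -np 2 valgrind ./MeshTest`, or a build with `-fsanitize=address`), or to temporarily swap the raw arrays for bounds-checked containers. A rough sketch of the latter, reusing names from the constructor above as assumptions; this is a debugging aid I'm suggesting, not code from the program:

#include <vector>

// Debugging stand-in for `double* points[2]`: std::vector::at() throws
// std::out_of_range at the faulty index instead of silently corrupting
// heap metadata that only trips an abort much later.
std::vector<double> points[2];

void build_axes(const unsigned int shp[2], unsigned int nghosts,
                double dx, const double origin[2]) {
  points[0].assign(shp[0], 0.0);
  points[1].assign(shp[1], 0.0);
  for (unsigned int i = 0; i < shp[0]; i++)
    points[0].at(i) = origin[0] + (static_cast<int>(i) - static_cast<int>(nghosts)) * dx;
  for (unsigned int j = 0; j < shp[1]; j++)
    // Any index past the end of the matching vector aborts right here.
    points[1].at(j) = origin[1] + (static_cast<int>(j) - static_cast<int>(nghosts)) * dx;
}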

It turns out there was a bug in my Grid constructor when allocating the points (it read points[0][j] = ... when allocating the y points), which I somehow caught and corrected while copying the code into my post but not in my actual code. The bug only shows up in the 2- and 3-core runs because the grid is perfectly square for the 1- and 4-core runs, so shp[0] is equal to shp[1]. Thanks to everyone for the tips. It's a little embarrassing, looking at it now, that it was something so simple.
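
To make the "perfectly square" point concrete: if the decomposition splits a square global grid across a 2D processor layout, 1 and 4 ranks give each rank a square local grid, while 2 and 3 ranks give rectangular ones, which is exactly when writing points[0][j] in the y loop runs past the end of the x array. The post doesn't show how the domain is split, so the sketch below assumes something like MPI_Dims_create and a 64x64 global grid.

#include <mpi.h>
#include <cstdio>

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  int nranks;
  MPI_Comm_size(MPI_COMM_WORLD, &nranks);

  // Ask MPI for a balanced 2D rank layout (hypothetical; the actual
  // decomposition in the solver isn't shown in the post).
  int dims[2] = {0, 0};
  MPI_Dims_create(nranks, 2, dims);

  // With a 64x64 global grid: 1 rank -> 64x64 and 4 ranks -> 32x32 local
  // grids (square), but 2 ranks -> 32x64 and 3 ranks -> ~21x64 (rectangular),
  // so shp[0] != shp[1] and an shp[0]-sized array indexed by the y loop
  // overflows only in the 2- and 3-core runs.
  const int N = 64;
  std::printf("ranks=%d layout=%dx%d local~%dx%d\n",
              nranks, dims[0], dims[1], N / dims[0], N / dims[1]);

  MPI_Finalize();
  return 0;
}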