MPI_Dims_create throws error on remote machine
I am trying to set up a Cartesian grid as the first step of an algorithm. On my local machine (OS X with clang) my code runs; on the research cluster (Linux with GNU) I get the error below.
$mpirun -n 4 ./test.exe
[shas0137:67495] *** An error occurred in MPI_Dims_create
[shas0137:67495] *** reported by process [3694788609,1]
[shas0137:67495] *** on communicator MPI_COMM_WORLD
[shas0137:67495] *** MPI_ERR_DIMS: invalid topology dimension
[shas0137:67495] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[shas0137:67495] *** and potentially your MPI job)
[shas0137:67491] 3 more processes have sent help message help-mpi-errors.txt / mpi_errors_are_fatal
[shas0137:67491] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
Setting the "orte_base_help_aggregate" parameter just repeats the error message 4 times (once per process).
Most MPI routines have a list of error classes in their documentation, but MPI_ERR_DIMS is not listed for MPI_Dims_create.
Version details: local machine
$ mpic++ -v
Apple LLVM version 9.0.0 (clang-900.0.37)
Target: x86_64-apple-darwin16.7.0
Thread model: posix
InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin
Version details: research cluster
$ mpic++ -v
Using built-in specs.
COLLECT_GCC=/curc/sw/gcc/6.1.0/bin/g++
COLLECT_LTO_WRAPPER=/curc/sw/gcc/6.1.0/libexec/gcc/x86_64-pc-linux-gnu/6.1.0/lto-wrapper
Target: x86_64-pc-linux-gnu
Configured with: ../gcc-6.1.0/configure --prefix=/curc/sw/gcc/6.1.0 --enable-languages=c,c++,fortran,go --disable-multilib --with-tune=intel
Thread model: posix
gcc version 6.1.0 (GCC)
MCVE
(Or as close to verifiable as I can get, since this only happens on certain configurations.)
#include <mpi.h>
#include <iostream>

int main(int argc, char* argv[]) {
    MPI_Init(&argc, &argv);

    MPI_Status st;
    MPI_Comm comm = MPI_COMM_WORLD;
    MPI_Comm grid_comm;

    int size;
    int err = MPI_Comm_size(comm, &size);
    //Error handling
    if (err != MPI_SUCCESS) {
        return 1;
    }
    std::cout << size << std::endl;

    //This call throws the error
    int dims[3];
    err = MPI_Dims_create(size, 3, dims);
    //Error handling
    if (err != MPI_SUCCESS) {
        return 2;
    }

    MPI_Finalize();
    return 0;
}
TLDR
I declared but did not initialize the dims array. On my local machine I happened to get a clean block of memory.
Why did this matter?
I did not realize it mattered, partly because I did not realize that dims is both an input and an output parameter. I assumed that, as an output parameter, any values in dims would be overwritten by the MPI routine.
The original documentation I consulted (from MPICH) was fairly terse.
MPI_Dims_create
Input/Output Parameters
dims
integer array of size ndims specifying the number of nodes in
each dimension. A value of 0 indicates that MPI_Dims_create should
fill in a suitable value. (emphasis mine)
That is, nonzero values are not overwritten by the routine.
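To make that input/output behavior concrete, here is a minimal sketch of my own (not from the documentation; the 12-node count and the exact factorizations shown in the comments are illustrative):

#include <mpi.h>
#include <iostream>

int main(int argc, char* argv[]) {
    MPI_Init(&argc, &argv);

    // All zeros: MPI_Dims_create is free to choose every dimension.
    // For 12 nodes a balanced result such as {3, 2, 2} is typical.
    int dims_a[3] = {0, 0, 0};
    MPI_Dims_create(12, 3, dims_a);

    // A positive entry is a constraint: it is left untouched, and only
    // the zero entries are filled in, e.g. {2, 3, 2} for 12 nodes.
    int dims_b[3] = {2, 0, 0};
    MPI_Dims_create(12, 3, dims_b);

    std::cout << dims_a[0] << " " << dims_a[1] << " " << dims_a[2] << "\n"
              << dims_b[0] << " " << dims_b[1] << " " << dims_b[2] << std::endl;

    MPI_Finalize();
    return 0;
}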
More detail, from the MPI Forum MPI documentation:
6.5.2. Cartesian Convenience Function: MPI_DIMS_CREATE
(...)
The caller may further constrain the operation of this routine by
specifying elements of array dims. If dims[i] is set to a positive
number, the routine will not modify the number of nodes in dimension
i; only those entries where dims[i] = 0 are modified by the call.
Negative input values of dims[i] are erroneous. (...)
In short, on the research cluster the values in the uninitialized dims array happened to be either negative or larger than the number of nodes I was trying to allocate.
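So the fix for the MCVE above is simply to zero-initialize the array before the call, telling MPI_Dims_create to choose every dimension itself (a sketch of the corrected lines):

//Zero means "please fill in a suitable value for this dimension"
int dims[3] = {0, 0, 0};
err = MPI_Dims_create(size, 3, dims);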