Why am I receiving a fatal error using MPI barriers [c++]
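I am new to MPI and am getting a fatal error when trying to use barriers. I have a simple for loop that assigns indices to each process in round-robin fashion, immediately followed by an MPI barrier: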
mpi.cc
#include <iostream>
#include <mpi.h>
#include <vector>
#include <sstream>

int main() {
    int name_len, rank, comm_size;
    char processor_name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(NULL, NULL);
    MPI_Get_processor_name(processor_name, &name_len);
    MPI_Comm comm = MPI_COMM_WORLD;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &comm_size);

    // Buffer the output so each rank prints it in one piece.
    std::stringstream ss;
    ss << "hello from: " << processor_name << " " << "Rank: " << rank << " Comm size: " << comm_size << "\n";

    // Round-robin: this rank keeps index i only when i % comm_size == rank.
    for (int i = 0; i < 20; i++) {
        if (i % comm_size != rank) continue;
        ss << " " << i << "\n";
    }

    MPI_Barrier(comm); // Fails here
    std::cout << ss.str();
    MPI_Finalize();
}
I compile with:
mpicxx mpi.cc -o mpi
Then I run it on my 2-node cluster using:
mpirun -ppn 1 --hosts node1,node2 ./mpi
I receive the following error:
Fatal error in PMPI_Barrier: Unknown error class, error stack:
PMPI_Barrier(414).....................: MPI_Barrier(MPI_COMM_WORLD) failed
MPIR_Barrier_impl(321)................: Failure during collective
MPIR_Barrier_impl(316)................:
MPIR_Barrier(281).....................:
MPIR_Barrier_intra(162)...............:
MPIDU_Complete_posted_with_error(1137): Process failed
Fatal error in PMPI_Barrier: Unknown error class, error stack:
PMPI_Barrier(414).....................: MPI_Barrier(MPI_COMM_WORLD) failed
MPIR_Barrier_impl(321)................: Failure during collective
MPIR_Barrier_impl(316)................:
MPIR_Barrier(281).....................:
MPIR_Barrier_intra(162)...............:
MPIDU_Complete_posted_with_error(1137): Process failed
Running on one node works, but running on 2 fails. Any ideas where I might have gone wrong?
I managed to solve my problem. Instead of
mpirun -ppn 1 --hosts node1,node2 ./mpi
I passed the IP addresses of node1 and node2 explicitly, and now it runs without problems. The problem appears to have been in my /etc/hosts file:
127.0.0.1 localhost
127.0.0.1 node1
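Because node1 was mapped to 127.0.0.1, the host seems to have been trying to reach localhost rather than node1. More information here.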