OpenMPI Segmentation fault: address not mapped
While developing a program based on OpenMPI, I sometimes run into segmentation faults:
[11655] *** Process received signal ***
[11655] Signal: Segmentation fault (11)
[11655] Signal code: Address not mapped (1)
[11655] Failing at address: 0x10
[11655] [ 0] /usr/lib/libpthread.so.0(+0x11940)[0x7fe42b159940]
[11655] [ 1] /usr/lib/openmpi/openmpi/mca_btl_vader.so(mca_btl_vader_alloc+0xde)[0x7fe41e94717e]
[11655] [ 2] /usr/lib/openmpi/openmpi/mca_btl_vader.so(mca_btl_vader_sendi+0x22d)[0x7fe41e949c5d]
[11655] [ 3] /usr/lib/openmpi/openmpi/mca_pml_ob1.so(+0x806f)[0x7fe41e30806f]
[11655] [ 4] /usr/lib/openmpi/openmpi/mca_pml_ob1.so(mca_pml_ob1_send+0x3d9)[0x7fe41e308f29]
[11655] [ 5] /usr/lib/openmpi/libmpi.so.12(MPI_Send+0x11c)[0x7fe42b3df1cc]
[11655] [ 6] project[0x400e41]
[11655] [ 7] project[0x401429]
[11655] [ 8] project[0x400cdc]
[11655] [ 9] /usr/lib/libc.so.6(__libc_start_main+0xea)[0x7fe42adc343a]
[11655] [10] project[0x400b3a]
[11655] *** End of error message ***
[11670] *** Process received signal ***
[11670] Signal: Segmentation fault (11)
[11670] Signal code: Address not mapped (1)
[11670] Failing at address: 0x1ede1f0
[11670] [ 0] /usr/lib/libpthread.so.0(+0x11940)[0x7fc5f8c13940]
[11670] [ 1] /usr/lib/openmpi/openmpi/mca_btl_vader.so(mca_btl_vader_poll_handle_frag+0x14c)[0x7fc5ec458aac]
[11670] [ 2] /usr/lib/openmpi/openmpi/mca_btl_vader.so(+0x3c9e)[0x7fc5ec458c9e]
[11670] [ 3] /usr/lib/openmpi/libopen-pal.so.13(opal_progress+0x4a)[0x7fc5f836814a]
[11670] [ 4] /usr/lib/openmpi/openmpi/mca_pml_ob1.so(mca_pml_ob1_recv+0x255)[0x7fc5ebe171c5]
[11670] [ 5] /usr/lib/openmpi/libmpi.so.12(MPI_Recv+0x190)[0x7fc5f8e917d0]
[11670] [ 6] project[0x400d94]
[11670] [ 7] project[0x400e8a]
[11670] [ 8] /usr/lib/libpthread.so.0(+0x7297)[0x7fc5f8c09297]
[11670] [ 9] /usr/lib/libc.so.6(clone+0x3f)[0x7fc5f894a25f]
Judging from these messages, I suppose there is some error in my use of MPI_Send and (the corresponding?) MPI_Recv. I use wrappers like these:
/* l_clock (the local Lamport clock) and rank are globals defined elsewhere. */
void mpi_send(int *buf, int to, int tag) {
    int msg[2];
    msg[0] = l_clock++;  /* piggyback the clock on every message */
    msg[1] = *buf;
    MPI_Send(msg, 2, MPI_INT, to, tag, MPI_COMM_WORLD);
}

int mpi_rcv(int *buf, int source, int tag, MPI_Status *status) {
    int msg[2];
    MPI_Recv(msg, 2, MPI_INT, source, tag, MPI_COMM_WORLD, status);
    int r_clock = msg[0];
    *buf = msg[1];
    if (r_clock > l_clock) {
        l_clock = r_clock + 1;
        return 1;
    }
    if (r_clock == l_clock) {
        /* equal clocks: break the tie by rank */
        return rank < status->MPI_SOURCE;
    }
    return 0;
}
The full code is hosted here.
I cannot see the mistake I am making. Any help would be greatly appreciated.
Edit:
I have now noticed that the segfaults sometimes mention MPI_Barrier instead. That makes no sense to me at all. Does it mean my OpenMPI implementation is broken? I am running Manjaro Linux with openmpi installed from the arm extra repository.
There is a second thread in the stack trace, which hints at MPI being used in a threaded program. A quick look at your full code confirms this. To use MPI in such cases, it must be initialised properly by calling MPI_Init_thread() instead of MPI_Init(). If you want to make multiple MPI calls simultaneously from different threads, the thread level passed to MPI_Init_thread should be MPI_THREAD_MULTIPLE:
int provided;
MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
if (provided < MPI_THREAD_MULTIPLE) {
    // Error - MPI does not provide the needed threading level
    MPI_Abort(MPI_COMM_WORLD, 1);
}
Any thread level lower than MPI_THREAD_MULTIPLE (as returned in provided) will not work in your case; see the sketch below.
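For illustration, here is a minimal self-contained sketch of the pattern that requires MPI_THREAD_MULTIPLE: a second thread sending while the main thread receives. It is not taken from your project; the sender function and the ring-neighbour peer choice are made up for the example.

/* Build: mpicc demo.c -pthread -o demo    Run: mpirun -n 2 ./demo */
#include <mpi.h>
#include <pthread.h>
#include <stdio.h>

static void *sender(void *arg) {
    int peer = *(int *)arg;
    int msg = 42;
    /* Runs concurrently with the main thread's MPI_Recv -
     * only legal when MPI_THREAD_MULTIPLE was granted. */
    MPI_Send(&msg, 1, MPI_INT, peer, 0, MPI_COMM_WORLD);
    return NULL;
}

int main(int argc, char **argv) {
    int provided, rank, size;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE)
        MPI_Abort(MPI_COMM_WORLD, 1);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    int peer = (rank + 1) % size;  /* send to the next rank in a ring */
    pthread_t t;
    pthread_create(&t, NULL, sender, &peer);
    int msg;
    MPI_Recv(&msg, 1, MPI_INT, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    printf("rank %d got %d\n", rank, msg);
    pthread_join(t, NULL);
    MPI_Finalize();
    return 0;
}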
Support for MPI_THREAD_MULTIPLE is a build-time option in Open MPI. Check whether the Manjaro package was compiled with it. The one in Arch Linux was not:
$ ompi_info
...
Thread support: posix (MPI_THREAD_MULTIPLE: no, OPAL support: yes,
^^^^^^^^^^^^^^^^^^^^^^^
OMPI progress: no, ORTE progress: yes, Event lib:
yes)
...
You may have to build Open MPI from source with support for MPI_THREAD_MULTIPLE enabled.
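If it comes to that, the configure flag for the Open MPI series shown in your trace (1.10/2.x; newer 3.0+ releases build this support in by default) is --enable-mpi-thread-multiple, e.g.:
$ ./configure --enable-mpi-thread-multiple ...
$ make && make install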