mpi4py irecv causes segmentation fault

I run the following code with the command mpirun -n 2 python -u test_irecv.py > output 2>&1.

The code sends an array from rank 0 to rank 1:
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
asyncr = 1  # 1: asynchronous isend/irecv, 0: synchronous send/recv
size_arr = 10000

if comm.Get_rank() == 0:
    arrs = np.zeros(size_arr)
    if asyncr: comm.isend(arrs, dest=1).wait()
    else: comm.send(arrs, dest=1)
else:
    if asyncr: arrv = comm.irecv(source=0).wait()
    else: arrv = comm.recv(source=0)

print('Done!', comm.Get_rank())

Running in synchronous mode with asyncr = 0 gives the expected output:

Done! 0
Done! 1

But running in asynchronous mode with asyncr = 1 gives the error below. I would like to know why it runs fine in synchronous mode but not in asynchronous mode.

Output with asyncr = 1:

Done! 0
[nia1477:420871:0:420871] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x138)
==== backtrace ====
 0 0x0000000000010e90 __funlockfile()  ???:0
 1 0x00000000000643d1 ompi_errhandler_request_invoke()  ???:0
 2 0x000000000008a8b5 __pyx_f_6mpi4py_3MPI_PyMPI_wait()  /tmp/eb-A2FAdY/pip-req-build-dvnprmat/src/mpi4py.MPI.c:49819
 3 0x000000000008a8b5 __pyx_f_6mpi4py_3MPI_PyMPI_wait()  /tmp/eb-A2FAdY/pip-req-build-dvnprmat/src/mpi4py.MPI.c:49819
 4 0x000000000008a8b5 __pyx_pf_6mpi4py_3MPI_7Request_34wait()  /tmp/eb-A2FAdY/pip-req-build-dvnprmat/src/mpi4py.MPI.c:83838
 5 0x000000000008a8b5 __pyx_pw_6mpi4py_3MPI_7Request_35wait()  /tmp/eb-A2FAdY/pip-req-build-dvnprmat/src/mpi4py.MPI.c:83813
 6 0x00000000000966a3 _PyMethodDef_RawFastCallKeywords()  /dev/shm/mboisson/avx2/Python/3.7.0/dummy-dummy/Python-3.7.0/Objects/call.c:690
 7 0x000000000009eeb9 _PyMethodDescr_FastCallKeywords()  /dev/shm/mboisson/avx2/Python/3.7.0/dummy-dummy/Python-3.7.0/Objects/descrobject.c:288
 8 0x000000000006e611 call_function()  /dev/shm/mboisson/avx2/Python/3.7.0/dummy-dummy/Python-3.7.0/Python/ceval.c:4563
 9 0x000000000006e611 _PyEval_EvalFrameDefault()  /dev/shm/mboisson/avx2/Python/3.7.0/dummy-dummy/Python-3.7.0/Python/ceval.c:3103
10 0x0000000000177644 _PyEval_EvalCodeWithName()  /dev/shm/mboisson/avx2/Python/3.7.0/dummy-dummy/Python-3.7.0/Python/ceval.c:3923
11 0x000000000017774e PyEval_EvalCodeEx()  /dev/shm/mboisson/avx2/Python/3.7.0/dummy-dummy/Python-3.7.0/Python/ceval.c:3952
12 0x000000000017777b PyEval_EvalCode()  /dev/shm/mboisson/avx2/Python/3.7.0/dummy-dummy/Python-3.7.0/Python/ceval.c:524
13 0x00000000001aab72 run_mod()  /dev/shm/mboisson/avx2/Python/3.7.0/dummy-dummy/Python-3.7.0/Python/pythonrun.c:1035
14 0x00000000001aab72 PyRun_FileExFlags()  /dev/shm/mboisson/avx2/Python/3.7.0/dummy-dummy/Python-3.7.0/Python/pythonrun.c:988
15 0x00000000001aace6 PyRun_SimpleFileExFlags()  /dev/shm/mboisson/avx2/Python/3.7.0/dummy-dummy/Python-3.7.0/Python/pythonrun.c:430
16 0x00000000001cad47 pymain_run_file()  /dev/shm/mboisson/avx2/Python/3.7.0/dummy-dummy/Python-3.7.0/Modules/main.c:425
17 0x00000000001cad47 pymain_run_filename()  /dev/shm/mboisson/avx2/Python/3.7.0/dummy-dummy/Python-3.7.0/Modules/main.c:1520
18 0x00000000001cad47 pymain_run_python()  /dev/shm/mboisson/avx2/Python/3.7.0/dummy-dummy/Python-3.7.0/Modules/main.c:2520
19 0x00000000001cad47 pymain_main()  /dev/shm/mboisson/avx2/Python/3.7.0/dummy-dummy/Python-3.7.0/Modules/main.c:2662
20 0x00000000001cb1ca _Py_UnixMain()  /dev/shm/mboisson/avx2/Python/3.7.0/dummy-dummy/Python-3.7.0/Modules/main.c:2697
21 0x00000000000202e0 __libc_start_main()  ???:0
22 0x00000000004006ba _start()  /tmp/nix-build-glibc-2.24.drv-0/glibc-2.24/csu/../sysdeps/x86_64/start.S:120
===================
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 420871 on node nia1477 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

The versions are as follows:

Running with asyncr = 1 on another system with MPICH gave the following output.

Done! 0
Traceback (most recent call last):
  File "test_irecv.py", line 14, in <module>
    if asyncr: arrv = comm.irecv(source=0).wait()
  File "mpi4py/MPI/Request.pyx", line 235, in mpi4py.MPI.Request.wait
  File "mpi4py/MPI/msgpickle.pxi", line 411, in mpi4py.MPI.PyMPI_wait
mpi4py.MPI.Exception: MPI_ERR_TRUNCATE: message truncated
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[23830,1],1]
  Exit code:    1
--------------------------------------------------------------------------
[master:01977] 1 more process has sent help message help-mpi-btl-base.txt / btl:no-nics
[master:01977] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

Apparently this is a known issue in mpi4py, as described in https://bitbucket.org/mpi4py/mpi4py/issues/65/mpi_err_truncate-message-truncated-when. Lisandro Dalcin says:

The implementation of irecv() for large messages requires users to pass a buffer-like object large enough to receive the pickled stream. This is not documented (as most of mpi4py), and even non-obvious and unpythonic...

The workaround is to pass a pre-allocated bytearray to irecv that is large enough to hold the pickled message; otherwise irecv posts the receive into a buffer that is too small, and the incoming message is truncated (the MPI_ERR_TRUNCATE above) or the process crashes. A working example follows.

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
size_arr = 10000

if comm.Get_rank() == 0:
    arrs = np.zeros(size_arr)
    comm.isend(arrs, dest=1).wait()
else:
    # Pre-allocate a 1 MiB receive buffer (1<<20 bytes), large enough
    # to hold the pickled 10000-element float64 array (about 80 kB).
    arrv = comm.irecv(bytearray(1<<20), source=0).wait()

print('Done!', comm.Get_rank())
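
The 1 MiB buffer (1<<20 bytes) above is generous rather than exact. If you want to check how much space the pickled message actually needs, you can measure it with the standard pickle module (a rough sketch; the exact overhead depends on the pickle protocol in use):

import pickle
import numpy as np

arrs = np.zeros(10000)
# 10000 float64 values are 80000 bytes of raw data; pickling adds only
# a small header, so the message fits easily in a 1 MiB buffer.
print(len(pickle.dumps(arrs)))  # roughly 80 kB

Alternatively, since the payload here is a NumPy array anyway, the uppercase buffer-based methods Isend/Irecv avoid pickling (and hence this buffer-size problem) altogether, because the receiver supplies the destination array itself. A sketch of that approach:

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
size_arr = 10000

if comm.Get_rank() == 0:
    arrs = np.zeros(size_arr)
    comm.Isend([arrs, MPI.DOUBLE], dest=1).Wait()
else:
    arrv = np.empty(size_arr)  # receiver allocates the buffer up front
    comm.Irecv([arrv, MPI.DOUBLE], source=0).Wait()

print('Done!', comm.Get_rank())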