How to fix pickle.unpickling error caused by calls to subprocess.Popen in parallel script that uses mpi4py

Repeated serial calls to subprocess.Popen() inside a script parallelized with mpi4py eventually lead to what looks like data corruption during communication, which shows up as various kinds of pickle.UnpicklingError (I have seen: EOF, invalid unicode character, invalid load key, unpickling stack underflow). It only seems to happen when the amount of data being communicated is large, the number of serial subprocess calls is large, or the number of MPI processes is large.

I can reproduce the error with python>=2.7, mpi4py>=3.0.1 and openmpi>=3.0.0. I ultimately want to communicate Python objects, so I am using the lowercase mpi4py methods. Here is the minimal code that reproduces the error:

#!/usr/bin/env python
from mpi4py import MPI
from copy import deepcopy
import subprocess

nr_calcs           = 4
tasks_per_calc     = 44
data_size          = 55000

# --------------------------------------------------------------------
def run_test(nr_calcs, tasks_per_calc, data_size):

    # Init MPI
    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    comm_size = comm.Get_size()

    # Run Moc Calcs
    icalc = 0
    while True:
        if icalc > nr_calcs - 1: break
        index = icalc
        icalc += 1

        # Init Moc Tasks
        task_list = []
        moc_task = data_size*"x"
        if rank==0:
            task_list = [deepcopy(moc_task) for i in range(tasks_per_calc)]
        task_list = comm.bcast(task_list)

        # Moc Run Tasks
        itmp = rank
        while True:
            if itmp > len(task_list)-1: break
            itmp += comm_size
            proc = subprocess.Popen(["echo", "TEST CALL TO SUBPROCESS"],
                    stdout=subprocess.PIPE, stderr=subprocess.PIPE, shell=False)
            out,err = proc.communicate()

        print("Rank {:3d} Finished Calc {:3d}".format(rank, index))

# --------------------------------------------------------------------
if __name__ == '__main__':
    run_test(nr_calcs, tasks_per_calc, data_size)

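The exact launch command is not shown in the post; on a single node with Open MPI, an invocation along these lines (a guess based on the setup described below, not the author's exact command) matches the scenario:

mpirun -np 44 ./run_test.py
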
Running on a 44-core node with 44 MPI processes, the first 3 "calculations" complete successfully, but on the last loop some processes raise:

Traceback (most recent call last):
  File "./run_test.py", line 54, in <module>
    run_test(nr_calcs, tasks_per_calc, data_size)
  File "./run_test.py", line 39, in run_test
    task_list = comm.bcast(task_list)
  File "mpi4py/MPI/Comm.pyx", line 1257, in mpi4py.MPI.Comm.bcast
  File "mpi4py/MPI/msgpickle.pxi", line 639, in mpi4py.MPI.PyMPI_bcast
  File "mpi4py/MPI/msgpickle.pxi", line 111, in mpi4py.MPI.Pickle.load
  File "mpi4py/MPI/msgpickle.pxi", line 101, in mpi4py.MPI.Pickle.cloads
_pickle.UnpicklingError

Sometimes the UnpicklingError has a descriptor, e.g. invalid load key "x", or EOF Error, invalid unicode character, or unpickling stack underflow.

Edit: the problem seems to go away with openmpi<3.0.0, and when using mvapich2, but it would still be good to understand what is going on.

I ran into the same problem. In my case, I got my code working by installing mpi4py in a Python virtual environment and setting mpi4py.rc.recv_mprobe = False as suggested by Intel: https://software.intel.com/en-us/articles/python-mpi4py-on-intel-true-scale-and-omni-path-clusters
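
For reference, that setting only takes effect if it is applied before MPI is imported. A minimal sketch of how the workaround is wired in (the comments are mine, not code from the linked article):

import mpi4py
mpi4py.rc.recv_mprobe = False   # disable matched probes (MPI_Mprobe) for pickle-based receives

from mpi4py import MPI          # import MPI only after changing mpi4py.rc
comm = MPI.COMM_WORLD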

However, in the end I simply switched to using the uppercase methods Recv and Send with NumPy arrays. They work perfectly fine together with subprocess, and they do not need any additional tricks.
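
As a rough illustration of that approach (a sketch only, not the answerer's code; the byte payload, size and tag are arbitrary), the pickle-based bcast of task_list in the reproducer could be replaced by buffer-based transfers of a fixed-size NumPy array:

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

data_size = 55000
if rank == 0:
    # Fill a byte buffer on the root rank and send it to every other rank.
    task_data = np.full(data_size, ord("x"), dtype=np.uint8)
    for dest in range(1, comm.Get_size()):
        comm.Send([task_data, MPI.BYTE], dest=dest, tag=0)
else:
    # Receive into a pre-allocated buffer; no pickling is involved.
    task_data = np.empty(data_size, dtype=np.uint8)
    comm.Recv([task_data, MPI.BYTE], source=0, tag=0)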