mpi4py compilation error on a SUSE system

After building mpi4py against the server's OpenMPI, I get a runtime error.

 OS: SUSE
 GCC: 4.8.5
 OpenMPI: 1.10.1
 HDF5: 1.8.11
 mpi4py: 2.0.0
 Python: 2.7.9

Environment setup: I am using virtualenv (I don't have admin rights on the server).

(ENV) username@servername:~/test> echo $PATH
/opt/local/tools/hdf5/hdf5-1.8.11_openmpi-1.10.1_gcc-4.8.5/bin:/opt/local/mpi/openmpi/openmpi-1.10.1_gcc-4.8.5/bin:/home/username/test/virtualenv-15.0.3/ENV/bin: [other libs ] :/opt/local/bin:/usr/lib64/mpi/gcc/openmpi/bin:/usr/local/bin:/usr/bin:/bin

(ENV) username@servername:~/test> echo $LD_LIBRARY_PATH
/opt/local/tools/hdf5/hdf5-1.8.11_openmpi-1.10.1_gcc-4.8.5/lib:/opt/local/mpi/openmpi/openmpi-1.10.1_gcc-4.8.5/lib
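
Before building, it is worth checking that the shell really resolves the custom toolchain first (a sanity check added here, not part of the original session):

which mpicc mpiexec   # both should resolve under /opt/local/mpi/openmpi/openmpi-1.10.1_gcc-4.8.5/bin
mpiexec --version     # should report 1.10.1 if the custom build wins in PATH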


(ENV) username@servername:~/test> pip freeze
cycler==0.10.0
Cython==0.24.1
dill==0.2.5
matplotlib==1.5.3
multiprocessing==2.6.2.1
numpy==1.11.1
pyfits==3.4
pyparsing==2.1.9
python-dateutil==2.5.3
pytz==2016.6.1
scipy==0.18.1
six==1.10.0

Building and installing mpi4py:

(ENV) username@servername:~/test> wget https://bitbucket.org/mpi4py/mpi4py/downloads/mpi4py-2.0.0.tar.gz
(ENV) username@servername:~/test> tar xzvf mpi4py-2.0.0.tar.gz
(ENV) username@servername:~/test> cd mpi4py-2.0.0/
(ENV) username@servername:~/test/mpi4py-2.0.0> vim mpi.cfg

In mpi.cfg, I added a section for my custom Open MPI install:

[mpi]
mpi_dir              = /opt/local/mpi/openmpi/openmpi-1.10.1_gcc-4.8.5
mpicc                = %(mpi_dir)s/bin/mpicc
mpicxx               = %(mpi_dir)s/bin/mpicxx
library_dirs         = %(mpi_dir)s/lib
runtime_library_dirs = %(library_dirs)s

Build:

(ENV) username@servername:~/test/mpi4py-2.0.0> python setup.py build --mpi=mpi

Install:

(ENV) username@servername:~/test/mpi4py-2.0.0> python setup.py install
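
As a quick check that the build picked up the intended compilers (my addition, not part of the original session), mpi4py.get_config() reports the configuration recorded at build time without initializing MPI:

python -c "import mpi4py; print(mpi4py.get_config())"
# expected to echo the mpicc/mpicxx paths from the [mpi] section above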

First basic test (OK):

(ENV) username@servername:~/test/mpi4py-2.0.0> mpiexec -n 5 python -m mpi4py helloworld
Hello, World! I am process 0 of 5 on servername.
Hello, World! I am process 1 of 5 on servername.
Hello, World! I am process 2 of 5 on servername.
Hello, World! I am process 3 of 5 on servername.
Hello, World! I am process 4 of 5 on servername.

The second basic test produces an error:

(ENV) username@servername:~/test/mpi4py-2.0.0> python
>>> from mpi4py import MPI
--------------------------------------------------------------------------
Error obtaining unique transport key from ORTE (orte_precondition_transports not present in the environment).

Local host: servername
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is likely to abort.  There are many reasons that a parallel process can fail during MPI_INIT; some of which are due to configuration or environment problems.  This failure appears to be an internal failure; here's some additional information (which may only be relevant to an Open MPI developer):

PML add procs failed
--> Returned "Error" (-1) instead of "Success" (0)
--------------------------------------------------------------------------
*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[servername:165332] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
(ENV) username@servername:~/test/mpi4py-2.0.0> 

Update: this error appears while compiling mpi4py:

checking for library 'lmpe' ...
/opt/local/mpi/openmpi/openmpi-1.10.1_gcc-4.8.5/bin/mpicc -pthread -fno-strict-aliasing -fmessage-length=0 -grecord-gcc-switches -fstack-protector -O2 -Wall -D_FORTIFY_SOURCE=2 -funwind-tables -fasynchronous-unwind-tables -g -DNDEBUG -fmessage-length=0 -grecord-gcc-switches -fstack-protector -O2 -Wall -D_FORTIFY_SOURCE=2 -funwind-tables -fasynchronous-unwind-tables -g -DOPENSSL_LOAD_CONF -fPIC -I/opt/local/mpi/openmpi/openmpi-1.10.1_gcc-4.8.5/include -c _configtest.c -o _configtest.o
/opt/local/mpi/openmpi/openmpi-1.10.1_gcc-4.8.5/bin/mpicc -pthread _configtest.o -L/opt/local/mpi/openmpi/openmpi-1.10.1_gcc-4.8.5/lib -Wl,-R/opt/local/mpi/openmpi/openmpi-1.10.1_gcc-4.8.5/lib -llmpe -o _configtest
/usr/lib64/gcc/x86_64-suse-linux/4.8/../../../../x86_64-suse-linux/bin/ld: cannot find -llmpe
collect2: error: ld returned 1 exit status
failure.

See also: https://bitbucket.org/mpi4py/mpi4py/issues/52/mpi4py-compilation-error

The problem does not appear to be an mpi4py bug (the failed 'lmpe' check above is setup.py probing for the optional MPE logging library, and the build continues regardless); it comes from OpenMPI's PSM transport layer.
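
To verify this on a given build (a diagnostic step added here, not from the original post), ompi_info lists the MTL components compiled into Open MPI; a psm entry means the PSM transport can be selected at startup:

ompi_info | grep -i "mca mtl"   # a "MCA mtl: psm ..." line means the PSM component is present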

In my case, setting

export OMPI_MCA_mtl=^psm

resolved the runtime error described above.
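
To make the workaround persistent, a sketch assuming a bash login shell; the per-run --mca form is the standard Open MPI equivalent of the environment variable:

echo 'export OMPI_MCA_mtl=^psm' >> ~/.bashrc                 # persistent: exclude the psm MTL in every new shell
mpiexec --mca mtl "^psm" -n 5 python -m mpi4py helloworld    # per-run alternative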